Description: The source extension feature in IWX v5.x lets users integrate custom Java code that is applied to source column data before it lands in the data lake. This feature is available for all source types in Infoworks.
Some use cases for source extension are listed below:
a) Data masking/obfuscation to allow users to secure the data.
b) Incremental data generation before ingesting the data into the data lake.
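Use case (a) could be sketched as follows. This is a minimal illustration, not an Infoworks-provided class: the name MaskValue and the keep-the-last-four-characters rule are hypothetical, and it assumes the column value arrives as a String and that a Function<String, String> is accepted in the same way as the Function<String, Timestamp> used later in this article.

```java
import java.io.Serializable;
import java.util.function.Function;

// Hypothetical masking extension: replaces all but the last four
// characters of the column value with '*'.
public class MaskValue implements Function<String, String>, Serializable {

    @Override
    public String apply(String input) {
        // Pass nulls through unchanged so empty source values stay empty.
        if (input == null) {
            return null;
        }
        // Values of four characters or fewer are masked completely,
        // otherwise the last four characters stay readable.
        int visible = input.length() > 4 ? 4 : 0;
        int maskedLength = input.length() - visible;

        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < maskedLength; i++) {
            out.append('*');
        }
        out.append(input, maskedLength, input.length());
        return out.toString();
    }
}
```

With this rule, a ten-digit account number keeps only its last four digits readable, while short values are hidden entirely.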
In the use case below, an auto-generated timestamp column is added for source data that has no incremental timestamp column of its own.
import java.io.Serializable;
import java.sql.Timestamp;
import java.util.Date;
import java.util.function.Function;

public class Timestamp_val implements Function<String, Timestamp>, Serializable {
    @Override
    public Timestamp apply(String input) {
        return new Timestamp(new Date().getTime());
    }
}
Pre-requisites
i) The class that you create should implement the Java Function and Serializable interfaces as shown above.
ii) Serializing an object converts its state to a byte stream that can later be restored into a copy of the object. A Java object is serializable if its class implements the java.io.Serializable interface. This matters here because these objects are sent to the Spark executors during the ingestion job.
iii) Because the class implements the Java Function interface, its logic goes in that interface's single abstract method, apply.
Your code goes inside the apply method.
https://www.geeksforgeeks.org/function-interface-in-java-with-examples/
iv) In the Timestamp_val example above, the class takes a String as input and returns a value of the Timestamp data type.
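The serialization requirement in (ii) can be checked locally before the jar is uploaded. The sketch below is only an assumption about how one might test this: SerializationCheck, roundTrip, and the nested TimestampVal stand-in are hypothetical names, and the round trip uses plain Java object streams rather than anything Spark- or Infoworks-specific.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.sql.Timestamp;
import java.util.Date;
import java.util.function.Function;

public class SerializationCheck {

    // Stand-in for the extension class shown earlier in this article.
    static class TimestampVal implements Function<String, Timestamp>, Serializable {
        @Override
        public Timestamp apply(String input) {
            return new Timestamp(new Date().getTime());
        }
    }

    // Writes the object to a byte stream and reads a copy back.
    // Throws java.io.NotSerializableException if the class (or one of
    // its non-transient fields) does not implement Serializable.
    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Function<String, Timestamp> copy = roundTrip(new TimestampVal());
        // The restored copy should behave like the original.
        System.out.println("Round trip OK, apply returned: " + copy.apply("ignored"));
    }
}
```

If the class held a field whose type is not serializable, the round trip would fail here rather than later during the ingestion job.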
Steps to configure Source Extension:
a) Build the above class and any dependencies you have into a jar file using an IDE such as Eclipse or IntelliJ IDEA.
b) Log in to the IWX web UI and go to Admin>Extensions>Source Extension.
c) Click on Create Source Extension.
d) Provide the required details.
e) Upload the jar from your local machine, then provide the Class Name and set the Alias to apply (the method inside your class).
f) Click Save.
g) Go to the Source where you want to use this source extension.
h) Click the Source Setup option for the source, go to the Source Extensions section, select the name of the extension/function you created, and click Save.
i) Click the Tables tab and select the table for which you want to use this extension.
j) Run a metadata crawl for the table and click the Configuration option.
k) Click the Details option right next to the table.
l) Use the Add Column option to add a new timestamp column. Provide a name for the column and select the class and the function under the Transform Function option.
m) Because the function returns the current timestamp, which is stored in the newly added column, you need to provide the format for the timestamp data, i.e., yyyy-MM-dd HH:mm:ss.SSS
n) Click on the Save Schema option.
o) To check that your source extension is working, click the Sample Data option; it should show sample data for the newly added/derived column, current_ingestion_timestamp.
p) Run the table ingestion and you should see the value (the timestamp when the record is ingested into the data lake) for this column in the data lake.
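The format string from step (m) can be previewed in plain Java. This is only an illustrative sketch: TimestampFormatDemo and its format method are hypothetical names, and it assumes the pattern is interpreted the way java.text.SimpleDateFormat interprets it, which this article does not state.

```java
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class TimestampFormatDemo {

    // Renders a Timestamp with the pattern given in step (m).
    static String format(Timestamp ts) {
        // Locale.ROOT keeps the digits ASCII regardless of the machine's locale.
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS", Locale.ROOT).format(ts);
    }

    public static void main(String[] args) {
        // A fixed value, to show the shape of the output.
        Timestamp fixed = Timestamp.valueOf("2024-01-15 10:30:45.123");
        System.out.println(format(fixed)); // prints 2024-01-15 10:30:45.123

        // The kind of value the extension's apply method returns.
        System.out.println(format(new Timestamp(System.currentTimeMillis())));
    }
}
```

The SSS part of the pattern is what carries the milliseconds, so values in the new column would look like the first printed line.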
Note:
a) You could use this newly added column as a watermark column or as a partition column during the ingestion process and also in the downstream Sync to Target Jobs.
b) This use case uses a CSV source and adds an extra column to the CSV table. This column holds the timestamp at which each record/row of the CSV file is ingested into the data lake.
c) Each class can have only one method, i.e., apply.
d) You can create multiple source extensions and use multiple functions for multiple columns inside a single source table. All the required extensions should be added under the Source>Source Extensions section.
Applicable Infoworks versions:
v5.x