Scenario:
Consider a dataset like the one below:
id,fname,lname,salary
1,Nitin,BS,10
2,Alex,P,20
3,Hrithik,R,25
If the requirement is to increase the salary of all employees by 5%, the custom transformation feature of Infoworks can be used.
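The expected outcome can be sketched in plain Python (an illustration only; the actual transformation runs inside Spark):

```python
# Sample rows from the scenario; a 5% raise adds a salary_updated column.
rows = [
    {"id": 1, "fname": "Nitin", "lname": "BS", "salary": 10},
    {"id": 2, "fname": "Alex", "lname": "P", "salary": 20},
    {"id": 3, "fname": "Hrithik", "lname": "R", "salary": 25},
]
for row in rows:
    row["salary_updated"] = row["salary"] * 1.05

print([row["salary_updated"] for row in rows])  # [10.5, 21.0, 26.25]
```
These are exactly the salary_updated values that show up in the Hive output at the end of this article.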
Steps to perform:
Check that python_custom_executable_path in conf.properties points to the Python version bundled with Infoworks; otherwise, the default Python on the edge node will be used and the job will fail.
Ex:
python_custom_executable_path=/opt/infoworks/resources/python36/bin
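A quick way to verify the setting is to read conf.properties directly. This is a sketch with a hypothetical helper function; the conf.properties location shown in the comment is assumed from a default Infoworks install:

```python
# Minimal parser for a key=value properties file (hypothetical helper).
def read_property(path, key):
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks and comments; match only the exact key.
            if line and not line.startswith("#") and line.startswith(key + "="):
                return line.split("=", 1)[1]
    return None

# Example (path assumed):
# read_property("/opt/infoworks/conf/conf.properties",
#               "python_custom_executable_path")
```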
First, install the API egg.
For Infoworks 3.x:
$ python -m easy_install /opt/infoworks/df/python_scripts/api-1.0.egg
Processing api-1.0.egg
Copying api-1.0.egg to /opt/infoworks/resources/python36/lib/python3.6/site-packages
Adding api 1.0 to easy-install.pth file
Installed /opt/infoworks/resources/python36/lib/python3.6/site-packages/api-1.0.egg
Processing dependencies for api==1.0
Finished processing dependencies for api==1.0
For Infoworks 5.x:
$ python -m easy_install /opt/infoworks/lib/dt/api/python/dt-extensions-python-api-1.0.egg
Processing api-1.0.egg
Copying api-1.0.egg to /opt/infoworks/resources/python36/lib/python3.6/site-packages
Adding api 1.0 to easy-install.pth file
Installed /opt/infoworks/resources/python36/lib/python3.6/site-packages/api-1.0.egg
Processing dependencies for api==1.0
Finished processing dependencies for api==1.0
Create a directory structure on the edge node as shown in the example below:
mkdir -p /home/ec2-user/customtransformations/cusscripts
where cusscripts is the directory in which you can store your custom transformation Python code.
Write your custom transformation code; you can use $IW_HOME/examples/pipeline_extensions/custom_transformations.py as a reference.
Note:
Make sure you write your custom transformation treating input_dataset as a Java object, since Infoworks uses Py4J internally to bridge between Java and Python.
My custom_transformation_example.py code looks like the one below:
from api.custom_transformation import CustomTransformation
from py4j.java_collections import JavaList
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col, when

class CustomTransformationSample(CustomTransformation):
    def __init__(self):
        self._spark_session = None
        self._user_properties = None

    def transform(self, input_dataset_map):
        # Pick the first (and, here, only) input dataset from the map
        for key in input_dataset_map:
            input_dataset = input_dataset_map[key]
            break
        try:
            df_columns = input_dataset.schema()  # schema of the incoming dataset
            # input_dataset is a Java Dataset proxied through Py4J, so the
            # Java Column API (col/multiply) is used rather than PySpark's
            input_dataset = input_dataset.withColumn('salary_updated', input_dataset.col("salary").multiply(1.05))
        except Exception as e:
            raise Exception("Exception while adding 5% to salary column: {}".format(e))
        return input_dataset

    def initialise_context(self, spark_session, user_properties, processing_context):
        self._spark_session = spark_session
        self._user_properties = user_properties
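To see how the Java-style calls (col, multiply, withColumn) fit together without a Spark cluster, here is a toy stand-in for the Java Dataset proxy. These Fake* classes are illustrative only; they are not part of Infoworks, Spark, or Py4J:

```python
# Toy stand-ins mimicking the Java Dataset API that Py4J proxies.
class FakeColumn:
    def __init__(self, expr):
        self.expr = expr

    def multiply(self, factor):
        # Java's Column.multiply returns a new Column expression.
        return FakeColumn("({} * {})".format(self.expr, factor))

class FakeDataset:
    def __init__(self):
        self.added = {}

    def col(self, name):
        return FakeColumn(name)

    def withColumn(self, name, column):
        # Record the derived column and return the dataset for chaining.
        self.added[name] = column.expr
        return self

ds = FakeDataset()
ds = ds.withColumn("salary_updated", ds.col("salary").multiply(1.05))
print(ds.added["salary_updated"])  # (salary * 1.05)
```
This is the same call sequence the transform method above issues against the real Java Dataset object.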
Navigate to the parent directory, i.e.:
cd /home/ec2-user/customtransformations/
Create a setup.py file, which is required to build the Python egg file, as below:
from setuptools import setup, find_packages

setup(
    name="cusscripts",
    version="0.1",
    packages=find_packages()
)
In the above setup.py, I have provided the name as cusscripts because that is the directory that holds the custom transformation code and the one that will be converted to an egg file.
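For reference, setuptools derives the egg filename from the name and version in setup.py plus the Python version running the build, which is how a name like cusscripts-0.1-py3.6.egg comes about:

```python
import sys

# Egg filename convention: <name>-<version>-py<major>.<minor>.egg
name, version = "cusscripts", "0.1"
egg_name = "{}-{}-py{}.{}.egg".format(
    name, version, sys.version_info[0], sys.version_info[1])
# Under the bundled Python 3.6 this yields "cusscripts-0.1-py3.6.egg".
print(egg_name)
```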
Source $IW_HOME/bin/env.sh (this is needed so that the Python path points to the Infoworks Python).
Next, navigate to the cusscripts directory and run touch __init__.py to create an empty init file.
Now jump back to the parent directory (cd /home/ec2-user/customtransformations/) and execute the command below to create the egg file:
python setup.py bdist_egg
You will now have a subdirectory called dist containing the egg file. Install it:
$ python -m easy_install cusscripts-0.1-py3.6.egg
Processing cusscripts-0.1-py3.6.egg
Copying cusscripts-0.1-py3.6.egg to /opt/infoworks/resources/python36/lib/python3.6/site-packages
Adding cusscripts 0.1 to easy-install.pth file
Installed /opt/infoworks/resources/python36/lib/python3.6/site-packages/cusscripts-0.1-py3.6.egg
Processing dependencies for cusscripts==0.1
Finished processing dependencies for cusscripts==0.1
Navigate to the Admin section of the Infoworks UI and click External Scripts, which opens the Pipeline Extensions page.
Click Add an Extension and configure it as follows:
Choose the extension type as custom transformation and the execution type as Python, and give your preferred name and alias for the transformation.
Provide the folder path where your egg file is located. In my case, it is /home/ec2-user/customtransformations/dist
Next, provide the class name in the following format:
<python_package>.<python_module>.<class_name>
In my case, cusscripts.custom_transformation_example.CustomTransformationSample
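The fully qualified name is presumably resolved by splitting it into a module path and a class name; a sketch of that split (the split_class_path helper is illustrative, not an Infoworks API):

```python
def split_class_path(qualified_name):
    # Everything before the last dot is the module path; the rest is the class.
    module_path, _, class_name = qualified_name.rpartition(".")
    return module_path, class_name

module_path, class_name = split_class_path(
    "cusscripts.custom_transformation_example.CustomTransformationSample")
print(module_path)  # cusscripts.custom_transformation_example
print(class_name)   # CustomTransformationSample
```
In other words, the package is the egg's top-level directory, the module is the .py file, and the class is the CustomTransformation subclass defined in it.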
Once you have successfully configured the pipeline extension, make it available to the corresponding domain using the Manage Artifacts option in the domain.
Create your own custom transformation pipeline as shown below:
Configure the custom transformation node as shown below and hit save.
Add the output columns to be mapped, including an additional column salary_updated, since we created one more column during the custom transformation (this is required for mapping).
Finally, configure the target node and build the pipeline.
You should now see the data on the Hive side as follows, with a new column salary_updated reflecting the 5% salary increase:
hive> select * from Custom_transformations_schema.Custom_transformations_table;
OK
3 Hrithik R 25 26.25 e0c39f74ecc3c8b850d5a45ea3c4aafd 2020-03-18 03:07:30.574 2020-03-18 03:07:30.574 I 0
2 Alex P 20 21.0 52bfd5a747947ece3720d283edbc3954 2020-03-18 03:07:30.574 2020-03-18 03:07:30.574 I 0
1 Nitin BS 10 10.5 a17c95d9db02cf15c1a4e352edee4e4c 2020-03-18 03:07:30.574 2020-03-18 03:07:30.574 I 0
Time taken: 10.489 seconds, Fetched: 3 row(s)
hive> describe Custom_transformations_schema.Custom_transformations_table;
OK
id int
fname string
lname string
salary int
salary_updated double
ziw_row_id string
ziw_created_timestamp timestamp
ziw_updated_timestamp timestamp
ziw_status_flag string
ziw_sec_p int