Infoworks 5.0 provides the ability to launch a persistent cluster from the Infoworks UI. However, if one wants to map an existing cluster to Infoworks persistent compute, one needs to follow the steps below.

Steps:

  1. Create a cluster template through the Infoworks UI. (The cluster created in this step can be terminated manually from the cloud console, or the platform service can be stopped before this step to prevent cluster creation.)

  2. Run the below commands

    cd {IW_HOME}/bin 
    source env.sh


  3. Add the cluster details to the attached cluster_details.json. (Here the user must enter the environment_id and cluster_name from the Infoworks UI and the corresponding cluster ID of the externally created cluster; see the example after these steps.)

  4. Run the script:

    python add_cluster_details.py


    Note: Ensure that the .json file and the .py file are in the same location.
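
For reference, below is a minimal sketch of what cluster_details.json may contain. The exact schema is defined by the attached file; the field names and the list structure shown here are assumptions, and the values are placeholders to be replaced with the environment_id and cluster_name from the Infoworks UI and the cluster ID of the externally created cluster.

[
   {
      "environment_id":"<environment_id from the Infoworks UI>",
      "cluster_name":"<cluster_name from the Infoworks UI>",
      "cluster_id":"<cluster ID of the externally created cluster, e.g. an EMR cluster ID such as j-XXXXXXXXXXXXX>"
   }
]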



Note:
When a user launches a persistent cluster using Infoworks, it is launched with certain configurations that can be set in the Infoworks UI, and these configurations are necessary for the smooth completion of jobs. Hence, we suggest that the user ensure the existing cluster (created outside Infoworks) has all the default cluster configurations before mapping it as Infoworks persistent compute using the above steps.

For example, if one is launching an EMR cluster (outside Infoworks), one needs to ensure that it has at least the below configurations.

One can use this as a template and add additional configurations if needed. 

[
   {
      "classification":"hive-site",
      "properties":{
         "javax.jdo.option.ConnectionUserName":"emradmin",
         "javax.jdo.option.ConnectionDriverName":"org.mariadb.jdbc.Driver",
         "javax.jdo.option.ConnectionPassword":"********",
         "javax.jdo.option.ConnectionURL":"jdbc:mysql://hue.xxxxxxxxx.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true",
         "hive.metastore.warehouse.dir":"s3://bucket_name/hive"
      },
      "configurations":[
         
      ]
   },
   {
      "classification":"spark-defaults",
      "properties":{
         "spark:spark.executorEnv.IW_TAG":"\"EMR\"",
         "spark.sql.parquet.fs.optimized.committer.optimization-enabled":"false",
         "spark.sql.warehouse.dir":"s3://bucket_name/hive"
      },
      "configurations":[
         
      ]
   },
   {
      "classification":"spark-env",
      "properties":{
         
      },
      "configurations":[
         {
            "classification":"export",
            "properties":{
               "PYTHONPATH":"/usr/lib/spark/python/lib/py4j-src.zip:/usr/lib/spark/python/lib/pyspark.zip"
            },
            "configurations":[
               
            ]
         }
      ]
   }
]
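
As an illustration, the template above can be saved to a file (for example, configurations.json) and passed to the AWS CLI when creating the EMR cluster outside Infoworks. The cluster name, EMR release label, instance type, and instance count below are placeholders; choose values supported by your environment and by Infoworks 5.0.

    aws emr create-cluster \
        --name "iw-persistent-cluster" \
        --release-label emr-6.3.0 \
        --applications Name=Hadoop Name=Hive Name=Spark \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --configurations file://configurations.json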



Applicable Infoworks versions: 5.0