Databricks ingestion jobs submitted by Infoworks DF can time out if they run longer than a configurable limit. The following messages in the job log indicate the timeout:

[INFO] 2020-10-20 10:39:44,500 [pool-1-thread-1] io.infoworks.platform.databricks.util.MetaDBUtil:114 :: Marked databricks run '8231' as 'TIMEDOUT'
[INFO] 2020-10-20 10:39:44,501 [pool-1-thread-1] io.infoworks.platform.databricks.monitor.DatabricksJobMonitor:83 :: runId:8231, State: {"life_cycle_state":"TERMINATING","result_state":"TIMEDOUT","state_message":"Terminating the spark cluster"}
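
As a rough illustration, the timeout can also be detected from a job log programmatically. The sketch below assumes that the monitor lines always follow the format of the sample above (runId, then a JSON state payload at the end of the line). The function name find_timed_out_runs is hypothetical and is not part of Infoworks.

```python
import json
import re

# Matches the monitor line format shown in the sample log above:
#   ... :: runId:<id>, State: {<json>}
STATE_LINE = re.compile(r"runId:(\d+), State: (\{.*\})\s*$")


def find_timed_out_runs(log_lines):
    """Return the run IDs whose reported result_state is TIMEDOUT."""
    timed_out = []
    for line in log_lines:
        match = STATE_LINE.search(line)
        if not match:
            continue  # not a monitor state line
        try:
            state = json.loads(match.group(2))
        except json.JSONDecodeError:
            continue  # skip lines whose state payload is not valid JSON
        if state.get("result_state") == "TIMEDOUT":
            timed_out.append(match.group(1))
    return timed_out
```

Passing the lines of a job log to this function returns the list of run IDs that were marked as timed out; an empty list means no timeout was reported in that log.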

In this scenario, you can increase the parallelism so that the job finishes within the timeout, increase the timeout value itself, or do both.

Instructions to optimise job performance

Step 1: Select Split By Column
Step 2: Increase the number of max connections to number of distinct splits
Step 3: Increase number of max allowed workers in the cluster template.

Step 1: Select Split By Column

Select a split-by column for the table.
This allows the table to be crawled in parallel using multiple connections. For more details, refer to Optimisation Configuration in https://docs.infoworks.io/data-ingestion/ingestion-process#configuring-tables. Split-by is most effective when you choose a column that distributes the data evenly.
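
One way to judge how evenly a candidate column distributes the data is to compare the largest split with the average split. The sketch below assumes you have already obtained row counts per split value (for example from a GROUP BY query on the source); the function name split_skew is hypothetical and not an Infoworks feature.

```python
def split_skew(rows_per_split):
    """Return the largest split size divided by the mean split size.

    A value near 1.0 means the column distributes rows evenly. A large
    value means one split holds far more rows than the others.
    """
    if not rows_per_split:
        raise ValueError("rows_per_split must not be empty")
    counts = list(rows_per_split.values())
    mean = sum(counts) / len(counts)
    if mean == 0:
        raise ValueError("all splits are empty")
    return max(counts) / mean
```

For example, two splits of 100 rows each give a ratio of 1.0, while splits of 300 and 100 rows give 1.5. The higher the ratio, the more the largest split is likely to dominate the total run time.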

Step 2: Increase the number of max connections to number of distinct splits

Increase the max connections to the source.

For example, if the split-by column divides the table into 24 splits, you can set the max connections to 24 so that each connection fetches the data for one split.

Step 3: Increase number of max allowed workers in the cluster template.

Increase the number of workers in the cluster template you are using for ingestion.

For example, if you have 24 splits/connections, then to enable full parallelism you need 24 vCPUs, which is equal to 3 Standard_DS13_v2 nodes (8 vCPUs each). This ensures that the data coming from the 24 parallel connections is processed in parallel, with one vCPU serving each connection.
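
The sizing arithmetic above can be sketched as follows. The helper name workers_needed is hypothetical, and the rule of one vCPU per connection is taken from the example above.

```python
import math


def workers_needed(num_splits, vcpus_per_node):
    """Smallest number of worker nodes giving one vCPU per split/connection."""
    if num_splits <= 0 or vcpus_per_node <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(num_splits / vcpus_per_node)


# Standard_DS13_v2 provides 8 vCPUs per node, so 24 splits need 3 workers.
print(workers_needed(24, 8))  # 3
```

Rounding up matters when the split count is not a multiple of the node size: 25 splits on 8-vCPU nodes would need 4 workers.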

Alternatively, you can increase the timeout to a higher value.

The current timeout can be seen in the following file and can be edited there to a higher value:

cat /opt/infoworks/conf/databricks_ingestion_defaults.json | grep timeout_seconds

The property below can be set to a higher value (for example, 20 hours = 72000 seconds):
"timeout_seconds": 72000
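
If you prefer to script the change, the following is a minimal sketch. It assumes only that the file is valid JSON and contains one or more "timeout_seconds" keys at some depth; the exact layout of the file is not known here, and the function name set_timeout is hypothetical. Back up the file before writing any changes.

```python
import json


def set_timeout(config, hours):
    """Set every "timeout_seconds" key in a nested config; return the count."""
    seconds = int(hours * 3600)
    changed = 0
    if isinstance(config, dict):
        for key, value in config.items():
            if key == "timeout_seconds":
                config[key] = seconds
                changed += 1
            else:
                changed += set_timeout(value, hours)
    elif isinstance(config, list):
        for item in config:
            changed += set_timeout(item, hours)
    return changed


# Hypothetical stand-in for the file contents; the real layout may differ.
sample = json.loads('{"ingestion": {"timeout_seconds": 36000}}')
set_timeout(sample, 20)
print(sample["ingestion"]["timeout_seconds"])  # 72000
```

To apply this to the real file, load it with json.load, call set_timeout, and write the result back with json.dump.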

Applicable versions: 3.2, 4.2, 4.2.0.1

Ingestion job TIMEDOUT after running for long time