Problem Description

When running incremental ingestion jobs or pipeline jobs, you may see errors like the one below, related to broadcast joins, in the failed Databricks job logs.

org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=4294967296. You can disable broadcasts for this query using set spark.sql.autoBroadcastJoinThreshold=-1


As the exception message suggests, there are two options: increase the driver's maximum result size (spark.driver.maxResultSize) or disable automatic broadcast joins (spark.sql.autoBroadcastJoinThreshold=-1).
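To put the sizes in perspective, the default limit shown in the error (4294967296 bytes) is exactly 4 GiB, and the raised value used below (8294967296 bytes) is roughly 7.7 GiB. A quick sketch of the conversion:

```python
def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (GiB)."""
    return n_bytes / (1024 ** 3)

# Default spark.driver.maxResultSize from the error message: exactly 4 GiB.
print(to_gib(4294967296))              # 4.0
# The raised value suggested below: roughly 7.73 GiB.
print(round(to_gib(8294967296), 2))    # 7.73
```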

You can set the advanced configurations below at the table level or at the pipeline level.

For ingestion tables,

key: ingestion_spark_configs

value: spark.driver.maxResultSize=8294967296  

For pipelines,

key: spark.driver.maxResultSize

value: 8294967296 
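Outside of Infoworks, the same two settings can be passed to a Spark job directly. The sketch below is a hypothetical spark-submit invocation (the application JAR name is a placeholder), not an Infoworks-specific command:

```shell
# Hypothetical example: raise the driver result-size cap to ~8 GB and
# disable automatic broadcast joins for a standalone Spark job.
spark-submit \
  --conf spark.driver.maxResultSize=8294967296 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  your-application.jar
```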

Applicable Infoworks Data Foundry Versions

IWX 4.x