Problem Description
When running incremental ingestion jobs or pipeline jobs, you might see an error like the one below, related to broadcast joins, in the failed Databricks logs.
org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=4294967296. You can disable broadcasts for this query using set spark.sql.autoBroadcastJoinThreshold=-1
Solution
As the exception itself suggests, there are two options: increase the driver's maximum result size, or disable broadcast joins for the query.
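For reference, both options correspond to standard Spark settings. The sketch below is a minimal example assuming a plain PySpark session (outside the Infoworks configuration described next): spark.sql.autoBroadcastJoinThreshold can be changed on a running session, whereas spark.driver.maxResultSize is a driver property and generally needs to be supplied before the driver starts, for example through the cluster's Spark configuration or the session builder.

from pyspark.sql import SparkSession

# Option 1: raise the driver result-size limit. This is a driver property,
# so it must be in place when the driver starts (cluster Spark config or
# session builder); setting it on an already-running session has no effect.
spark = (
    SparkSession.builder
    .appName("broadcast-join-fix")                # hypothetical app name
    .config("spark.driver.maxResultSize", "8g")   # roughly 8 GB, up from the 4 GB limit in the error
    .getOrCreate()
)

# Option 2: disable automatic broadcast joins, as the exception message suggests.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)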
Within Infoworks, you can set the advanced configurations below at the table level or at the pipeline level.
For ingestion tables,
key: ingestion_spark_configs
value: spark.driver.maxResultSize=8294967296
For pipelines,
key: spark.driver.maxResultSize
value: 8294967296
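After the table or pipeline is re-run with these configs, a quick way to confirm the values took effect is to read them back from the Spark session, for example in a Databricks notebook attached to the same cluster. This is a minimal sketch assuming the 8294967296-byte value above and, optionally, the broadcast-disable setting.

# Driver property, read from the application-level Spark conf
print(spark.sparkContext.getConf().get("spark.driver.maxResultSize"))

# SQL session property; -1 means automatic broadcast joins are disabled
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))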
Applicable Infoworks Data Foundry Versions
IWX 4.x