Problem Description:


Ingestion job fails with the below error in Infoworks DataFoundry 3.X,4.X,

Caused by: java.util.concurrent.ExecutionException: org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=4294967296. You can disable broadcasts for this query using set spark.sql.autoBroadcastJoinThreshold=-1
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:206)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:182)
    ... 227 more


Root Cause:


This is due to a limitation with Spark’s size estimator.

If the estimated size of one of the DataFrames is less than the autoBroadcastJoinThreshold, spark may use BroadCastHashJoin to perform the join. If the available nodes do not have enough resources to accommodate the broadcast DataFrame, your job fails due to an out-of-memory error.


Solution:


To resolve this we can increase the value of spark.driver.maxResultSize by setting below advanced configs at the table or the source level.
key: ingestion_spark_configs
value:  
spark.driver.maxResultSize=8294967296


Applicable versions:

IWX 3.2,4.0,4.2.