Description:


Pipeline builds fail with an OutOfMemoryError: GC overhead limit exceeded error. A sample error log looks like the one below:


20/10/13 13:30:22 ERROR Utils: uncaught error in thread task-scheduler-preemption, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded
20/10/13 13:30:22 WARN HeartbeatReceiver: Removing executor 0 with no recent heartbeats: 559221 ms exceeds timeout 120000 ms
20/10/13 13:30:22 ERROR FileFormatWriter: Aborting job bdc149a8-1a84-4595-b94a-1e8144ce22fb.
java.lang.OutOfMemoryError: GC overhead limit exceeded
20/10/13 13:30:22 WARN TransportChannelHandler: Exception in connection from /10.100.247.19:45742
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:805)
    at java.io.ObjectInputStream.readClassDescriptor(ObjectInputStream.java:925)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1914)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1808)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2099)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)


Root Cause:


This happens when too many records land in a single task because the data is skewed.
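
To confirm that skew is the cause, you can inspect how records are distributed across partitions and key values. The sketch below is a minimal example in Scala; the DataFrame name df and the column name key are assumptions for illustration only.

import org.apache.spark.sql.functions.{spark_partition_id, count, col, desc}

// Count records per Spark partition; a few partitions holding most of the
// rows indicates skew in the current partitioning.
df.groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy(desc("count"))
  .show(20)

// Count records per key value; a handful of very frequent key values is the
// usual source of the skew.
df.groupBy(col("key"))
  .agg(count("*").alias("records"))
  .orderBy(desc("records"))
  .show(20)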


Solution:


Choosing a proper partition key helps here. If the job still fails with the same error even after repartitioning the data, use higher-capacity (more RAM) machines in the cluster templates.
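
As an illustration, repartitioning on a more evenly distributed key together with larger executor memory might look like the sketch below. The column name event_date, the partition count, the memory values, and the input/output paths are assumptions for illustration only; size the memory settings to the machines available in the cluster template.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumed memory settings; adjust to the cluster template's machine sizes.
val spark = SparkSession.builder()
  .appName("skew-repartition-example")
  .config("spark.executor.memory", "8g")
  .config("spark.driver.memory", "4g")
  .getOrCreate()

val df = spark.read.parquet("/path/to/input")   // hypothetical input path

// Repartition on a column with an even value distribution so records
// spread across many tasks instead of piling into one.
val repartitioned = df.repartition(200, col("event_date"))

repartitioned.write.mode("overwrite").parquet("/path/to/output")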


Applicable versions:


IWX 3.2 onwards