Problem Description


A pipeline build job in IWX on EMR fails with the following ERROR in the YARN logs.


21/03/15 13:41:38 WARN YarnAllocator: Container from a bad node: container_e27_1615796965776_0239_02_000012 on host: ip-10-45-74-232.ec2.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137

Container exited with a non-zero exit code 137

Killed by an external signal

.

21/03/15 13:41:38 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 8 for reason Container from a bad node: container_e27_1615796965776_0239_02_000012 on host: ip-10-45-74-232.ec2.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137

Container exited with a non-zero exit code 137

Killed by an external signal

.

21/03/15 13:41:38 ERROR YarnClusterScheduler: Lost executor 8 on ip-10-45-74-232.ec2.internal: Container from a bad node: container_e27_1615796965776_0239_02_000012 on host: ip-10-45-74-232.ec2.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137

Container exited with a non-zero exit code 137

Killed by an external signal

.

21/03/15 13:41:38 WARN TaskSetManager: Lost task 0.3 in stage 14.0 (TID 9264, ip-10-45-74-232.ec2.internal, executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Container from a bad node: container_e27_1615796965776_0239_02_000012 on host: ip-10-45-74-232.ec2.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137

Container exited with a non-zero exit code 137



Root cause

This issue originates on the EMR side and can occur when spark.driver.memory or spark.executor.memory is set too low. When a container (Spark executor) runs out of memory, YARN automatically kills it.


This produces the "Container killed on request. Exit code is 137" error (exit code 137 = 128 + 9, meaning the process was terminated by SIGKILL). These errors can occur at different job stages, in both narrow and wide transformations.


Reference: https://aws.amazon.com/premiumsupport/knowledge-center/container-killed-on-request-137-emr/



Solution


Try increasing the values of spark.executor.memory and spark.driver.memory by adding the following configurations under Pipeline > Settings > Advanced Configurations.

You can check the current values of these parameters in the pipeline build job log or in the spark-defaults.conf file on the IWX edge node.

For instance, add a key for spark.executor.memory:

key: spark.executor.memory
value: 8g

Add another key for spark.driver.memory:

key: spark.driver.memory
value: 8g
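For reference, these Advanced Configuration keys map onto the same Spark properties you would otherwise set in spark-defaults.conf; a minimal sketch follows (8g is an example size, not a recommendation for every cluster):

```
spark.executor.memory  8g
spark.driver.memory    8g
```

Pick values that fit within the YARN container sizes of your EMR instance types; setting them higher than the node can accommodate will cause allocation failures instead.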

Run the pipeline build job again. This should resolve the issue.


Applicable IWX on EMR versions:

v3.1.x-emr