Problem Description


Pipeline export to BigQuery fails with Java heap space errors while running the Hive query below.


select max(ziw_updated_timestamp) from `gcp_tmp_cust_s`.`citi_mix_aapscm_na_trn_dext_frciti`

[INFO] 2019-04-04 11:14:02,892 [MaintenanceTimer-1-thread-1] com.mongodb.diagnostics.logging.SLF4JLogger:71 :: Closed connection [connectionId{localValue:2, serverValue:15022302}] to lp000xshdw0002.federated.fds:27017 because it is past its maximum allowed idle time.

[ERROR] 2019-04-04 11:31:21,923 [pool-3-thread-1] infoworks.tools.hive.HiveUtils:466 :: Error while trying to execute hive queries java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1553724495223_9558_1_00, diagnostics=[Vertex vertex_1553724495223_9558_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: citi_mix_aapscm_na_trn_dext_frciti initializer failed, vertex=vertex_1553724495223_9558_1_00 [Map 1], java.lang.RuntimeException: serious problem

at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1277)

at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1304)

at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:311)

at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:413)

at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:155)

at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:273)

at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:266)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)

at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:266)

at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

at java.lang.Thread.run(Thread.java:748)

Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space


Root cause: 


This issue occurs when the query fails while generating ORC splits; the stack trace above shows the OutOfMemoryError raised from OrcInputFormat.generateSplitsInfo. The default ORC split strategy in Hive is HYBRID.


In HYBRID mode, Hive reads the footers of all files if there are fewer files than the expected mapper count, and switches to generating one split per file if the average file size is smaller than the default HDFS block size. The ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits quickly without reading any data from HDFS. Because BI skips the footer reads, forcing it avoids the split-generation step in which the heap space error is raised.
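The strategy selection described above can be sketched as a small decision function. This is only an illustration that follows the wording of the paragraph, not the Hive source: the function name choose_orc_split_strategy, its parameters, and the 128 MiB default block size are assumptions made for the example.

```python
# Assumed default HDFS block size (128 MiB); real clusters may differ.
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def choose_orc_split_strategy(configured, num_files, total_bytes,
                              expected_mappers, block_size=DEFAULT_BLOCK_SIZE):
    """Return "BI" or "ETL" for a configured strategy of BI, ETL or HYBRID.

    Simplified model of the behaviour described in the text (hypothetical
    helper, not part of Hive or Infoworks).
    """
    strategy = configured.upper()
    if strategy in ("BI", "ETL"):
        # An explicit setting is always honoured.
        return strategy
    if strategy != "HYBRID":
        raise ValueError(f"unknown split strategy: {configured}")

    # HYBRID: fewer files than expected mappers -> read footers (ETL).
    if num_files < expected_mappers:
        return "ETL"

    # HYBRID: small files on average -> one split per file (BI).
    avg_file_size = total_bytes / num_files if num_files else 0
    if avg_file_size < block_size:
        return "BI"
    return "ETL"


MIB = 1024 * 1024
# Few files: HYBRID reads every footer.
print(choose_orc_split_strategy("HYBRID", 10, 10 * 512 * MIB, 100))   # ETL
# Many small files: HYBRID falls back to per-file splits.
print(choose_orc_split_strategy("HYBRID", 50_000, 50_000 * MIB, 100))  # BI
# Explicit BI never reads footers, whatever the file layout.
print(choose_orc_split_strategy("BI", 10, 10 * 512 * MIB, 100))        # BI
```

Under this model, a table can land on the footer-reading path in HYBRID mode, while an explicit BI setting always bypasses it.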



Solution:


a) Log in to the Infoworks edge node as the Infoworks user.

b) Go to the $IW_HOME/conf directory.

c) Open the conf.properties file and append the Hive configuration below to the end of hiveConfigurationVariables:

     hive.exec.orc.split.strategy=BI


Before change


hiveConfigurationVariables=hive.execution.engine=tez;hive.auto.convert.join=true;hive.insert.into.multilevel.dirs=true;hive.mapred.supports.subdirectories=true;mapred.input.dir.recursive=true;hive.exec.parallel=true;hive.merge.mapfiles=true;


After change


hiveConfigurationVariables=hive.execution.engine=tez;hive.auto.convert.join=true;hive.insert.into.multilevel.dirs=true;hive.mapred.supports.subdirectories=true;mapred.input.dir.recursive=true;hive.exec.parallel=true;hive.merge.mapfiles=true;hive.exec.orc.split.strategy=BI
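The edit in step c can also be scripted. The sketch below is an illustration only, not an Infoworks tool: the function name add_hive_setting is hypothetical, and it assumes the entries in hiveConfigurationVariables are separated by semicolons, as in the lines above. The shortened line in the example is also illustrative.

```python
PREFIX = "hiveConfigurationVariables="


def add_hive_setting(line, setting):
    """Append `setting` ("key=value") to a hiveConfigurationVariables line.

    Lines that do not start with the property name are returned unchanged.
    If the key is already present, its old value is replaced, so running
    the function twice does not duplicate the entry.
    """
    if not line.startswith(PREFIX):
        return line

    key = setting.split("=", 1)[0]
    # Split on ';' and drop empty pieces (e.g. from a trailing semicolon).
    entries = [e for e in line[len(PREFIX):].strip().split(";") if e]
    # Remove any existing entry for the same key, then append the new one.
    entries = [e for e in entries if e.split("=", 1)[0] != key]
    entries.append(setting)
    return PREFIX + ";".join(entries)


before = ("hiveConfigurationVariables=hive.execution.engine=tez;"
          "hive.merge.mapfiles=true;")
print(add_hive_setting(before, "hive.exec.orc.split.strategy=BI"))
```

To apply it, one would read conf.properties, pass each line through add_hive_setting, and write the result back, keeping a backup copy of the original file first.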



Note:


This issue is seen when the execution engine is set to tez for the pipeline export job in Infoworks ADE.


Versions Applied:


Infoworks ADE v2.4.x, 2.5.x, 2.6.x, 2.7.x