Problem Description:


During the pipeline build process, a row count query is run on the pipeline target table so that the record count can be shown in the job execution progress. The job log shows the statement:


[awb-t-0]:[19:32:20,359] [DEBUG] [HiveJdbcSession] (HiveJdbcSession.java:40) - Executing statement: SELECT COUNT(*) AS `ROW_COUNT` FROM `osipi_hv`.`historianosimapr_archive`



This query can take a long time when the execution engine for the pipeline is MapReduce (MR).


Root cause:


MR as an execution engine is slow when the target table consists of many files and folders. Tez has an optimized implementation of aggregation, so aggregation queries such as this row count are unlikely to take as long on Tez.



Solution:

The following steps can improve the job performance.


a) Disable this row count query execution

 
Set the pipeline advanced configuration "df_target_rowcount_enabled" to false (the default is true). The count(*) query is then not run during the pipeline build job. The trade-off is that the job execution will not show how many records were inserted into the MapR-DB target table.
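
In the same key/value form used for advanced configurations elsewhere in this article, the setting would be entered as:

Key: df_target_rowcount_enabled

Value: false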


b) Try Tez as the execution engine


Set the execution engine to Tez by adding the following advanced configuration.


Key: df_batch_hive_settings

Value: hive.execution.engine=tez
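
To check whether the engine is really what slows the count down, the same statement can be timed by hand under each engine. The following is only a sketch. It assumes you have a Hive session (for example through Beeline) on the same cluster, and that both engines are available there. It uses the table name from the log entry above. These session-level SET commands affect only the manual session, not the pipeline job.

```sql
-- Print the engine currently in effect for this session
SET hive.execution.engine;

-- Time the row count on MapReduce
SET hive.execution.engine=mr;
SELECT COUNT(*) AS `ROW_COUNT` FROM `osipi_hv`.`historianosimapr_archive`;

-- Time the same row count on Tez
SET hive.execution.engine=tez;
SELECT COUNT(*) AS `ROW_COUNT` FROM `osipi_hv`.`historianosimapr_archive`;
```

If the Tez run is clearly faster, applying option b) to the pipeline is likely to help. If both runs are slow, option a) may be the more practical choice.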



Applicable Infoworks Versions:


IWX v2.7.x, v2.8.x