Scenario 1: The row count of the table replicated on Dataproc Hive does not match the row count of the table in on-premise HDP Hive.

When a user submits a "select count(*) from table_name" query, Hive answers it directly from the statistics stored in the metastore if hive.compute.query.using.stats is true.
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics
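The statistics Hive would use can be inspected with DESCRIBE FORMATTED; the numRows value under Table Parameters is what a stats-based count(*) returns (table_name is a placeholder to be replaced with the actual table name):

hive> describe formatted table_name;

If numRows is missing, -1, or stale, a stats-based count will not reflect the actual data.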


Due to this, one may still get stale row counts from the past on both HDP Hive and Dataproc Hive.

To force Hive to actually compute the count at query time, set the property below and re-run the query:
hive>set hive.compute.query.using.stats=false;

Please re-validate the row counts on GCP and on-premise HDP after setting the above property in Hive.
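Alternatively, the stored statistics themselves can be refreshed so that future stats-based counts are accurate again. A sketch, with table_name as a placeholder:

hive> analyze table table_name compute statistics;

This recomputes table-level statistics (including numRows) and updates the metastore.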


If the row counts still do not match and the on-premise table has been dropped and re-created, please try the solution mentioned in
https://support.infoworks.io/a/solutions/articles/14000140405

Scenario 2: The row count of the table replicated on Dataproc Hive matches the row count of the table in on-premise HDP Hive. However, this row count does not match the row count in BigQuery after running the Infoworks pipeline.


The Infoworks pipeline reads the data using Spark before writing to BigQuery. In some corner cases the row counts can differ due to inherent differences between Spark and Hive.


To validate this:

1. Log in to the Hive shell on Dataproc and capture the row count of the table:

set hive.compute.query.using.stats=false;
select count(*) from table_name ;


2. Log in to the spark-sql shell on Dataproc and capture the row count of the table:

spark-sql
set hive.compute.query.using.stats=false;
select count(*) from table_name ;

If these counts do not match, reach out to Infoworks support with the table DDL and the Infoworks pipeline build logs.

3. If the counts match, please validate whether any partition expiration is set on the BigQuery table:

SELECT * FROM `project_id.dataset_id.INFORMATION_SCHEMA.TABLE_OPTIONS` WHERE table_name = 'my_table';
Note: project_id, dataset_id and my_table are placeholders that need to be replaced with actual values.
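To check specifically for a partition expiration, the query can be narrowed to the relevant option (project_id, dataset_id and my_table are again placeholders):

SELECT option_name, option_value
FROM `project_id.dataset_id.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE table_name = 'my_table'
  AND option_name = 'partition_expiration_days';

If a row is returned, partitions older than the configured number of days are dropped automatically, which would lower the row count on the BigQuery side relative to Hive.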