Problem statement: A pipeline that uses replicated tables as source tables may fail with an error similar to the one shown below.

Caused by: java.lang.ClassCastException: org.apache.orc.storage.serde2.io.DateWritable cannot be cast to org.apache.hadoop.io.Text


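Root cause: The error occurs when the schema defined in the table DDL does not match the schema of the underlying ORC files. In the example above, the ORC files most likely hold a date value for a column that the table DDL declares as a string type. For replicated tables, this mismatch can arise in the following scenarios.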
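
As an illustration of the mismatch described above, here is a minimal Python sketch that compares two column-to-type mappings and reports the columns whose types differ. The function name find_type_mismatches and the sample column names are hypothetical. How the two schemas are collected is outside the sketch; one possibility is to read the DDL types from describe formatted and the file types from an ORC file dump.

```python
from typing import Dict, List, Tuple


def find_type_mismatches(
    ddl_schema: Dict[str, str], orc_schema: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (column, ddl_type, orc_type) for every column whose types differ.

    Both arguments map column name -> type name. Columns that are absent
    from the ORC schema are skipped, because this sketch only looks for
    type conflicts, not missing columns.
    """
    mismatches = []
    for column, ddl_type in ddl_schema.items():
        orc_type = orc_schema.get(column)
        # Compare case-insensitively so "STRING" and "string" are treated alike.
        if orc_type is not None and orc_type.lower() != ddl_type.lower():
            mismatches.append((column, ddl_type, orc_type))
    return mismatches


# Hypothetical example: the DDL says string, the ORC files contain date.
ddl = {"id": "bigint", "event_date": "string"}
orc = {"id": "bigint", "event_date": "date"}
for column, ddl_type, orc_type in find_type_mismatches(ddl, orc):
    print(f"{column}: DDL type {ddl_type}, ORC file type {orc_type}")
```

A column reported by this check is a candidate for the kind of ClassCastException shown in the problem statement.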

Scenario 1: The on-premise table was dropped and re-created with a different schema (for example, the data type of a column was changed). To verify this, compare the CreateTime of the on-premise table (from the command below) with the CreateTime of the replicated table on Dataproc.

command: describe formatted <tablename>;
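
The CreateTime comparison can be sketched in Python as follows. This is only an illustration under several assumptions: the describe formatted output contains a line such as "CreateTime: Wed Mar 06 09:00:00 UTC 2024", both clusters report the time in the same time zone (the zone token is ignored), and an on-premise CreateTime later than the Dataproc one is read as a sign that the source table was re-created after replication. The function names are hypothetical.

```python
import re
from datetime import datetime

# Matches e.g. "CreateTime:   Wed Mar 06 09:00:00 UTC 2024" (assumed format).
CREATE_TIME_RE = re.compile(
    r"CreateTime:\s+(\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2}) (\S+) (\d{4})"
)


def parse_create_time(describe_output: str) -> datetime:
    """Extract CreateTime from the text output of describe formatted."""
    match = CREATE_TIME_RE.search(describe_output)
    if match is None:
        raise ValueError("no CreateTime line found in the output")
    stamp, _zone, year = match.groups()
    # The time zone token is dropped; both inputs are assumed to share it.
    return datetime.strptime(f"{stamp} {year}", "%a %b %d %H:%M:%S %Y")


def source_recreated_after_replica(onprem_output: str, dataproc_output: str) -> bool:
    """True if the on-premise table is newer than the replicated table."""
    return parse_create_time(onprem_output) > parse_create_time(dataproc_output)


# Hypothetical outputs captured from each cluster.
onprem = "CreateTime:         \tWed Mar 06 09:00:00 UTC 2024"
dataproc = "CreateTime:         \tMon Jan 15 12:30:00 UTC 2024"
print(source_recreated_after_replica(onprem, dataproc))
```

If the check returns True, the solution below is a reasonable next step; if the two clusters use different time zones, the timestamps need to be normalized before comparing.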


Solution:

1. Drop the replicated table from Dataproc.

drop table <tablename>;

2. Re-run the replication for the table with the config below.

Key: TRUNCATE_OVERWRITE

Value: true