Description:


Sometimes we need to read the contents of a customer's ORC or Parquet files while debugging issues related to special characters or encoding.


Solution: 


In such scenarios, we can ask the customer to share the files under the HDFS location on which the Hive table is created.


For ORC files


a) Run show create table <table_name>.

b) From the output, get the HDFS location on which the Hive table is created.

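c) Run the command hive --orcfiledump /iw/sources/TD_test/5cebba060867330202f7a513/merged/orc to read the ORC files at that location. By default this prints the file metadata; with the -d option (Hive 1.1.0 and later) it prints the row data instead.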


d) Redirect the output to a text file, or pipe it to grep to search for particular content.
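The ORC steps above can be sketched as a small shell helper. This is a minimal sketch, not a definitive script: the function names extract_location and dump_orc_table are hypothetical, and the parsing assumes the Hive CLI prints the quoted path on the line immediately after the LOCATION keyword. It passes -d to --orcfiledump (available in Hive 1.1.0 and later) so that row data is printed rather than metadata.

```shell
#!/bin/sh
# Hypothetical helper: print the path that follows the LOCATION keyword
# in "show create table" output read from stdin.
# Assumes the quoted path is on the line right after LOCATION.
extract_location() {
    sed -n "/^LOCATION/{n;s/^[[:space:]]*'//;s/'[[:space:]]*\$//;p;q;}"
}

# Hypothetical helper: dump the ORC data under a table's location to a file.
# $1 = table name, $2 = local output file (both are placeholders).
dump_orc_table() {
    table="$1"
    out="$2"
    location=$(hive -e "show create table ${table}" | extract_location)
    # -d prints row data instead of file metadata (Hive 1.1.0 and later).
    hive --orcfiledump -d "$location" > "$out"
}
```

For example, dump_orc_table my_table /tmp/orc_dump.txt followed by grep -n 'search_text' /tmp/orc_dump.txt would search the dumped rows, where my_table and search_text are placeholders.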


For Parquet files


Hive has no direct equivalent of --orcfiledump for Parquet files. Instead, you can use the open source parquet-tools utility to read the file contents.

a) git clone https://github.com/apache/parquet-mr.git

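b) cd parquet-mr/parquet-tools (if this directory is missing from the branch you cloned, check out an older release of parquet-mr that still contains the parquet-tools module)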

c) mvn clean package

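d) hadoop jar ./target/parquet-tools-1.12.0-SNAPSHOT.jar cat hdfs://<hostname>:8020/ar_oracle_new/5bea71478098b6f35325100a/merged/parquet/0/current/part-r-00000 (Maven writes the jar to the target directory, and the version in the jar name depends on the source you checked out)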
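The Parquet steps above can be sketched as follows. This is a sketch under stated assumptions: the helper names find_tools_jar and inspect_parquet are hypothetical, and the jar is assumed to be in the module's target directory after mvn clean package. Besides cat, parquet-tools also provides the schema and head subcommands, which are often enough when only the column types or a few sample rows are needed.

```shell
#!/bin/sh
# Hypothetical helper: locate the parquet-tools jar built by "mvn clean package".
# $1 = path to the parquet-tools module directory.
# The version in the file name varies, so match it with a glob.
find_tools_jar() {
    for jar in "$1"/target/parquet-tools-*.jar; do
        case "$jar" in
            # Skip auxiliary jars that Maven may also produce.
            *-sources.jar|*-tests.jar|*-javadoc.jar) continue ;;
        esac
        [ -f "$jar" ] && { echo "$jar"; return 0; }
    done
    return 1
}

# Hypothetical helper: print the schema and the first 5 rows of a Parquet file.
# $1 = parquet-tools module directory, $2 = HDFS URI of the Parquet file.
inspect_parquet() {
    jar=$(find_tools_jar "$1") || { echo "parquet-tools jar not found" >&2; return 1; }
    hadoop jar "$jar" schema "$2"
    hadoop jar "$jar" head -n 5 "$2"
}
```

A call would look like inspect_parquet ./parquet-mr/parquet-tools hdfs://<hostname>:8020/<path_to_parquet_file>, with the host and path filled in from the table's location.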


We can ask customers to zip the ORC/Parquet files and share them with us so that we can run these commands and check the data in-house.
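One way the customer could package the files is sketched below. The helper names fetch_table_files and pack_dir are hypothetical, and tar with gzip is used here in place of zip on the assumption that it is available on the cluster node; any archive format would serve the same purpose.

```shell
#!/bin/sh
# Hypothetical helper: copy a table's files from HDFS to a local directory.
# $1 = HDFS location of the table, $2 = local destination directory.
fetch_table_files() {
    hdfs dfs -get "$1" "$2"
}

# Hypothetical helper: pack a local directory into a gzip-compressed tar archive.
# $1 = local directory to pack, $2 = archive file to create.
pack_dir() {
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}
```

For example, fetch_table_files <hdfs_location> /tmp/table_files followed by pack_dir /tmp/table_files /tmp/table_files.tar.gz produces a single archive to share, where the local paths are placeholders.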