How to read ORC/Parquet file content in HDFS : Infoworks

Description:

Sometimes we need to read the content of a customer's ORC/Parquet files while debugging issues related to special characters or encoding.

Solution:

In such scenarios, we can ask the customer to share the files under the HDFS location on which the Hive table is created.

For ORC files

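a) Run show create table <table_name> in Hive.

b) From the output, note the HDFS location on which the Hive table is created.

c) Run the command hive --orcfiledump /iw/sources/TD_test/5cebba060867330202f7a513/merged/orc against that location. Without additional options the command prints the file metadata (schema and stripe statistics); add the -d option to print the row data of the ORC files.

d) Redirect the output to a text file, or pipe it to grep, to search for particular content.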
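The ORC steps above can be sketched as a small shell script. This is only an illustration under stated assumptions: the function names extract_location and dump_orc and the default output file orc_dump.txt are hypothetical, and the script assumes that the show create table output prints the quoted path on the line directly after the LOCATION keyword. The -d option asks the dump utility to print row data rather than only file metadata.

```shell
# Sketch only: helper names and the output file name are hypothetical.

# Read "show create table" output on stdin and print the HDFS location.
# Assumes the quoted path sits on the line right after LOCATION.
extract_location() {
  sed -n "/^LOCATION/{n;s/^[[:space:]]*'//;s/'[[:space:]]*\$//;p;q;}"
}

# $1 = table name, $2 = text to search for, $3 = output file (optional)
dump_orc() {
  orc_table="$1"
  orc_pattern="$2"
  orc_out="${3:-orc_dump.txt}"
  # -S keeps Hive quiet so that only the DDL text reaches the pipe.
  orc_location="$(hive -S -e "show create table ${orc_table}" | extract_location)"
  # Save the dump first so it can be searched again without re-reading HDFS.
  hive --orcfiledump -d "$orc_location" > "$orc_out"
  grep -n -- "$orc_pattern" "$orc_out"
}

# Example call (requires a working Hive client):
# dump_orc TD_test 'some text'
```

Writing the dump to a file before searching it keeps the full output available for a second look, which is useful when the problem character is not known in advance.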

For Parquet files

There is no direct equivalent of the ORC dump command for Parquet files. Instead, the open source parquet-tools utility (part of the parquet-mr project) can be used to read the file contents.

a) git clone https://github.com/apache/parquet-mr.git

b) cd parquet-mr/parquet-tools

c) mvn clean package (Maven writes the built jar to the target directory)

d) From the directory that contains the jar, run hadoop jar ./parquet-tools-1.12.0-SNAPSHOT.jar cat hdfs://<hostname>:8020/ar_oracle_new/5bea71478098b6f35325100a/merged/parquet/0/current/part-r-00000
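The Parquet steps above can be wrapped in the same way. This is a rough sketch, not a definitive script: the function names find_tools_jar and cat_parquet are hypothetical, and it assumes the build has already been run so that the jar sits under the module's target directory. The jar version is discovered with a glob rather than hard-coded, because it depends on the checkout.

```shell
# Sketch only: helper names are hypothetical.

# $1 = path to the parquet-tools module directory.
# Prints the first parquet-tools jar found under target, skipping
# auxiliary artifacts such as *-tests.jar or *-sources.jar if present.
find_tools_jar() {
  ls "$1"/target/parquet-tools-*.jar 2>/dev/null \
    | grep -v -e '-tests\.jar$' -e '-sources\.jar$' \
    | head -n 1
}

# $1 = module directory, $2 = HDFS URI of a Parquet file, $3 = text to search for
cat_parquet() {
  pq_jar="$(find_tools_jar "$1")"
  if [ -z "$pq_jar" ]; then
    echo "no parquet-tools jar under $1/target; run mvn clean package first" >&2
    return 1
  fi
  # "cat" prints the records of the file; grep narrows the output.
  hadoop jar "$pq_jar" cat "$2" | grep -n -- "$3"
}

# Example call (requires a Hadoop client; the URI is a placeholder):
# cat_parquet parquet-mr/parquet-tools 'hdfs://<hostname>:8020/<path>/part-r-00000' 'some text'
```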

We can ask customers to zip the ORC/Parquet files and share them with us so that we can run these commands and check the data in-house.
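One possible way for the customer to collect the files is sketched below. It is an assumption-laden illustration: the function names bundle_dir and fetch_and_bundle and the local directory name table_files are hypothetical, and it uses tar with gzip instead of zip simply because tar is available on most Hadoop edge nodes. The files are copied out of HDFS with hdfs dfs -get before they are bundled.

```shell
# Sketch only: helper names and the table_files directory name are hypothetical.

# $1 = local directory to bundle, $2 = archive path to create.
# -C keeps the paths inside the archive relative to the parent directory.
bundle_dir() {
  tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# $1 = HDFS directory of the table, $2 = local working directory
fetch_and_bundle() {
  fb_src="$1"
  fb_work="$2"
  mkdir -p "$fb_work"
  # Copy the table directory from HDFS to local disk.
  hdfs dfs -get "$fb_src" "$fb_work/table_files"
  bundle_dir "$fb_work/table_files" "$fb_work/table_files.tar.gz"
}

# Example call (requires an HDFS client):
# fetch_and_bundle /iw/sources/TD_test/5cebba060867330202f7a513/merged/orc /tmp/share
```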
