Problem Description


During CSV file ingestion, the source files are copied from the source path to the HDFS location, as shown below.


 copying file, source :/opt/prod/common/aoScoring/IN_AO_SCORE/IVR_DISPOSITION_2015_03F/EXA_IVR_DISP00005_20151231000000 , dest: /data/prod/PBM/PUB/RTLPHMCY/RxDW//IVR_DISPOSITION//csv//EXA_IVR_DISP00005_20151231000000


Then we calculate the checksum for the file copied to HDFS to determine if the file has been modified.


calculating checksum for HDFS pathhdfs://rtlhdprod.corp.cvscaremark.com:8020/data/prod/PBM/PUB/RTLPHMCY/RxDW/IVR_DISPOSITION/csv/EXA_IVR_DISP00001_20151231000000


If an already ingested CSV file is removed from the source path and a new file is ingested, the checksum is still calculated for the previously ingested file, because that file remains in the HDFS location.


Consider a scenario where 10-15 very large CSV files have been ingested and then removed from the source path. These files are still present in HDFS, so during the next ingestion of a new CSV file the checksum is calculated for the old files as well.


The time taken by this checksum calculation depends on the size and number of CSV files present in the HDFS location, and it can increase the total ingestion time.
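
The cost described above can be illustrated with a minimal sketch. It is not the product's implementation: it assumes a local directory standing in for the HDFS location, MD5 as the checksum algorithm, and the hypothetical helper names file_checksum and checksum_directory. The point it shows is that a checksum pass reads every byte of every file in the directory, including files that were ingested earlier.

```python
import hashlib
import os


def file_checksum(path, chunk_size=1024 * 1024):
    """Return (hex digest, bytes read) for one file, reading it in chunks."""
    digest = hashlib.md5()  # MD5 is an assumption made for this sketch
    bytes_read = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            bytes_read += len(chunk)
    return digest.hexdigest(), bytes_read


def checksum_directory(directory):
    """Checksum every regular file in a directory, old and new alike."""
    checksums = {}
    total_bytes = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            checksums[name], size = file_checksum(path)
            total_bytes += size
    return checksums, total_bytes
```

In this sketch, total_bytes equals the combined size of all files in the directory, so leaving old files in place makes each pass proportionally more expensive.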


Solution


Set the property below under Admin > Configuration, then click Initialize and ingest. With this property set, the checksum task is skipped and the file's modified time is used instead to determine whether the file has been modified.


modified_time_as_cksum to true
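
The difference between the two detection modes can be sketched as follows. This is an illustration under stated assumptions, not the product's code: local files stand in for HDFS, MD5 stands in for the checksum, and the function names file_fingerprint and is_modified are hypothetical. Only the property name modified_time_as_cksum comes from this article.

```python
import hashlib
import os


def file_fingerprint(path, modified_time_as_cksum=False):
    """Return a value that changes when the file is considered modified."""
    if modified_time_as_cksum:
        # Metadata lookup only: the file contents are never read.
        return str(os.stat(path).st_mtime_ns)
    # Full checksum: every byte of the file is read.
    digest = hashlib.md5()  # MD5 is an assumption made for this sketch
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_modified(path, previous_fingerprint, modified_time_as_cksum=False):
    """Compare the current fingerprint with the one recorded earlier."""
    current = file_fingerprint(path, modified_time_as_cksum)
    return current != previous_fingerprint
```

In this sketch the modified-time mode costs a single metadata call per file regardless of file size. The trade-off is that it reports a change whenever the timestamp changes, even if the contents are identical, and it misses a content change that leaves the timestamp untouched.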