Description: In DFI ingestion, a control file can be used to validate the data ingested from a CSV file. This feature lets the user specify a control file (a file containing metadata about the data file) against which the data file is validated. Here is a sample control file.

The regular expression fields Data Files pattern and Extract format let the user specify the corresponding control file for every data file as a function of the data file path.

Configuration steps:

To configure the Control Files, select the option Source contains control files. 

The control file name should be the same as the CSV file name, with the extension .ctl, and the control file should be placed in the same directory as the source .csv file. For example, if the source file name is control.csv, the control file name should be control.ctl.

Provide the Extract format as $1.ctl so that the control file corresponding to each source file is picked up from the same directory. The following messages are displayed in the ingestion log during validation.

-----log messages------

Validating hdfs://ip-172-30-1-17.ec2.internal:8020/iw/sources/DFI_control_file/test/csv/control.csv at HIVE

[INFO] 2018-06-06 02:15:54,556 [pool-5-thread-1] :: reading hdfs://ip-172-30-1-17.ec2.internal:8020/iw/sources/DFI_control_file/test/csv/control.ctl at HIVE for data file hdfs://ip-172-30-1-17.ec2.internal:8020/iw/sources/DFI_control_file/test/csv/control.csv

[INFO] 2018-06-06 02:15:54,558 [pool-5-thread-1] :: validating count for hdfs://ip-172-30-1-17.ec2.internal:8020/iw/sources/DFI_control_file/test/csv/control.csv at HIVE.
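The log above shows control.ctl being resolved next to control.csv. A minimal Python sketch of that path mapping follows. It assumes a Data Files pattern of `(.*)\.csv`, which is hypothetical because this article does not state the pattern used, together with the Extract format `$1.ctl`. The function name `control_file_for` is illustrative only and is not part of IWX.

```python
import re


def control_file_for(data_path,
                     data_files_pattern=r"(.*)\.csv",   # assumed pattern, not from IWX docs
                     extract_format="$1.ctl"):
    """Return the control file path for a data file path, or None if the
    data file path does not match the pattern."""
    match = re.fullmatch(data_files_pattern, data_path)
    if match is None:
        return None
    # Python templates use \g<N> for capture groups, so translate the
    # $N references of the Extract format before expanding.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", extract_format)
    return match.expand(template)


if __name__ == "__main__":
    print(control_file_for("/iw/sources/DFI_control_file/test/csv/control.csv"))
```

Because the capture group holds the whole path up to the extension, the resulting control file path is in the same directory as the data file.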


During count validation, IWX sums the successful records in the target table and the error records in the error records table in Hive, and checks whether that sum equals the count specified in the control file.

If the counts match, the job proceeds; otherwise, it fails with the exception below.

Table test failed validation failed at HIVE for file hdfs://ip-172-30-1-17.ec2.internal:8020/iw/sources/DFI_control_file/test/csv/control.csv :ControlFileVariable [name=count, value=10,]Stacktrace: java.util.concurrent.ExecutionException: validation failed at HIVE for file hdfs://ip-172-30-1-17.ec2.internal:8020/iw/sources/DFI_control_file/test/csv/control.csv :ControlFileVariable [name=count, value=10,]  
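The count check described above can be sketched in Python as follows. This is only an illustration of the comparison, not the IWX implementation: the function name `validate_count` and the error message are hypothetical, and the record counts are passed in as plain integers rather than queried from Hive.

```python
def validate_count(success_records, error_records, expected_count):
    """Compare the ingested record total with the count from the control file.

    success_records: rows in the target table
    error_records:   rows in the error records table
    expected_count:  the count value read from the control file
    """
    actual = success_records + error_records
    if actual != expected_count:
        # Illustrative message; the real IWX exception text differs.
        raise ValueError(
            f"validation failed: expected count={expected_count}, "
            f"found {actual} (success={success_records}, error={error_records})"
        )
    return actual


if __name__ == "__main__":
    # 8 good rows plus 2 rejected rows equals the control file count of 10.
    validate_count(8, 2, 10)
```

Note that error records count toward the total, so a file with rejected rows still passes as long as every row in the source is accounted for.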

IWX versions 2.3.x, 2.4.x