What is the rationale for running all segments as one job? Doesn't this defeat the purpose of parallel loading?


Let me explain this with a simple example. I have configured a table with the TotalChildren column as the segmentation column. You can see all six possible segments below, along with their respective record counts.


[image.png: the six TotalChildren segments and their record counts]


I selected the segments with TotalChildren values 4, 5 and 2 and ingested them with the option to run the segments as a single job enabled.

Infoworks then deploys a single Databricks cluster job, and the records for all the selected segments are fetched from the source system with a single SQL query as part of that one job. See the chunk load query generated for this job below.



[image.png: job log showing the generated chunk load query]


QueryPattern1: 


chunkLoadQuery : select "CustomerKey" as "CustomerKey" , "GeographyKey" as "GeographyKey" , "CustomerAlternateKey" as "CustomerAlternateKey" , "Title" as "Title" , "FirstName" as "FirstName" , "MiddleName" as "MiddleName" , "LastName" as "LastName" , "NameStyle" as "NameStyle" , "BirthDate" as "BirthDate" , "MaritalStatus" as "MaritalStatus" , "Suffix" as "Suffix" , "Gender" as "Gender" , "EmailAddress" as "EmailAddress" , "YearlyIncome" as "YearlyIncome" , "TotalChildren" as "TotalChildren" , "NumberChildrenAtHome" as "NumberChildrenAtHome" , "EnglishEducation" as "EnglishEducation" , "SpanishEducation" as "SpanishEducation" , "FrenchEducation" as "FrenchEducation" , "EnglishOccupation" as "EnglishOccupation" , "SpanishOccupation" as "SpanishOccupation" , "FrenchOccupation" as "FrenchOccupation" , "HouseOwnerFlag" as "HouseOwnerFlag" , "NumberCarsOwned" as "NumberCarsOwned" , "AddressLine1" as "AddressLine1" , "AddressLine2" as "AddressLine2" , "Phone" as "Phone" , "DateFirstPurchase" as "DateFirstPurchase" , "CommuteDistance" as "CommuteDistance" from "master"."dbo"."DimCustomer" where (("TotalChildren" = 4) OR ("TotalChildren" = 5) OR ("TotalChildren" = 2)) and ("CustomerKey" <= 29481)
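The shape of that WHERE clause can be sketched in a few lines. This is an illustrative reconstruction, not Infoworks internals: the selected segment values are OR-ed into a single predicate, combined with the chunking bound on CustomerKey (the function and parameter names here are hypothetical).

```python
def build_single_job_predicate(segment_column, segment_values, chunk_column, chunk_upper_bound):
    """Sketch of how a single-job chunk load combines selected segments
    into one predicate, so only one query hits the source system."""
    # One clause per selected segment value, OR-ed together
    or_clauses = " OR ".join(
        f'("{segment_column}" = {v})' for v in segment_values
    )
    # Wrap the OR group and append the chunk boundary condition
    return f'({or_clauses}) and ("{chunk_column}" <= {chunk_upper_bound})'

print(build_single_job_predicate("TotalChildren", [4, 5, 2], "CustomerKey", 29481))
# -> (("TotalChildren" = 4) OR ("TotalChildren" = 5) OR ("TotalChildren" = 2)) and ("CustomerKey" <= 29481)
```

This matches the predicate in the logged query above: one connection, one query, all three segments.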

 

Later, I wanted to ingest the remaining three segments, but this time in parallel, so I left the option to run them as a single job unchecked and instead set Max Table Segments in Parallel to 3. Infoworks therefore deployed three Databricks cluster jobs and ingested the segments in parallel.


[image.png: three parallel Databricks cluster jobs, one per remaining segment]


If you download the job log for each of these jobs and inspect the chunk query, you will see that three separate queries are executed on the source system, one per job.


QueryPattern2:


Job1:


chunkLoadQuery : select "CustomerKey" as "CustomerKey" , "GeographyKey" as "GeographyKey" , "CustomerAlternateKey" as "CustomerAlternateKey" , "Title" as "Title" , "FirstName" as "FirstName" , "MiddleName" as "MiddleName" , "LastName" as "LastName" , "NameStyle" as "NameStyle" , "BirthDate" as "BirthDate" , "MaritalStatus" as "MaritalStatus" , "Suffix" as "Suffix" , "Gender" as "Gender" , "EmailAddress" as "EmailAddress" , "YearlyIncome" as "YearlyIncome" , "TotalChildren" as "TotalChildren" , "NumberChildrenAtHome" as "NumberChildrenAtHome" , "EnglishEducation" as "EnglishEducation" , "SpanishEducation" as "SpanishEducation" , "FrenchEducation" as "FrenchEducation" , "EnglishOccupation" as "EnglishOccupation" , "SpanishOccupation" as "SpanishOccupation" , "FrenchOccupation" as "FrenchOccupation" , "HouseOwnerFlag" as "HouseOwnerFlag" , "NumberCarsOwned" as "NumberCarsOwned" , "AddressLine1" as "AddressLine1" , "AddressLine2" as "AddressLine2" , "Phone" as "Phone" , "DateFirstPurchase" as "DateFirstPurchase" , "CommuteDistance" as "CommuteDistance" from "master"."dbo"."DimCustomer" where (("TotalChildren" = 1)) and ("CustomerKey" <= 29481)


Job2:


chunkLoadQuery : select "CustomerKey" as "CustomerKey" , "GeographyKey" as "GeographyKey" , "CustomerAlternateKey" as "CustomerAlternateKey" , "Title" as "Title" , "FirstName" as "FirstName" , "MiddleName" as "MiddleName" , "LastName" as "LastName" , "NameStyle" as "NameStyle" , "BirthDate" as "BirthDate" , "MaritalStatus" as "MaritalStatus" , "Suffix" as "Suffix" , "Gender" as "Gender" , "EmailAddress" as "EmailAddress" , "YearlyIncome" as "YearlyIncome" , "TotalChildren" as "TotalChildren" , "NumberChildrenAtHome" as "NumberChildrenAtHome" , "EnglishEducation" as "EnglishEducation" , "SpanishEducation" as "SpanishEducation" , "FrenchEducation" as "FrenchEducation" , "EnglishOccupation" as "EnglishOccupation" , "SpanishOccupation" as "SpanishOccupation" , "FrenchOccupation" as "FrenchOccupation" , "HouseOwnerFlag" as "HouseOwnerFlag" , "NumberCarsOwned" as "NumberCarsOwned" , "AddressLine1" as "AddressLine1" , "AddressLine2" as "AddressLine2" , "Phone" as "Phone" , "DateFirstPurchase" as "DateFirstPurchase" , "CommuteDistance" as "CommuteDistance" from "master"."dbo"."DimCustomer" where (("TotalChildren" = 0)) and ("CustomerKey" <= 29481)


Job3:


chunkLoadQuery : select "CustomerKey" as "CustomerKey" , "GeographyKey" as "GeographyKey" , "CustomerAlternateKey" as "CustomerAlternateKey" , "Title" as "Title" , "FirstName" as "FirstName" , "MiddleName" as "MiddleName" , "LastName" as "LastName" , "NameStyle" as "NameStyle" , "BirthDate" as "BirthDate" , "MaritalStatus" as "MaritalStatus" , "Suffix" as "Suffix" , "Gender" as "Gender" , "EmailAddress" as "EmailAddress" , "YearlyIncome" as "YearlyIncome" , "TotalChildren" as "TotalChildren" , "NumberChildrenAtHome" as "NumberChildrenAtHome" , "EnglishEducation" as "EnglishEducation" , "SpanishEducation" as "SpanishEducation" , "FrenchEducation" as "FrenchEducation" , "EnglishOccupation" as "EnglishOccupation" , "SpanishOccupation" as "SpanishOccupation" , "FrenchOccupation" as "FrenchOccupation" , "HouseOwnerFlag" as "HouseOwnerFlag" , "NumberCarsOwned" as "NumberCarsOwned" , "AddressLine1" as "AddressLine1" , "AddressLine2" as "AddressLine2" , "Phone" as "Phone" , "DateFirstPurchase" as "DateFirstPurchase" , "CommuteDistance" as "CommuteDistance" from "master"."dbo"."DimCustomer" where (("TotalChildren" = 3)) and ("CustomerKey" <= 29481)
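The parallel mode can be sketched the same way: one independent query per segment, with concurrency capped by Max Table Segments in Parallel. This is a hedged stand-in, not Infoworks code; `fetch_segment` is a hypothetical placeholder for a real JDBC fetch against the source system.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_segment(value):
    # Placeholder for executing, per segment:
    #   select ... from "master"."dbo"."DimCustomer"
    #   where (("TotalChildren" = <value>)) and ("CustomerKey" <= 29481)
    return f"segment {value} loaded"

# One Databricks cluster job per segment, capped at 3 concurrent jobs
MAX_TABLE_SEGMENTS_IN_PARALLEL = 3
remaining_segments = [1, 0, 3]  # assumed: the three segments left after 4, 5 and 2

with ThreadPoolExecutor(max_workers=MAX_TABLE_SEGMENTS_IN_PARALLEL) as pool:
    results = list(pool.map(fetch_segment, remaining_segments))

print(results)
```

Each worker holds its own connection to the source system, which is why this mode requires as many available connections as there are parallel jobs.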


Note: This option gives the user flexibility to balance cost against speed. When only a limited number of connections to the source system is available, running the selected segments as a single job restricts the compute cost and uses a single connection. When multiple connections are available, running the segments in parallel with multiple cluster jobs reduces the time to ingest.
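The trade-off above can be made concrete with a back-of-the-envelope calculation (the per-segment times below are hypothetical, purely for illustration): the wall-clock time is roughly the makespan of scheduling the segment loads across the available cluster jobs.

```python
import heapq

def wall_clock_minutes(segment_minutes, max_parallel_jobs):
    """Estimate total ingestion time: greedy longest-first scheduling of
    segment loads across max_parallel_jobs concurrent cluster jobs."""
    loads = [0.0] * max_parallel_jobs
    heapq.heapify(loads)
    for t in sorted(segment_minutes, reverse=True):
        # Assign the next-longest segment to the least-loaded job
        heapq.heappush(loads, heapq.heappop(loads) + t)
    return max(loads)

times = [12, 9, 7]  # hypothetical per-segment load times, in minutes
print(wall_clock_minutes(times, 1))  # single job, one connection: 28 minutes
print(wall_clock_minutes(times, 3))  # three parallel jobs, three connections: 12 minutes
```

The parallel run finishes in the time of the slowest segment, at the cost of three clusters and three source connections instead of one.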