Issue Summary
Users may observe slow ingestion performance and higher compute costs when ingesting large tables or many tables in a single job. This is often due to the default behavior where ingestion happens sequentially using a single connection. To overcome this, Infoworks provides support for parallel ingestion using Split-By columns and Table Groups.
Solution
1. Use a Split-By Column for Table-Level Parallelism
Split-By enables Spark to partition the table into parallel tasks, each reading a segment of the data independently. This results in faster ingestion, especially for large tables.
How this reduces cost:
Faster ingestion = less cluster runtime
Better resource utilization = avoids overprovisioned VMs sitting idle
Ideal use case:
Primary key or timestamp columns that allow even data distribution.
How it works:
The Split-By column should have calculable min and max values. Ideal columns: primary key, numeric, date, or timestamp.
Infoworks will partition the column range into slices based on the number of connections allowed.
Example: A table with a user_id column ranging from 1 to 1000, split into 10 ranges, results in 10 parallel ingestion tasks.
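The range-slicing described above can be sketched in plain Python. This is an illustrative model only, not Infoworks' internal implementation; the function name and slicing logic are assumptions.

```python
# Hypothetical sketch of how a Split-By range is sliced into parallel tasks.
# Illustrative only -- not Infoworks' internal code.

def split_ranges(min_val, max_val, num_splits):
    """Divide [min_val, max_val] into num_splits contiguous, non-overlapping ranges."""
    total = max_val - min_val + 1
    base, extra = divmod(total, num_splits)
    ranges = []
    start = min_val
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))  # each range becomes one ingestion task
        start = end + 1
    return ranges

# user_id 1..1000 split into 10 ranges -> 10 parallel tasks
print(split_ranges(1, 1000, 10)[0])    # (1, 100)
print(len(split_ranges(1, 1000, 10)))  # 10
```

Each slice can then be read over its own database connection, which is why even distribution of the Split-By column matters: a skewed column produces slices of very different sizes, and the largest slice dominates runtime.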
Optional: If the Split-By column is a date or timestamp, enable Derived Split Column and use a function such as YEAR(date_col) or TO_CHAR(date_col, 'YYYY-MM') to control how data is split.
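To see what a derived expression like YEAR(date_col) does, the sketch below groups timestamp rows into one bucket per year; each bucket would correspond to one parallel slice. The data and bucketing logic are illustrative assumptions, not Infoworks behavior.

```python
# Illustrative sketch: deriving split buckets from a date column with a
# YEAR()-style expression, so each year's rows form one ingestion slice.
from datetime import date

rows = [date(2021, 3, 1), date(2021, 9, 15), date(2022, 1, 5), date(2023, 7, 30)]

buckets = {}
for d in rows:
    buckets.setdefault(d.year, []).append(d)  # YEAR(date_col) as the split key

print(sorted(buckets))  # [2021, 2022, 2023] -> three parallel slices
```

A coarser expression (e.g. YEAR) yields fewer, larger slices; a finer one (e.g. TO_CHAR(date_col, 'YYYY-MM')) yields more, smaller slices. Choose the granularity that roughly matches the number of connections you intend to allow.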
2. Use Table Groups for Multi-Table Management
A Table Group allows multiple tables to be ingested as part of a single job. Two key configurations control how these tables are handled in parallel:
Max Parallel Entities: Max number of tables that can be ingested at once.
Max Connections: Total DB connections Infoworks can use across all tables.
Example:
If your group has 12 tables, but Max Parallel Entities = 3, only 3 tables run at any given time. The rest wait in the queue.
To ingest all 12 at once, set Max Parallel Entities = 12.
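The queueing behavior above can be modeled with a bounded thread pool: Max Parallel Entities caps how many tables ingest at once, and the rest wait. This is a sketch under that assumption; the table names and bookkeeping are illustrative, not Infoworks code.

```python
# Sketch of Table Group scheduling: a pool of MAX_PARALLEL_ENTITIES workers
# ingests 12 tables, so at most 3 run concurrently and the rest queue.
from concurrent.futures import ThreadPoolExecutor
import threading

MAX_PARALLEL_ENTITIES = 3
tables = [f"table_{i}" for i in range(12)]

active = 0
peak = 0
lock = threading.Lock()

def ingest(table):
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)  # track the high-water mark of concurrency
    # ... per-table ingestion work would happen here ...
    with lock:
        active -= 1
    return table

with ThreadPoolExecutor(max_workers=MAX_PARALLEL_ENTITIES) as pool:
    done = list(pool.map(ingest, tables))

print(len(done))                       # all 12 tables eventually ingested
print(peak <= MAX_PARALLEL_ENTITIES)   # concurrency never exceeded the cap
```

Raising max_workers to 12 in this model is the analogue of setting Max Parallel Entities = 12 in the Table Group.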
3. Sample Configuration Strategy
You have:
10 tables: 5 small, 5 large
Goal: Ingest small tables with 1 connection and large tables with 10 parallel threads
Suggested Configuration:
Max Parallel Entities = 10
Max Connections = 55
(5 small × 1 connection + 5 large × 10 connections)
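The connection budget above is just the sum over table tiers. A minimal sketch of the arithmetic, with the tier sizes from this example:

```python
# Connection-budget arithmetic for the sample configuration:
# 5 small tables at 1 connection each, 5 large tables at 10 each.
small_tables, small_conns = 5, 1
large_tables, large_conns = 5, 10

max_connections = small_tables * small_conns + large_tables * large_conns
print(max_connections)  # 55
```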
⚠️ Important: Always check with your source database administrator to understand the upper limit on concurrent connections. Setting this too high may result in connection failures.
⚠️ Common Pitfall
Enabling Split-By at the table level alone does not launch parallel ingestion.
You must also increase Max Connections in the associated Table Group to activate parallel crawling.
Best Practices
| Item | Recommendation |
|---|---|
| Split-By Column | Use primary key or numeric/date columns |
| Avoid Data Skew | Ensure evenly distributed values |
| Derived Split Column | Use for timestamp/date fields |
| Table Group Configuration | Set proper Max Parallel Entities and Max Connections |
| DB Limits | Consult DB admin for connection limits |