Issue Summary

Users may observe slow ingestion performance and higher compute costs when ingesting large tables, or many tables, in a single job. By default, ingestion runs sequentially over a single database connection. To overcome this, Infoworks supports parallel ingestion through Split-By columns and Table Groups.


Solution

1. Use a Split-By Column for Table-Level Parallelism

Split-By enables Spark to partition the table into parallel tasks, each reading a segment of the data independently. This results in faster ingestion, especially for large tables.


How this reduces cost:

  • Faster ingestion = less cluster runtime

  • Better resource utilization = avoids overprovisioned VMs sitting idle

Ideal use case:
Primary key or timestamp columns that allow even data distribution.

How it works:

  • The Split-By column should have computable minimum and maximum values.

  • Ideal columns: primary key, numeric, date, or timestamp.

  • Infoworks will partition the column range into slices based on the number of connections allowed.

Example: A table with a user_id column ranging from 1 to 1000, split into 10 ranges, will result in 10 parallel ingestion tasks.
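To make the slicing concrete, here is a minimal sketch of how range-based splitting could work. This is an illustration of the general technique, not Infoworks' internal implementation; the `split_ranges` helper and the `user_id` predicates are assumptions for the example above.

```python
# Hypothetical sketch: divide a Split-By column's [min, max] range into
# contiguous slices, one per parallel connection.
def split_ranges(min_val, max_val, num_splits):
    """Return num_splits (lower, upper) inclusive ranges covering [min_val, max_val]."""
    step = (max_val - min_val + 1) // num_splits
    ranges = []
    lower = min_val
    for i in range(num_splits):
        # Last slice absorbs any remainder so the full range is covered.
        upper = max_val if i == num_splits - 1 else lower + step - 1
        ranges.append((lower, upper))
        lower = upper + 1
    return ranges

# Each slice becomes an independent query that one connection can run in parallel.
predicates = [
    f"user_id BETWEEN {lo} AND {hi}" for lo, hi in split_ranges(1, 1000, 10)
]
```

For the 1-to-1000 example this yields ten predicates, from `user_id BETWEEN 1 AND 100` through `user_id BETWEEN 901 AND 1000`.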

Optional:
If the Split-By column is a date or timestamp, enable Derived Split Column and use a function like YEAR(date_col) or TO_CHAR(date_col, 'YYYY-MM') to control how data is split.
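The effect of a derived split can be sketched as follows: rather than slicing raw timestamps, rows are grouped by a coarser key such as the year or year-month, mirroring `YEAR(date_col)` or `TO_CHAR(date_col, 'YYYY-MM')`. The `derived_split_key` helper is a hypothetical stand-in for illustration only.

```python
from datetime import date

# Hypothetical illustration of a derived split column: rows that share the
# same derived key land in the same ingestion slice.
def derived_split_key(d: date, granularity: str) -> str:
    if granularity == "year":    # mirrors YEAR(date_col)
        return str(d.year)
    if granularity == "month":   # mirrors TO_CHAR(date_col, 'YYYY-MM')
        return f"{d.year:04d}-{d.month:02d}"
    raise ValueError(f"unknown granularity: {granularity}")

# Three rows collapse into two monthly buckets.
keys = {derived_split_key(d, "month")
        for d in (date(2023, 1, 5), date(2023, 1, 20), date(2023, 2, 1))}
```

Coarser keys (e.g. year) mean fewer, larger slices; finer keys (e.g. month) mean more, smaller slices, which is how the derived expression controls the split.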


2. Use Table Groups for Multi-Table Management

A Table Group allows multiple tables to be ingested as part of a single job. Two key configurations control how these tables are handled in parallel:

  • Max Parallel Entities: Max number of tables that can be ingested at once.

  • Max Connections: Total DB connections Infoworks can use across all tables.

Example:
If your group has 12 tables, but Max Parallel Entities = 3, only 3 tables run at any given time. The rest wait in the queue.
To ingest all 12 at once, set Max Parallel Entities = 12.
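The queueing behavior described above resembles a bounded worker pool: tables beyond the limit wait until a slot frees up. The sketch below simulates this with Python's `ThreadPoolExecutor`; it is an analogy for the scheduling semantics, not Infoworks code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical simulation: Max Parallel Entities = 3 acts like a pool of 3 workers.
MAX_PARALLEL_ENTITIES = 3
active = 0   # tables currently "ingesting"
peak = 0     # highest concurrency observed
lock = threading.Lock()

def ingest(table_name):
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    # ... real ingestion work would happen here ...
    with lock:
        active -= 1

with ThreadPoolExecutor(max_workers=MAX_PARALLEL_ENTITIES) as pool:
    # 12 tables are submitted, but at most 3 run at any given time;
    # the other 9 wait in the queue.
    list(pool.map(ingest, [f"table_{i}" for i in range(12)]))
```

Raising `max_workers` to 12 would let every table start immediately, which is the analogue of setting Max Parallel Entities = 12.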


3. Sample Configuration Strategy

You have:

  • 10 tables: 5 small, 5 large

  • Goal: Ingest each small table with 1 connection and each large table with 10 parallel connections

Suggested Configuration:

  • Max Parallel Entities = 10

  • Max Connections = 55
    (5 small × 1 connection + 5 large × 10 connections)
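The arithmetic behind these settings can be written out as a small helper. The `table_plan` structure is a hypothetical way to express the scenario above; the figures are taken directly from it.

```python
# Hypothetical planning helper: derive Table Group settings from the
# per-table connection needs described above.
table_plan = {
    "small": {"count": 5, "connections_each": 1},
    "large": {"count": 5, "connections_each": 10},
}

# Total connections the group needs if every table runs at once.
max_connections = sum(
    g["count"] * g["connections_each"] for g in table_plan.values()
)
# One entity slot per table so none of them queue.
max_parallel_entities = sum(g["count"] for g in table_plan.values())
```

This reproduces the suggested values: 5 × 1 + 5 × 10 = 55 connections across 10 parallel entities.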

⚠️ Important: Always check with your source database administrator to understand the upper limit on concurrent connections. Setting this too high may result in connection failures.


⚠️ Common Pitfall

Enabling Split-By at the table level alone does not launch parallel ingestion.
You must also increase Max Connections in the associated Table Group to activate parallel crawling.
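One way to reason about this pitfall: the parallelism a table actually achieves is bounded by both its Split-By slice count and the connections the Table Group makes available. The function below is a simplified model of that relationship, not a documented Infoworks formula.

```python
# Hypothetical model: Split-By slices only run in parallel up to the number
# of connections the Table Group grants the table.
def effective_parallelism(split_slices: int, table_group_connections: int) -> int:
    return min(split_slices, table_group_connections)

# 10 Split-By slices but only 1 connection: ingestion is still sequential.
sequential = effective_parallelism(10, 1)
# 10 slices with 10 connections: all slices crawl in parallel.
parallel = effective_parallelism(10, 10)
```

Under this model, enabling Split-By without raising Max Connections leaves the effective parallelism at the connection limit, which matches the pitfall described above.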


Best Practices

  • Split-By Column: Use primary key or numeric/date columns

  • Avoid Data Skew: Ensure evenly distributed values

  • Derived Split Column: Use for timestamp/date fields

  • Table Group Configuration: Set proper Max Parallel Entities and Max Connections

  • DB Limits: Consult your DB admin for connection limits