In this article, you will learn how to optimize data ingestion performance. Batch large datasets into smaller chunks: 250-500 instances per request is usually efficient for new data, and up to 1000 for updates, though the right size ultimately depends on how large each instance is. Use the tqdm module to monitor batch processing with a progress bar. For asset hierarchies, ingest from the root down, optionally using a recursive CTE with DuckDB to assign a 'level' column that makes the ordering straightforward. When working with transformations in CDF, use the is_new function for delta loads to avoid re-reading the entire source table; it works with a lastUpdatedTime column for RAW tables and with the sync API and a label for Data Models. Finally, improve performance by trimming sparsely populated data and by scheduling transformations in workflows to enable a more dynamic, dependency-aware ingestion flow.
Batching Large Data Quantities into Chunks
You can achieve steadier throughput, while also reducing the risk of encountering errors, by splitting your requests into smaller batches. The batch size should take the size of each instance into account, but I generally find that around 250-500 instances per request is quite efficient.
Sometimes 1000 instances at a time is manageable as well, but in my experience that size is better suited for updates to existing instances.
Here's an example function that creates a generator of batches. It accepts a list of elements that you want to split into smaller batches:
def generate_batches(data: list, batch_size: int = 1000):
    for i in range(0, len(data), batch_size):
        yield data[i : i + batch_size]
Use tqdm to Monitor Ingestion Process
One thing I've grown fond of in projects where I have to ingest data manually is using the tqdm module in Python to keep track of how far the process has gotten.
It provides a progress bar and an estimated time before completion so you get a better overview of when the ingestion process will be completed.
We can extend our code snippet to use tqdm for monitoring the ingestion process:
import tqdm

batches = generate_batches(data, batch_size=500)
for batch in tqdm.tqdm(batches):
    cdf_client.data_modeling.instances.apply(nodes=batch)

Then, when running the for loop, we get a progress bar similar to the one below!

The only caveat is that multithreading is not supported with tqdm.
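One extra detail: because generate_batches returns a generator, tqdm cannot know how many batches to expect, so the percentage and estimated time won't show up on their own. Here is a minimal sketch of how you might pass the total yourself, building on the snippets above (the batch size of 500 is just an example):

import math

import tqdm

batch_size = 500
batches = generate_batches(data, batch_size=batch_size)
n_batches = math.ceil(len(data) / batch_size)  # tqdm needs this to estimate progress

for batch in tqdm.tqdm(batches, total=n_batches):
    cdf_client.data_modeling.instances.apply(nodes=batch)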
Ingesting Asset Hierarchies to CogniteAsset (CDM)
Something I found out the hard way in my project migrating snapshot data from Classic CDF to CDM is that when ingesting a complete asset hierarchy, it's best to ingest from the root and down through the hierarchy.
This is the recommended approach since the higher up in the hierarchy you change something, the more work the path materializer needs to redo. So if you keep adding new roots, all paths need to be rematerialized over and over.
Creating a ‘level’ Column to Make Root-and-Down Ingestion Easy
If you are going to migrate an asset hierarchy from Classic CDF to CDM, perhaps with a parquet dump provided by the Toolkit, I can recommend an approach that makes the root-and-down ingestion easy.
A very performant option for calculating the depth of all assets in your asset hierarchy is to use a DuckDB-integration with Pandas, which allows you to use a recursive CTE to set a ‘level’ on all assets. Then given that ‘level’ column, you can batch your assets first by level, then by reasonably sized chunks.
DuckDB Query
The recursive CTE can be applied on a Pandas DataFrame so you can keep your work within Python.
import duckdb
import pandas as pd

def duckdb_assign_levels(df: pd.DataFrame) -> pd.DataFrame:
    """Assign levels to every asset in the asset hierarchy using duckdb and recursive CTE."""
    query = """
    WITH RECURSIVE hierarchy AS (
        SELECT externalId, parentExternalId, 0 AS level
        FROM df
        WHERE parentExternalId IS NULL
        UNION ALL
        SELECT a.externalId, a.parentExternalId, h.level + 1
        FROM df a
        JOIN hierarchy h ON a.parentExternalId = h.externalId
    )
    SELECT * FROM hierarchy
    """
    return duckdb.query(query).to_df()
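Given the level column, the ingestion order follows naturally: process level 0 first, then level 1, and so on, batching within each level. Here is a minimal sketch of how that could look, building on the earlier snippets. The to_asset_node helper is hypothetical and stands in for whatever mapping you use to turn a row from the dump into a CogniteAsset node apply object:

import tqdm

levels = duckdb_assign_levels(df)
df_with_levels = df.merge(levels[["externalId", "level"]], on="externalId")

# Ingest level by level, from the roots (level 0) and down
for level in sorted(df_with_levels["level"].unique()):
    rows = df_with_levels[df_with_levels["level"] == level]
    # to_asset_node is a placeholder for your own row-to-node mapping
    nodes = [to_asset_node(row) for _, row in rows.iterrows()]
    for batch in tqdm.tqdm(generate_batches(nodes, batch_size=500)):
        cdf_client.data_modeling.instances.apply(nodes=batch)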
CDF Transformations and ‘is_new’ for Delta Load
A crucial detail when writing performant transformations in CDF is to use the 'is_new' function to avoid reading the entire RAW table or source data model over and over again.
The transformation computes the delta based on the last successful run it had.
The ‘is_new’ function is treated a bit differently between reading from a RAW table versus reading from a data model.
- For RAW tables a typical column to use for determining the delta is the ‘lastUpdatedTime’.
- For Data Models, the sync API is used. Instead of assigning the column to base the delta off of, you assign it a label, which is used for identifying the filter. For the delta to work, you need to keep the cursor alive, which means ensuring the transformation runs more often than every third day.
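As a sketch, a delta query against a RAW table could look something like the following. The database, table, and column names are placeholders, and the exact is_new arguments should be checked against the Transformations documentation for your setup:

-- Only rows changed since the last successful run are read.
-- 'my_db_my_table' is just a name identifying the internal delta state.
SELECT
  key AS externalId,
  name,
  description
FROM `my-db`.`my-table`
WHERE is_new('my_db_my_table', lastUpdatedTime)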
A common labeling technique I’ve found to be useful is to label the ‘is_new’ according to the last date it was necessary to run a backfill. Changing the label of the ‘is_new’ is treated as a new state in the transformation, which enables you to run the transformation on the entire table again if necessary.
Typically, you will only have an 'is_new' on the main table you're reading from when joining multiple sources together. Having an 'is_new' on multiple sources in the same transformation won't necessarily yield the results you expect, unless you know that all the tables are in sync.
Keep in mind that ‘is_new’ is ignored when running the transformation preview.
Disclaimer: the 'lastUpdatedTime' column can sometimes change even when the data in the row hasn't, so the delta doesn't reflect a real change in the data. The cause isn't immediately clear.
This affects both the 'is_new' function and the cursor, so all affected rows will be picked up in the delta. If this is the case for you, an alternative solution is to base the delta on a different column, or to use a state store.
Trimming the Data Sent to RAW
To optimize performance in transformations, avoid sending sparsely populated data: rows with a lot of None or empty-string values take up more space, so leave those values out unless they are strictly necessary.
A downside worth keeping in mind is that Transformations infer the schema from a subset of the rows in the RAW table. So if the rows in a RAW table contain very varied data, there is a chance that certain columns are omitted from the inferred schema.
To prevent that, you can specify your 'from' statement using cdf_raw("database-name", "table-name"). This is also covered in the Cognite documentation.
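As a short sketch, the difference is only in the from clause (the database and table names are placeholders):

-- Reading via cdf_raw helps avoid columns being omitted by schema inference
SELECT *
FROM cdf_raw("database-name", "table-name")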
Trimming the Data Sent to RAW from Python Code
You can achieve similar behaviour with the Python SDK when inserting dataframes into RAW. When using 'insert_dataframe', you can specify the dropna parameter to remove NaN values. Fortunately, you might already be doing this without knowing it, since it defaults to True!
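Here is a minimal sketch of what that can look like, assuming the client configuration is handled elsewhere and using placeholder database and table names:

import pandas as pd
from cognite.client import CogniteClient

cdf_client = CogniteClient()  # assumes authentication/config is set up elsewhere

df = pd.DataFrame(
    {"name": ["pump-01", "pump-02"], "description": ["Main pump", None]},
    index=["row-1", "row-2"],  # the index is used as the RAW row key
)

# dropna=True (the default) drops NaN/None values so they are not written to RAW
cdf_client.raw.rows.insert_dataframe(
    db_name="my-database", table_name="my-table", dataframe=df, dropna=True
)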
Scheduling Transformations in Workflows
Scheduling CDF Transformations in Workflows, rather than relying only on each transformation's own schedule, is preferable if you want a more dynamic ingestion flow.
If you have transformations or functions that depend on another transformation having run successfully, scheduling them in a workflow lets the downstream tasks start as soon as that transformation completes.
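Below is a sketch of what a workflow with two dependent transformation tasks could look like with the Python SDK. The workflow, task, and transformation external IDs are placeholders, and the exact class names and parameters may differ between SDK versions, so treat this as an outline rather than a drop-in implementation:

from cognite.client import CogniteClient
from cognite.client.data_classes import (
    TransformationTaskParameters,
    WorkflowDefinitionUpsert,
    WorkflowTask,
    WorkflowUpsert,
    WorkflowVersionUpsert,
)

cdf_client = CogniteClient()  # assumes authentication/config is set up elsewhere

# Placeholder workflow with two transformation tasks, where the second
# depends on the first and starts as soon as it completes successfully.
cdf_client.workflows.upsert(
    WorkflowUpsert(external_id="ingestion_workflow", description="Example ingestion flow")
)

version = WorkflowVersionUpsert(
    workflow_external_id="ingestion_workflow",
    version="v1",
    workflow_definition=WorkflowDefinitionUpsert(
        description="Load RAW data, then contextualize it",
        tasks=[
            WorkflowTask(
                external_id="load_assets",
                parameters=TransformationTaskParameters(external_id="tr_load_assets"),
            ),
            WorkflowTask(
                external_id="contextualize_assets",
                parameters=TransformationTaskParameters(external_id="tr_contextualize_assets"),
                depends_on=["load_assets"],
            ),
        ],
    ),
)
cdf_client.workflows.versions.upsert(version)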