Summary
This How-to guide provides practical tips and best practices for creating efficient, scalable, and well-managed transformations in Cognite Data Fusion (CDF).
Understand the Goal of a Transformation
Transformations in CDF enable you to extract, transform, and load (ETL) data across various data sources.
A good transformation should be:
- Efficient – uses minimal compute and API resources.
- Reliable – handles retries, throttling, and scaling gracefully.
- Maintainable – easy to monitor, schedule, and troubleshoot.
Filter Early and Select Only What You Need
Reducing data size early in your query leads to faster, lighter transformations.
- Apply WHERE clauses to limit datasets.
- Select only required columns.
- Avoid SELECT * in production transformations.
Example:
SELECT asset_id, timestamp, temperature
FROM raw.sensors.readings
WHERE timestamp >= CURRENT_DATE - INTERVAL '1 DAY'

Improves performance and reduces load on the data platform.
Use Incremental Logic with is_new
The is_new flag allows transformations to process only new or changed records since the last run.
Benefits:
- Processes only recent updates.
- Reduces execution time and data volume.
- Prevents duplicate processing.
- Minimizes API rate-limit (429) errors.
select * from mydb.mytable
where is_new("mydb_mytable_version", lastUpdatedTime)
-- Returns only rows that have changed since the last successful run
Resource: CDF Is_new Query in Transformations
Avoid Heavy or Unnecessary Joins
Large joins are resource-intensive and can cause slow execution or memory errors.
Optimization tips:
- Filter both datasets before joining.
- Join only on the necessary columns.
- Use smaller lookup tables or pre-aggregated data when possible.
Example:
SELECT A.asset_id, A.avg_value, B.location
FROM (
SELECT asset_id, AVG(value) AS avg_value
FROM raw.sensor_data
WHERE timestamp >= CURRENT_DATE - INTERVAL '1 DAY'
GROUP BY asset_id
) A
JOIN raw.asset_info B
ON A.asset_id = B.asset_id;

More efficient joins with reduced compute load.
Schedule Transformations Strategically
Avoid running too many heavy transformations concurrently.
- Use transformation schedules to stagger execution times.
- Run compute-heavy jobs during off-peak hours.
- Combine smaller, dependent transformations into single workflow sequences.
- If a transformation takes more than 30 minutes to run because of the data volume it processes, schedule it at intervals of at least 40 minutes to 1 hour so that consecutive runs do not overlap, and monitor it regularly (see the scheduling sketch below).
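If you manage schedules as code, a schedule can also be created with the Cognite Python SDK. The sketch below is a minimal example, assuming the client configuration (project and credentials) is already set up and that a transformation with the hypothetical external ID transform_equipment_data_staging exists; the cron expression is illustrative:

from cognite.client import CogniteClient
from cognite.client.data_classes import TransformationSchedule

client = CogniteClient()  # assumes project and credentials are already configured

# Schedule the (hypothetical) transformation to run once per hour.
# For compute-heavy jobs, prefer an off-peak cron such as "0 2 * * *" (02:00 every night).
schedule = TransformationSchedule(
    external_id="transform_equipment_data_staging",  # external ID of the transformation (hypothetical)
    interval="0 * * * *",  # cron expression: minute 0 of every hour
    is_paused=False,
)
client.transformations.schedules.create(schedule)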
Run Heavy or Failing Transformations via Workflows
Running large or frequently failing transformations directly can make them difficult to manage.
Instead, orchestrate them through CDF Workflows for better control, error handling, and visibility.
Why Run Heavy Transformations via Workflows
- Enables retry logic, error notifications, and parallel execution control.
- Allows you to chain multiple transformations in logical order (staging → enrichment → modeling).
- Simplifies recovery and monitoring for complex pipelines.
- Prevents system overload by sequencing heavy jobs.
Recommended use cases:
- Heavy transformations processing millions of rows.
- Transformations that frequently time out or fail due to data volume.
- Critical pipelines requiring orchestration with alerts and checkpoints.
Benefits:
- Centralized management and monitoring.
- Automatic retries and notifications.
- Controlled execution flow and dependency handling.
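As an illustration, a staging and an enrichment transformation could be chained in a workflow through the Cognite Python SDK roughly as sketched below. This is not a definitive implementation: the transformation and workflow external IDs are hypothetical, the retry and timeout values are examples, and the data class names and fields (WorkflowUpsert, WorkflowVersionUpsert, WorkflowDefinitionUpsert, WorkflowTask, TransformationTaskParameters) should be verified against the SDK version you use.

from cognite.client import CogniteClient
from cognite.client.data_classes import (
    TransformationTaskParameters,
    WorkflowDefinitionUpsert,
    WorkflowTask,
    WorkflowUpsert,
    WorkflowVersionUpsert,
)

client = CogniteClient()  # assumes project and credentials are already configured

# Create (or update) the workflow itself.
client.workflows.upsert(
    WorkflowUpsert(external_id="equipment_pipeline", description="Equipment data pipeline")
)

# Two chained steps: enrichment only starts after staging succeeds.
staging_task = WorkflowTask(
    external_id="run_staging",
    parameters=TransformationTaskParameters(external_id="transform_equipment_data_staging"),
    retries=3,      # automatic retries on failure
    timeout=3600,   # seconds
)
enrichment_task = WorkflowTask(
    external_id="run_enrichment",
    parameters=TransformationTaskParameters(external_id="transform_equipment_data_enrichment"),
    depends_on=["run_staging"],  # runs only after the staging task completes
    retries=3,
)

client.workflows.versions.upsert(
    WorkflowVersionUpsert(
        workflow_external_id="equipment_pipeline",
        version="v1",
        workflow_definition=WorkflowDefinitionUpsert(
            tasks=[staging_task, enrichment_task],
            description="Staging -> enrichment for equipment data",
        ),
    )
)

The workflow version can then be started on demand or on a schedule, and each execution is visible in the Workflows UI with per-task status.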
Reference: CDF Workflows Documentation
Monitor and Tune Regularly
Regularly review transformation performance:
- Use transformation run logs to identify bottlenecks or frequent 429s.
- Track duration and data volume per run.
- Adjust batch size, filters, or schedules as data grows.
- For repeatedly failing transformations, migrate them to Workflows for reliability and better insight.
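For a quick programmatic health check, recent runs can be listed with the Python SDK and inspected for status and duration. A minimal sketch, assuming the client configuration is set up; the transformation external ID is hypothetical, and the parameter and field names shown should be checked against the SDK reference:

from cognite.client import CogniteClient

client = CogniteClient()  # assumes project and credentials are already configured

# List the 20 most recent runs of a (hypothetical) transformation.
runs = client.transformations.runs.list(
    transformation_external_id="transform_equipment_data_staging",
    limit=20,
)
for run in runs:
    duration_s = None
    if run.started_time and run.finished_time:
        duration_s = (run.finished_time - run.started_time) / 1000  # timestamps are in milliseconds
    print(run.status, duration_s)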
Keep Transformations Modular and Documented
Maintain transformation clarity and traceability:
- Split large logic into multiple smaller transformations (e.g., staging → enrichment → output).
- Use descriptive names like transform_equipment_data_staging so each transformation is easy to identify and track (see the sketch after this list).
- Add comments in SQL/Python to explain the logic.
- Split processing into batches wherever possible.
- Add error handling for expected failure cases.
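As an illustration of modular design and descriptive naming, a small staging transformation can be created as its own resource with the Python SDK, keeping enrichment and output as separate transformations. A sketch only, assuming the client configuration is set up; the external IDs, query, RAW destination, and field names are illustrative and should be verified against your SDK version:

from cognite.client import CogniteClient
from cognite.client.data_classes import Transformation, TransformationDestination

client = CogniteClient()  # assumes project and credentials are already configured

# Staging step only: filter, select the needed columns, and write to an intermediate RAW table.
staging = Transformation(
    external_id="transform_equipment_data_staging",  # descriptive, hypothetical name
    name="Equipment data - staging",
    query="""
        select asset_id, timestamp, temperature
        from mydb.readings
        where is_new("mydb_readings_version", lastUpdatedTime)
    """,
    destination=TransformationDestination.raw("staging_db", "equipment_readings"),
    conflict_mode="upsert",
    ignore_null_fields=True,
)
client.transformations.create(staging)

Enrichment and output steps would be created the same way (for example transform_equipment_data_enrichment and transform_equipment_data_output) and chained in a workflow as described above.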