Summary
This How-to guide provides practical tips and best practices for creating efficient, scalable, and well-managed transformations in Cognite Data Fusion (CDF).
Understand the Goal of a Transformation
Transformations in CDF enable you to extract, transform, and load (ETL) data across various data sources.
A good transformation should be:
- Efficient – uses minimal compute and API resources.
- Reliable – handles retries, throttling, and scaling gracefully.
- Maintainable – easy to monitor, schedule, and troubleshoot.
Filter Early and Select Only What You Need
Reducing data size early in your query leads to faster, lighter transformations.
- Apply WHERE clauses to limit datasets.
- Select only required columns.
- Avoid SELECT * in production transformations.
Example:
SELECT asset_id, timestamp, temperature
FROM raw.sensors.readings
WHERE timestamp >= CURRENT_DATE - INTERVAL '1 DAY'

Improves performance and reduces load on the data platform.
Use Incremental Logic with is_new
The is_new flag allows transformations to process only new or changed records since the last run.
Benefits:
- Processes only recent updates.
- Reduces execution time and data volume.
- Prevents duplicate processing.
- Minimizes API rate-limit (429) errors.
select * from mydb.mytable
where is_new("mydb_mytable_version", lastUpdatedTime)
-- Returns only rows that have changed since the last successful run
Resource: CDF Is_new Query in Transformations
Avoid Heavy or Unnecessary Joins
Large joins are resource-intensive and can cause slow execution or memory errors.
Optimization tips:
- Filter both datasets before joining.
- Join only on the necessary columns.
- Use smaller lookup tables or pre-aggregated data when possible.
Example:
SELECT A.asset_id, A.avg_value, B.location
FROM (
SELECT asset_id, AVG(value) AS avg_value
FROM raw.sensor_data
WHERE timestamp >= CURRENT_DATE - INTERVAL '1 DAY'
GROUP BY asset_id
) A
JOIN raw.asset_info B
ON A.asset_id = B.asset_id;

More efficient joins with reduced compute load.
Schedule Transformations Strategically
Avoid running too many heavy transformations concurrently.
- Use transformation schedules to stagger execution times.
- Run compute-heavy jobs during off-peak hours.
- Combine smaller, dependent transformations into single workflow sequences.
- If a transformation takes more than 30 minutes to run because of the data volume it processes, schedule it at intervals of at least 40 minutes to 1 hour so that consecutive runs do not overlap, and monitor it regularly (see the scheduling sketch below).
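If you manage schedules as code, a schedule can also be created with the Cognite Python SDK. The sketch below is a minimal example, assuming the client configuration (project and credentials) is already set up and that a transformation with the hypothetical external ID transform_equipment_data_staging exists; the cron expression is illustrative:

from cognite.client import CogniteClient
from cognite.client.data_classes import TransformationSchedule

client = CogniteClient()  # assumes project and credentials are already configured

# Schedule the (hypothetical) transformation to run once per hour.
# For compute-heavy jobs, prefer an off-peak cron such as "0 2 * * *" (02:00 every night).
schedule = TransformationSchedule(
    external_id="transform_equipment_data_staging",  # external ID of the transformation (hypothetical)
    interval="0 * * * *",  # cron expression: minute 0 of every hour
    is_paused=False,
)
client.transformations.schedules.create(schedule)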
Run Heavy or Failing Transformations via Workflows
Running large or frequently failing transformations directly can make them difficult to manage.
Instead, orchestrate them through CDF Workflows for better control, error handling, and visibility.
Why Run Heavy Transformations via Workflows
- Enables retry logic, error notifications, and parallel execution control.
- Allows you to chain multiple transformations in logical order (staging → enrichment → modeling).
- Simplifies recovery and monitoring for complex pipelines.
- Prevents system overload by sequencing heavy jobs.
Recommended use cases:
- Heavy transformations processing millions of rows.
- Transformations that frequently time out or fail due to data volume.
- Critical pipelines requiring orchestration with alerts and checkpoints.
Benefits:
- Centralized management and monitoring.
- Automatic retries and notifications.
- Controlled execution flow and dependency handling.
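As an illustration, a staging and an enrichment transformation could be chained in a workflow through the Cognite Python SDK roughly as sketched below. This is not a definitive implementation: the transformation and workflow external IDs are hypothetical, the retry and timeout values are examples, and the data class names and fields (WorkflowUpsert, WorkflowVersionUpsert, WorkflowDefinitionUpsert, WorkflowTask, TransformationTaskParameters) should be verified against the SDK version you use.

from cognite.client import CogniteClient
from cognite.client.data_classes import (
    TransformationTaskParameters,
    WorkflowDefinitionUpsert,
    WorkflowTask,
    WorkflowUpsert,
    WorkflowVersionUpsert,
)

client = CogniteClient()  # assumes project and credentials are already configured

# Create (or update) the workflow itself.
client.workflows.upsert(
    WorkflowUpsert(external_id="equipment_pipeline", description="Equipment data pipeline")
)

# Two chained steps: enrichment only starts after staging succeeds.
staging_task = WorkflowTask(
    external_id="run_staging",
    parameters=TransformationTaskParameters(external_id="transform_equipment_data_staging"),
    retries=3,      # automatic retries on failure
    timeout=3600,   # seconds
)
enrichment_task = WorkflowTask(
    external_id="run_enrichment",
    parameters=TransformationTaskParameters(external_id="transform_equipment_data_enrichment"),
    depends_on=["run_staging"],  # runs only after the staging task completes
    retries=3,
)

client.workflows.versions.upsert(
    WorkflowVersionUpsert(
        workflow_external_id="equipment_pipeline",
        version="v1",
        workflow_definition=WorkflowDefinitionUpsert(
            tasks=[staging_task, enrichment_task],
            description="Staging -> enrichment for equipment data",
        ),
    )
)

The workflow version can then be started on demand or on a schedule, and each execution is visible in the Workflows UI with per-task status.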
Reference: CDF Workflows Documentation
Monitor and Tune Regularly
Regularly review transformation performance:
- Use transformation run logs to identify bottlenecks or frequent 429s.
- Track duration and data volume per run.
- Adjust batch size, filters, or schedules as data grows.
- For repeatedly failing transformations, migrate them to Workflows for reliability and better insight.
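For a quick programmatic health check, recent runs can be listed with the Python SDK and inspected for status and duration. A minimal sketch, assuming the client configuration is set up; the transformation external ID is hypothetical, and the parameter and field names shown should be checked against the SDK reference:

from cognite.client import CogniteClient

client = CogniteClient()  # assumes project and credentials are already configured

# List the 20 most recent runs of a (hypothetical) transformation.
runs = client.transformations.runs.list(
    transformation_external_id="transform_equipment_data_staging",
    limit=20,
)
for run in runs:
    duration_s = None
    if run.started_time and run.finished_time:
        duration_s = (run.finished_time - run.started_time) / 1000  # timestamps are in milliseconds
    print(run.status, duration_s)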
Keep Transformations Modular and Documented
Maintain transformation clarity and traceability:
- Split large logic into multiple smaller transformations (e.g., staging → enrichment → output).
- Use descriptive names like transform_equipment_data_staging so each transformation is easy to identify and track (see the sketch after this list).
- Add comments in SQL/Python to explain the logic.
- Split processing into batches wherever possible.
- Add error handling for expected failure cases.
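As an illustration of modular design and descriptive naming, a small staging transformation can be created as its own resource with the Python SDK, keeping enrichment and output as separate transformations. A sketch only, assuming the client configuration is set up; the external IDs, query, RAW destination, and field names are illustrative and should be verified against your SDK version:

from cognite.client import CogniteClient
from cognite.client.data_classes import Transformation, TransformationDestination

client = CogniteClient()  # assumes project and credentials are already configured

# Staging step only: filter, select the needed columns, and write to an intermediate RAW table.
staging = Transformation(
    external_id="transform_equipment_data_staging",  # descriptive, hypothetical name
    name="Equipment data - staging",
    query="""
        select asset_id, timestamp, temperature
        from mydb.readings
        where is_new("mydb_readings_version", lastUpdatedTime)
    """,
    destination=TransformationDestination.raw("staging_db", "equipment_readings"),
    conflict_mode="upsert",
    ignore_null_fields=True,
)
client.transformations.create(staging)

Enrichment and output steps would be created the same way (for example transform_equipment_data_enrichment and transform_equipment_data_output) and chained in a workflow as described above.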