Summary
When building workflows in Cognite Data Fusion to populate views in Data Models, there is often a need for intermediate, curated datasets between raw source data and the final target data model.
Today, this intermediate data can be stored in RAW tables, but that requires customers to manage temporary tables, cleanup logic, naming conventions, and lifecycle handling manually. A native, workflow-managed temporary storage layer would make workflow development cleaner, reduce repetitive transformation logic, and simplify the overall implementation.
Business Context
We are using CDF workflows to populate an Asset Hierarchy data model.
The source data comes from multiple systems:
- SAP
  - Functional locations
  - Equipment
- AVEVA PI
  - Tag metadata
The target asset hierarchy data model contains the following views:
- Site
- Area
- Line
- Equipment
- System
- Subsystem
- Tag
The source metadata arrives without the required treatment, standardization, or contextualization. Before writing to the final data model, the data needs to be cleaned, normalized, enriched, and structured according to the target hierarchy.
Current Challenge
In practice, the workflow needs several intermediate transformation steps before writing to the final views.
For example, the workflow may need to transform SAP functional locations into a cleaned and standardized structure before deriving sites, areas, and lines.
Example flow for Site:
tb_functionalLocation
-> tb_functionalLocation_curated
-> tb_site_curated
-> Site view
Example flow for Area:
tb_functionalLocation
-> tb_functionalLocation_curated
-> tb_area_curated
uses tb_site_curated for contextualization
-> Area view
Example flow for Line:
tb_functionalLocation
-> tb_functionalLocation_curated
-> tb_line_curated
uses tb_area_curated for contextualization
-> Line view
Example flow for Equipment:
tb_equipment
-> tb_equipment_curated
-> tb_equipment_contextualized
uses tb_line_curated / tb_area_curated for hierarchy mapping
-> Equipment view
Example flow for System and Subsystem:
tb_functionalLocation + tb_equipment
-> curated functional location and equipment tables
-> tb_system_curated
-> System view
tb_functionalLocation + tb_equipment
-> curated functional location and equipment tables
-> tb_subsystem_curated
-> Subsystem view
Example flow for Tag:
tb_tag
-> tb_tag_curated
-> tb_tag_contextualized
uses equipment/system/subsystem curated data
-> Tag view
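The flows above form a dependency graph, which can be sketched in Python to show how much ordering the workflow has to encode. This is a minimal illustration, not part of any CDF API: the table and view names come from the flows above, "curated functional location and equipment tables" is read as tb_functionalLocation_curated and tb_equipment_curated, and the Tag flow is assumed to read tb_equipment_curated, tb_system_curated, and tb_subsystem_curated.

```python
from graphlib import TopologicalSorter

# Each key is a table or view; the value is the set of inputs it reads from.
# The exact inputs of the System, Subsystem, and Tag steps are assumptions.
DEPENDENCIES = {
    "tb_functionalLocation_curated": {"tb_functionalLocation"},
    "tb_site_curated": {"tb_functionalLocation_curated"},
    "tb_area_curated": {"tb_functionalLocation_curated", "tb_site_curated"},
    "tb_line_curated": {"tb_functionalLocation_curated", "tb_area_curated"},
    "tb_equipment_curated": {"tb_equipment"},
    "tb_equipment_contextualized": {
        "tb_equipment_curated", "tb_line_curated", "tb_area_curated",
    },
    "tb_system_curated": {"tb_functionalLocation_curated", "tb_equipment_curated"},
    "tb_subsystem_curated": {"tb_functionalLocation_curated", "tb_equipment_curated"},
    "tb_tag_curated": {"tb_tag"},
    "tb_tag_contextualized": {
        "tb_tag_curated", "tb_equipment_curated",
        "tb_system_curated", "tb_subsystem_curated",
    },
    "Site view": {"tb_site_curated"},
    "Area view": {"tb_area_curated"},
    "Line view": {"tb_line_curated"},
    "Equipment view": {"tb_equipment_contextualized"},
    "System view": {"tb_system_curated"},
    "Subsystem view": {"tb_subsystem_curated"},
    "Tag view": {"tb_tag_contextualized"},
}

# A valid execution order: every table appears after all of its inputs.
order = list(TopologicalSorter(DEPENDENCIES).static_order())

# Intermediate tables are produced by the workflow but are not final views.
intermediate = [
    name for name in order
    if name in DEPENDENCIES and not name.endswith(" view")
]

# Steps that read the curated functional location table directly.
fl_consumers = [
    name for name, inputs in DEPENDENCIES.items()
    if "tb_functionalLocation_curated" in inputs
]

print(order)
print(len(intermediate), "intermediate tables")
print(len(fl_consumers), "steps reuse tb_functionalLocation_curated")
```

Under these assumptions, seven target views need ten intermediate tables, and the curated functional location table alone is reused by five downstream steps.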
These intermediate curated tables are useful because they allow the workflow to:
- Reuse cleaning and standardization logic across multiple transformations.
- Avoid duplicating the same transformation logic in every step.
- Avoid using the final data model views as inputs to transformation logic.
- Keep the workflow logic easier to understand and maintain.
- Separate raw source data, intermediate workflow state, and final modeled data.
However, this intermediate data is typically transient: it is only needed while the workflow is running and can be deleted after the workflow finishes successfully. It would also be valuable to let users optionally view these intermediate datasets to analyze and debug the quality of the contextualization.
Today, we can use RAW tables as intermediate storage, but this creates additional responsibilities for the customer:
- Creating and maintaining temporary RAW tables.
- Cleaning intermediate tables before or after each workflow run.
- Preventing stale intermediate data from being reused accidentally.
- Managing naming conventions for temporary workflow data.
- Adding cleanup steps to the workflow.
- Handling failed workflow runs where temporary data may be left behind.
- Writing additional code that is not part of the actual business transformation.
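To make that overhead concrete, here is a minimal sketch of the kind of lifecycle code a customer ends up writing around the actual transformations. InMemoryTableStore is a hypothetical stand-in for a table store such as RAW, not the CDF SDK, and the tmp_<run id>_ naming convention and the keep_on_failure flag are assumptions for illustration only.

```python
import uuid
from contextlib import contextmanager


class InMemoryTableStore:
    """Hypothetical stand-in for a table store such as RAW (not the CDF SDK)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name):
        self.tables[name] = []

    def delete_table(self, name):
        # Deleting a missing table is a no-op so cleanup can be retried safely.
        self.tables.pop(name, None)


@contextmanager
def temporary_tables(store, base_names, keep_on_failure=False):
    """Create run-scoped temporary tables and remove them when the run ends."""
    # A per-run prefix prevents stale data from an earlier run being reused.
    run_id = uuid.uuid4().hex[:8]
    names = {base: f"tmp_{run_id}_{base}" for base in base_names}
    for name in names.values():
        store.create_table(name)
    failed = False
    try:
        yield names
    except Exception:
        failed = True
        raise
    finally:
        # Clean up unless the run failed and the data is kept for debugging.
        if not (failed and keep_on_failure):
            for name in names.values():
                store.delete_table(name)


store = InMemoryTableStore()
with temporary_tables(store, ["tb_site_curated", "tb_area_curated"]) as tmp:
    store.tables[tmp["tb_site_curated"]].append({"key": "site-1"})
    tables_during_run = sorted(store.tables)
tables_after_run = sorted(store.tables)

print(tables_during_run)
print(tables_after_run)
```

None of this code performs a business transformation. It only manages naming, creation, and cleanup, which is the part a workflow-managed storage layer could take over.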
Product Idea
Introduce a native workflow-managed temporary storage capability in CDF.
This could work as an internal temporary storage layer for workflows and transformations, where intermediate datasets can be written and read by different workflow steps, but their lifecycle is managed by the workflow execution itself.
Ideally, this temporary storage would be:
- Scoped to a workflow or workflow run
- Usable by transformation steps
- Automatically cleaned up after successful execution
- Optionally retained for a limited period so that users can review intermediate datasets for debugging
- Separated from RAW and from the final Data Modeling views
- Managed by CDF instead of customer-maintained cleanup logic
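The lifecycle rules in this list could be expressed as a small retention policy. The sketch below is purely hypothetical: TempStoragePolicy, its fields, the status strings, and the default retention periods are invented for illustration and do not correspond to any existing CDF feature.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TempStoragePolicy:
    """Hypothetical policy; none of these fields exist in CDF today."""

    scope: str = "workflow_run"
    retain_after_success: timedelta = timedelta(hours=24)
    retain_after_failure: timedelta = timedelta(days=7)


def should_purge(policy, run_status, finished_at, now):
    """Return True when a run's temporary data is due for deletion."""
    if run_status == "running":
        # Data is still in use by the workflow steps.
        return False
    if run_status == "completed":
        retention = policy.retain_after_success
    else:
        # Failed runs keep their data longer to support debugging.
        retention = policy.retain_after_failure
    return now - finished_at >= retention


policy = TempStoragePolicy()
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)

recent_success = should_purge(policy, "completed", now - timedelta(hours=2), now)
old_success = should_purge(policy, "completed", now - timedelta(hours=25), now)
old_failure = should_purge(policy, "failed", now - timedelta(hours=25), now)

print(recent_success, old_success, old_failure)
```

With these assumed defaults, a successful run keeps its intermediate data for a day for review, while a failed run keeps it for a week.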
Expected Benefits
This capability would make workflow-based data modeling pipelines much cleaner and easier to maintain.
The main benefits would be:
- Reduced amount of customer-managed code.
- Less duplication of transformation logic.
- Cleaner separation between raw data, temporary workflow state, and final modeled data.
- Reduced risk of stale intermediate data impacting future workflow runs.
- Easier debugging and monitoring of workflow execution.
- More straightforward workflow design for complex contextualization processes.
- Better support for multi-step data preparation before writing to Data Modeling views.
