Parked

Improve CDF Raw Store

Related products: Transformations and RAW

1 year ago
January 8, 2024
2 replies
83 views

ibrahim.alsyed
MVP

Current Observations:

When joining multiple tables in Cognite Raw, the transformation (query) takes a very long time
Cognite advises that Cognite Raw is a “key-value” store that is not optimized for running queries with multiple joins
Cognite also advises that it is currently not possible to index or partition data in Cognite Raw to support efficient querying
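To illustrate the general point about key-value stores (this is a toy model, not Cognite Raw's actual storage engine), the sketch below contrasts a lookup by row key with a filter on a non-key column. The table contents and the function names `get_by_key` and `scan_by_column` are made up for illustration.

```python
# Toy key-value table: row key -> columns.
# Illustrative only; it does not model Cognite Raw's real implementation.
orders = {
    "o1": {"asset_id": "a1"},
    "o2": {"asset_id": "a2"},
    "o3": {"asset_id": "a1"},
}


def get_by_key(table, key):
    """Direct lookup by row key: no other rows need to be read."""
    return table.get(key)


def scan_by_column(table, column, value):
    """Filter on a non-key column: with no index, every row is inspected."""
    inspected = 0
    matches = []
    for key, columns in table.items():
        inspected += 1
        if columns.get(column) == value:
            matches.append(key)
    return matches, inspected
```

A join on a non-key column repeats this kind of scan for the tables involved, which is one plausible reason multi-join queries get slow when no index or partitioning is available.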

Given these limitations:

Cognite Raw is not an ideal solution to store unfiltered data from source systems, in its original state
Cognite Raw therefore should not be considered as equivalent to a Data or Delta Lake “Raw Zone” or “Bronze Layer”

Other issues:

It is not easy to debug or identify error conditions for long-running transformation queries
- e.g., a long-running query (with multiple joins and, say, a million records in a few tables) can fail after ~10 hours if there are invalid node references
- all this after reading millions of records, only to fail with an error message that a node reference is invalid or that there is a duplicate key
Joining on a subset of data (using inner queries) to limit the number of records processed still results in a full table scan, i.e., a long-running transformation
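One way to surface these errors earlier might be a cheap pre-flight check on a sample or extract of the rows before starting the long-running transformation. The sketch below is plain Python under that assumption; the helper names `find_duplicate_keys` and `find_invalid_references` and the field names are hypothetical, and it does not call any Cognite API.

```python
from collections import Counter


def find_duplicate_keys(rows, key_field):
    """Return the key values that occur more than once, sorted."""
    counts = Counter(row[key_field] for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def find_invalid_references(rows, ref_field, known_ids):
    """Return rows whose reference points at an id that is not known."""
    known = set(known_ids)
    return [
        row
        for row in rows
        if row.get(ref_field) is not None and row[ref_field] not in known
    ]
```

For example, `find_invalid_references(rows, "parent_id", existing_node_ids)` would list the offending rows up front, instead of the job failing on them hours into the run.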

Workaround - Recommendations:

Execute long-running joins (transformation queries) on the source side (say, SAP), although this is not always possible
Consider piecemeal data ingestion and storage, e.g., by Site Code, storing data in multiple tables (one for each site) instead of storing data for all sites in one table
Use an intermediate data storage and processing layer (between the data source and Cognite Raw), e.g., Snowflake, Databricks or any other platform
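The per-site split could be sketched roughly as follows: rows are grouped by site code before ingestion, so that each group can be written to its own table. This is a minimal sketch; the function `partition_by_site`, the `site_code` field and the `work_orders` table prefix are hypothetical names, and the actual write to Cognite Raw is left out.

```python
from collections import defaultdict


def partition_by_site(rows, site_field="site_code", table_prefix="work_orders"):
    """Group rows into one bucket per site, keyed by a per-site table name."""
    tables = defaultdict(list)
    for row in rows:
        # Rows without a site code go to a catch-all bucket (an assumption).
        site = row.get(site_field) or "unknown"
        tables[f"{table_prefix}_{site}"].append(row)
    return dict(tables)
```

Each resulting bucket is smaller than the combined table, so a transformation that only needs one site reads fewer rows, at the cost of managing more tables.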

It is really not ideal for us to consider other data lakes and solutions to mitigate these constraints.

Jørgen Lund
Product Manager
1 year ago
January 16, 2024

Hi @ibrahim.alsyed - thanks for this feedback and input.

As you observe, both Raw and Transformations have certain limitations. We are working on making these services more robust and performant. We will also bring the input you provide here into our efforts to evolve the stage and transform layer of CDF going forward.

A few questions:

Would it be fair to say that transform jobs requiring multiple joins, leading to long run times and aggravating the problem of job failures, are the main pain points you are currently facing?
For the possible workarounds you mention: are you currently doing piecemeal data ingestion (e.g. per Site)? If so, what are the main pain points of having to take this approach?

Jørgen Lund
Product Manager
8 months ago
August 7, 2024

For Q4, we have support for reading content from Files (e.g. Parquet) in Transformations on the roadmap. This will provide an alternative to Raw, and to some extent be relevant to the feedback above.

More generally, we are evaluating the path forward for the services mentioned. We’ll inform you when we have more information to share. In the meantime, since the feedback here spans a broad area and a number of issues, I’ll mark it as parked for now, and re-open when we have updates.
