Improve CDF Raw Store

Related products: Transformations and RAW

Current Observations:

  • When joining multiple Tables (in Cognite Raw) the Transformation (query) is taking a  very long time
  • Cognite is advising that Cognite Raw is a “Key-Value” store that is not optimized for running queries with multiple joins
  • And that, currently, it is not possible to index or partition data in Cognite Raw to support efficient querying of data

 

Given these limitations:

  • Cognite Raw is not an ideal solution to store unfiltered data from source systems, in its original state
  • Cognite Raw therefore should not be considered as equivalent to a Data or Delta Lake “Raw Zone” or “Bronze Layer”

 

Other issues:

  • It is not easy to debug or identify error conditions for long running transformation queries
    • e.g., a long running query (with multiple joins, with say million records in few tables) can fail after ~10 hours if there are invalid node references
    • all this after reading millions of records – only to fail with an error message that a node reference is invalid or that there is a duplicate key
  • Joining on a subset of data (using inner queries) – to limit the number of records processed – still results in full table scan i.e., long running transformation

 

Workaround - Recommendations:

  • Execute long running joins (transformation queries) on source side (say, SAP) – Not always possible
  • Consider piecemeal data ingestion and storage, e.g., by Site Code .. to store data in multiple tables (one for each site), instead of storing data for all sites in one table
  • Using an intermediate data storage and processing layer (between Data Source and Cognite Raw) e.g., Snowflake, Databricks or any other platform

 

It is really not ideal for us to consider other data lakes and solutions to mitigate these constraints.

Hi @ibrahim.alsyed - thanks for this feedback and input.

As you observe, both Raw and Transformations have certain limitations. We are working on making these services more robust and performant. We will also bring the input you proivde here into our efforts to evolve the stage and transform layer of CDF going forward. 

A few questions:

  • Would it be fair to say that transform jobs requiring multiple joins, leading to long run times and aggravating the problem of job failures, are the main pain points you are currently facing? 
  • For the possible workarounds you mention: are you currently doign piecemeal data ingestion (e.g. per Site)? If so, what are the main pain points of having to take this approach?