Current Observations:
- When joining multiple tables in Cognite Raw, the transformation query takes a very long time to run
- Cognite advises that Cognite Raw is a “Key-Value” store that is not optimized for running queries with multiple joins
- Furthermore, it is currently not possible to index or partition data in Cognite Raw to support efficient querying
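For illustration, a minimal sketch of what this implies in practice (assuming the Cognite Python SDK; the database, table, and column names are hypothetical): because Raw exposes rows only as a key-value scan, any join must first read each participating table in full before matching can begin. The Spark engine behind transformations does the equivalent internally.

```python
# Sketch only: hypothetical names, assuming the Cognite Python SDK.
# Raw has no indexes, so a join effectively means "scan both tables completely".
from cognite.client import CogniteClient

client = CogniteClient()  # assumes authentication is configured via environment

# Each retrieve_dataframe call streams every row of the table;
# there is no server-side index or partition to narrow the read.
orders = client.raw.rows.retrieve_dataframe(
    db_name="sap_extract", table_name="orders", limit=-1
)
materials = client.raw.rows.retrieve_dataframe(
    db_name="sap_extract", table_name="materials", limit=-1
)

# The join itself only happens after both full scans have completed.
joined = orders.merge(materials, left_on="material_id", right_index=True)
```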
Given these limitations:
- Cognite Raw is not an ideal solution for storing unfiltered data from source systems in its original state
- Cognite Raw should therefore not be considered equivalent to a Data Lake or Delta Lake “Raw Zone” or “Bronze Layer”
Other issues:
- It is not easy to debug or identify error conditions for long-running transformation queries
- e.g., a long-running query (with multiple joins across tables containing millions of records) can fail after ~10 hours if there are invalid node references
- the query reads millions of records, only to fail at the very end with an error that a node reference is invalid or that there is a duplicate key
- Joining on a subset of data (using inner queries to limit the number of records processed) still results in a full table scan, i.e., a long-running transformation, as the sketch below illustrates
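To make the full-table-scan point concrete: the only server-side narrowing Raw offers is a lastUpdatedTime window and column projection; any predicate on row contents can only be evaluated after the rows have been read. A minimal sketch, again assuming the Cognite Python SDK and hypothetical names:

```python
# Sketch only: hypothetical names, assuming the Cognite Python SDK.
from cognite.client import CogniteClient

client = CogniteClient()

# Server-side pushdown is limited to a lastUpdatedTime window and a column
# projection; there is no way to push down a predicate like site_code = 'S01'.
rows = client.raw.rows.list(
    db_name="sap_extract",
    table_name="orders",
    columns=["site_code", "order_id", "material_id"],
    limit=-1,  # streams the entire table
)

# The actual filter can only run client-side, after the full scan -
# analogous to what an inner query in a transformation ends up doing on the Spark side.
site_rows = [r for r in rows if r.columns.get("site_code") == "S01"]
```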
Workarounds and Recommendations:
- Execute long-running joins (transformation queries) on the source side (e.g., SAP); this is not always possible
- Consider piecemeal data ingestion and storage, e.g., by site code: store data in multiple tables (one per site) instead of one table for all sites (see the sketch after this list)
- Use an intermediate data storage and processing layer between the data source and Cognite Raw, e.g., Snowflake, Databricks, or another platform
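A minimal sketch of the per-site splitting idea from the list above (hypothetical names; assuming the Cognite Python SDK and a source extract already available as a pandas DataFrame): each site's rows go into their own Raw table, so a downstream transformation only scans the site it needs.

```python
# Sketch only: hypothetical names, assuming the Cognite Python SDK and that the
# source extract is already available as a pandas DataFrame.
from cognite.client import CogniteClient
from cognite.client.data_classes import Row

client = CogniteClient()

def ingest_per_site(extract, db_name="sap_extract"):
    """Write one Raw table per site code instead of a single table for all sites."""
    for site_code, site_df in extract.groupby("site_code"):
        table_name = f"orders_{site_code}"  # e.g. orders_S01, orders_S02, ...
        rows = [
            Row(key=str(rec["order_id"]), columns=rec)
            for rec in site_df.to_dict(orient="records")
        ]
        # ensure_parent=True creates the database/table if they do not exist yet
        client.raw.rows.insert(
            db_name=db_name, table_name=table_name, row=rows, ensure_parent=True
        )
```

A transformation for a single site can then read only, say, sap_extract.orders_S01, which bounds the scan to that site's rows rather than the full dataset.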
That said, it is really not ideal for us to have to adopt other data lakes and platforms just to mitigate these constraints.