
Thus far, we have successfully loaded tables containing a few million records from Delta Lake to CDF RAW. However, when we attempt to load more than ten million records from Delta Lake to CDF, some of the data is skipped.


I've just found the cause of the records failing to load.


We generate a sha2 key in our ETL process from the Delta Lake table's primary keys. The sha2 key works fine for small tables, but once we go above 10 million records we start to get key collisions, which causes some rows to be skipped during loading.
Instead, I created a new index field with incremental numbers (1, 2, 3, and so on) and loaded the data using that index as the key. It now loads flawlessly, and I was able to load 60 million records into CDF with this approach.
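For context, a minimal PySpark sketch of that change is below. The table and column names (bronze.my_table, pk_col_1, pk_col_2) are placeholders for illustration, not our actual schema:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("bronze.my_table")  # placeholder source table

# Original approach: sha2 key over the concatenated primary-key columns.
df_hashed = df.withColumn(
    "key", F.sha2(F.concat_ws("||", "pk_col_1", "pk_col_2"), 256)
)

# Workaround: a sequential surrogate key (1, 2, 3, ...) generated with
# row_number() over an ordered window, so every row gets a unique key.
w = Window.orderBy("pk_col_1", "pk_col_2")
df_indexed = df.withColumn("key", F.row_number().over(w).cast("string"))
```

Note that row_number() over an unpartitioned window pulls all rows into a single partition; for very large tables, monotonically_increasing_id() avoids that shuffle at the cost of non-consecutive ids.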

Following up on this after conversations with the team: some of the loss is likely due to 429/50x errors that were not being wrapped and retried in the client notebook. Some logging enhancements were added, and these errors can now be seen. While adding the additional error handling and logging, we also discovered authentication/token issues when reading chunks from the source data (not from CDF, but from the data sources themselves). These reads likely also need to be wrapped and retried in a similar fashion to the API calls to Cognite.
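To illustrate the kind of wrapper being discussed, here is a hedged sketch of a generic retry helper with exponential backoff for 429/50x responses and token errors. The helper name, status-code handling, and exception attributes are assumptions for illustration, not the actual notebook code:

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def call_with_retry(fn, *args, max_attempts=5, base_delay=1.0, **kwargs):
    """Call fn with retries on retryable HTTP status codes or token errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # in practice, catch the client's specific error types
            status = getattr(exc, "code", None) or getattr(exc, "status_code", None)
            retryable = status in RETRYABLE_STATUS or status == 401
            if not retryable or attempt == max_attempts:
                raise
            # On 401, refresh the access token here before retrying (not shown).
            # Exponential backoff with jitter before the next attempt.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            print(f"Attempt {attempt} failed with status {status}; retrying in {delay:.1f}s")
            time.sleep(delay)
```

The same wrapper could be applied both to the Cognite API calls and to the chunked reads from the source systems, so transient failures in either place are retried instead of silently dropping data.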

