
Thus far, we have successfully loaded tables containing a few million records from Delta Lake to CDF RAW. However, when we attempt to load more than ten million records from Delta Lake to CDF, some of the data is skipped.


I've just found the cause of the records failing to load.


We generate a sha2 key in our ETL process from the Delta Lake table's primary keys. The sha2 key works fine for small tables, but once we go above 10 million records we start to get key collisions, which causes some rows to be skipped during loading.
Instead, I created a new index field with incremental numbers (1, 2, 3, and so on) and loaded the data using that index as the key. It now loads flawlessly, and I was able to load 60 million records into CDF with this approach.
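For context, a minimal PySpark sketch of that change is below. The table and column names (bronze.my_table, pk_col_1, pk_col_2) are placeholders for illustration, not our actual schema:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("bronze.my_table")  # placeholder source table

# Original approach: sha2 key over the concatenated primary-key columns.
df_hashed = df.withColumn(
    "key", F.sha2(F.concat_ws("||", "pk_col_1", "pk_col_2"), 256)
)

# Workaround: a sequential surrogate key (1, 2, 3, ...) generated with
# row_number() over an ordered window, so every row gets a unique key.
w = Window.orderBy("pk_col_1", "pk_col_2")
df_indexed = df.withColumn("key", F.row_number().over(w).cast("string"))
```

Note that row_number() over an unpartitioned window pulls all rows into a single partition; for very large tables, monotonically_increasing_id() avoids that shuffle at the cost of non-consecutive ids.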

Following up on this after conversations with the team: some of the loss is likely due to 429/50x errors that were not being wrapped and retried in the client notebook. Some logging enhancements were added, and these errors can now be seen. While adding the additional error handling and logging, we also discovered authentication/token issues when reading chunks from the source data (not from CDF, but from the data sources themselves). These reads likely also need to be wrapped and retried in a similar fashion to the API calls to Cognite.
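To illustrate the kind of wrapper being discussed, here is a hedged sketch of a generic retry helper with exponential backoff for 429/50x responses and token errors. The helper name, status-code handling, and exception attributes are assumptions for illustration, not the actual notebook code:

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def call_with_retry(fn, *args, max_attempts=5, base_delay=1.0, **kwargs):
    """Call fn with retries on retryable HTTP status codes or token errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # in practice, catch the client's specific error types
            status = getattr(exc, "code", None) or getattr(exc, "status_code", None)
            retryable = status in RETRYABLE_STATUS or status == 401
            if not retryable or attempt == max_attempts:
                raise
            # On 401, refresh the access token here before retrying (not shown).
            # Exponential backoff with jitter before the next attempt.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            print(f"Attempt {attempt} failed with status {status}; retrying in {delay:.1f}s")
            time.sleep(delay)
```

The same wrapper could be applied both to the Cognite API calls and to the chunked reads from the source systems, so transient failures in either place are retried instead of silently dropping data.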

