Thus far, we have successfully loaded tables containing a few million records from Delta Lake into CDF raw. However, when we attempt to load more than ten million records from Delta Lake to CDF, some of the data is being skipped.
I've just found the cause of the records failing to load.
We generate a sha2 key in our ETL process from the Delta Lake table's primary keys. The sha2 key works fine for small tables, but once we go above 10 million records we start to get key collisions, which causes some rows to be skipped during loading.
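For context, this is roughly how the key is built in our pipeline. This is a minimal PySpark sketch, assuming a Delta table path and primary key columns ("site_id", "asset_id") that are placeholders, not our actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, concat_ws, col

spark = SparkSession.builder.getOrCreate()

# Read the source Delta Lake table (path is illustrative only)
df = spark.read.format("delta").load("/delta/assets")

# Build the row key by hashing the concatenated primary key columns.
# "site_id" and "asset_id" stand in for the real primary key columns.
df_with_key = df.withColumn(
    "key",
    sha2(concat_ws("|", col("site_id"), col("asset_id")), 256)
)
```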
To work around this, I created a new index field with incremental numbers (1, 2, 3, and so on) and used it as the key instead. With the new index the data loads flawlessly; I was able to load 60 million records into CDF using this approach.
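A minimal sketch of how such an incremental index could be generated in PySpark, again with placeholder table path and ordering column (the real ordering column would come from your own schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Read the source Delta Lake table (path is illustrative only)
df = spark.read.format("delta").load("/delta/assets")

# Assign a sequential index (1, 2, 3, ...) to every row and use it as the
# row key instead of the sha2 hash. "asset_id" is a placeholder ordering column.
w = Window.orderBy(col("asset_id"))
df_indexed = df.withColumn("key", row_number().over(w).cast("string"))
```

Note that a global row_number without a partition column pulls all rows through a single partition, so for very large tables a function like monotonically_increasing_id may scale better if strictly consecutive numbering is not required.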