File Extractor Bugs and Behavioural Issues

Question

At Aker BP we are currently using the Cognite file extractor, and have some bugs and improvement points to report. Some of the bugs here are a big concern for us using the extractor going forward:

1. Missing‑as‑Deleted Flag Does Not Work as Documented

When missing-as-deleted: true is configured together with delete-behavior set to mode: hard, the extractor does not delete files that are absent from the source query.
This contradicts the documented behaviour and makes it impossible to rely on the extractor to keep target datasets clean.
Because the executable is closed‑source, the root cause is unclear.
From a developer who configures the file extractor:

This bug went undiscovered for some time because the previous extractor had to be replaced, and it behaves differently from the new file extractor on full loads. (This seems to be a bug in the new extractor.)

The previous, now deprecated, Documentum Extractor would delete every file that it did not upload during a given extraction. This also caused issues in the past, because it did not handle API errors properly: if the D2 API failed, it would delete all files in the ds_d2-bms dataset. But it also meant that any files no longer marked as FINAL in D2 were deleted from CDF, which is what we want.

The new extractor does not delete any files if the API fails, so I assumed that the error handling had been fixed rather than the deletion behavior changed. The documentation also indicates that this is the case.

Here is a sub-section of our Cognite File Extractor configuration:
```
missing-as-deleted: true
delete-behavior:
 mode: hard
```
As you can see, we have set the extractor to treat missing files as files that should be deleted from CDF. But when files are no longer returned by the query the extractor uses, nothing is deleted. It is hard to say why, because we don't have access to the Cognite File Extractor source code.

2. Extractor Handles API Errors Incorrectly / Inconsistently

The new extractor suppresses deletions whenever the source API has issues, but it also fails to delete files even when there is no error.
This behaviour is inconsistent with the previous extractor and with the current documentation.
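
To make the semantics we expect in points 1 and 2 concrete, here is a minimal Python sketch. It is not the extractor's implementation (which we cannot see). The function name `plan_deletions` and the use of plain sets of file IDs are assumptions made purely for illustration.

```python
from typing import Iterable, Optional, Set


def plan_deletions(
    source_ids: Optional[Iterable[str]],
    target_ids: Iterable[str],
) -> Set[str]:
    """Return the IDs that should be deleted from the target.

    source_ids is None when the source query failed. In that case
    nothing is deleted, so an API outage cannot wipe the dataset.
    """
    if source_ids is None:
        # Source API error: suppress all deletions for this run.
        return set()
    # Successful query: anything present in the target but absent from
    # the source result is treated as deleted (missing-as-deleted).
    return set(target_ids) - set(source_ids)


# Hypothetical IDs, for illustration only.
in_cdf = {"doc-1", "doc-2", "doc-3"}
print(plan_deletions({"doc-1", "doc-3"}, in_cdf))  # {'doc-2'}
print(plan_deletions(None, in_cdf))                # set()
```

In other words, a failed query and a successful query that simply returns fewer files should lead to different outcomes. What we observe is that both result in no deletions.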

3. Full‑load Behaviour Causes Extended Execution Time

The extractor performs full replacements instead of incremental updates, which scales poorly as file counts grow.
This leads to significant runtime increases that break downstream timing assumptions.
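
As a rough illustration of the incremental behaviour we have in mind, the sketch below compares a last-modified value per file against what was recorded on the previous run and selects only new or changed files. The function name `files_to_upload`, the dictionary-based state and the timestamp strings are all assumptions. We do not know how the extractor tracks state internally.

```python
from typing import Dict, List


def files_to_upload(
    source_modified: Dict[str, str],
    last_synced: Dict[str, str],
) -> List[str]:
    """Return IDs that are new or whose modified value has changed."""
    return sorted(
        file_id
        for file_id, modified in source_modified.items()
        # A missing entry in last_synced yields None, so new files match too.
        if last_synced.get(file_id) != modified
    )


# Hypothetical data, for illustration only.
source = {
    "doc-1": "2024-01-01T00:00:00Z",
    "doc-2": "2024-02-01T00:00:00Z",
    "doc-3": "2024-03-01T00:00:00Z",
}
previous_run = {
    "doc-1": "2024-01-01T00:00:00Z",  # unchanged, skipped
    "doc-2": "2024-01-15T00:00:00Z",  # changed since last run
}
print(files_to_upload(source, previous_run))  # ['doc-2', 'doc-3']
```

With a comparison like this, runtime would scale with the number of changed files rather than with the total file count.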

4. Extractor Design Prevents Synchronized Multi‑Step Pipelines

Because the extractor runs independently and cannot signal completion to downstream steps, pipelines relying on metadata post‑processing (contextualization) become unstable.
Files may exist in partially processed states when the next extractor starts, causing downstream systems to retrieve invalid or incomplete metadata.

We would like to discuss a plan of action for these issues, since we don’t have access to the extractor source code.

Mithila Jayalath · Answer

@Haakon We have created a support ticket for this issue. The support team will reach out to you.
