Observed behavior
When a CDF Transformation produces rows with duplicate `externalId` values, the behavior depends on how CDF partitions the data across workers:
- Duplicates within the same API request (same batch) → the API raises an error → the transformation fails visibly ✅
- Duplicates split across multiple API requests (different CDF partitions/workers) → each request succeeds individually → the transformation completes with status "success", but which version of the node was actually written to the knowledge graph is unknown and non-deterministic ❌
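The two cases above can be illustrated with a small simulation. This is a hedged sketch of the observed batching behavior, not actual CDF code: `apply_batch`, the in-memory `store`, and the row values are all hypothetical, standing in for one API request and the knowledge graph respectively.

```python
# Toy model: one API request rejects intra-batch duplicate externalIds,
# but requests are applied independently, so cross-batch duplicates
# silently overwrite each other (hypothetical names, not the CDF API).

def apply_batch(store: dict, batch: list[tuple[str, str]]) -> None:
    """Apply one simulated API request to the store."""
    ids = [ext_id for ext_id, _ in batch]
    if len(ids) != len(set(ids)):
        # Case 1: duplicates in the same request -> visible error.
        raise ValueError("duplicate externalId within request")
    for ext_id, value in batch:
        store[ext_id] = value  # last write wins at the store level

rows = [("pump-1", "v1"), ("pump-1", "v2")]

# Case 1: both rows land in the same request -> the run fails visibly.
same_batch_failed = False
try:
    apply_batch({}, rows)
except ValueError:
    same_batch_failed = True

# Case 2: the rows are split across two requests -> both succeed,
# and the surviving version depends entirely on flush order.
store = {}
apply_batch(store, rows[:1])
apply_batch(store, rows[1:])
# Here "v2" survives; with the opposite flush order, "v1" would.
```

With the opposite dispatch order, the second case yields `"v1"` with the same "success" status, which is exactly the non-determinism described above.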
Problem
The second case is the dangerous one:
- Silent and invisible — the run reports success, no alert is triggered, no engineer investigates. The data in the knowledge graph may be incomplete or wrong with no trace.
- Non-deterministic — which duplicate "wins" depends entirely on which CDF worker flushes first. Two identical runs on the same input can produce different results.
- No user control — there is no way to express intent: "fail if duplicates exist", "always keep the latest", "deduplicate before write". The behavior is undefined and undocumented at the transformation level.
Feature Request
Add a conflict resolution option at the transformation level for duplicate externalId within a single run:
| Option | Behavior |
| --- | --- |
| `fail` | Abort the run and report an error if any duplicate `externalId` is detected across all batches |
| `keep_last` | The last row processed wins |
| `deduplicate_before_write` | CDF deduplicates globally before dispatching to the API |
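The intended semantics of these options can be sketched as a client-side resolution step run over all rows before any API dispatch. This is a minimal illustration, not a proposed implementation; the `resolve` function and `mode` strings are hypothetical, and `keep_last` and `deduplicate_before_write` collapse to the same last-row-wins rule in this simplified form.

```python
# Hypothetical conflict-resolution step applied globally before dispatch.

def resolve(rows: list[tuple[str, str]], mode: str) -> list[tuple[str, str]]:
    """Collapse duplicate externalIds according to the requested mode."""
    seen: dict[str, str] = {}
    for ext_id, value in rows:
        if ext_id in seen and mode == "fail":
            # fail: abort the whole run on the first duplicate.
            raise ValueError(f"duplicate externalId: {ext_id}")
        # keep_last / deduplicate_before_write: last row wins.
        seen[ext_id] = value
    return list(seen.items())

rows = [("pump-1", "v1"), ("valve-7", "v1"), ("pump-1", "v2")]
resolve(rows, "keep_last")  # [("pump-1", "v2"), ("valve-7", "v1")]
# resolve(rows, "fail")     # raises ValueError
```

The key property is that resolution happens over the full row set, before partitioning, so the outcome no longer depends on which worker flushes first.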
