
Transformations — Non-deterministic silent upsert when duplicate externalIds are split across workers: request for configurable deduplication strategy

Related products: Transformations
  • May 15, 2026
  • 1 reply
  • 12 views

Oussama ALLALI
Seasoned ⭐️⭐️

Observed behavior

When a CDF Transformation produces rows with duplicate externalId values, the outcome depends on how CDF partitions the data across workers:

  • Duplicates within the same API request (same batch) → the API raises an error → the transformation fails visibly ✅
  • Duplicates split across multiple API requests (different CDF partitions/workers) → each request succeeds individually → the transformation completes with status "success", but which version of the node was actually written to the knowledge graph is unknown and non-deterministic ❌
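The two cases above can be reproduced with a toy model. This is only a sketch under stated assumptions: `upsert_batch`, `run`, and the in-memory `store` are hypothetical stand-ins, not the CDF API, and "worker order" is modelled simply as the order in which batches are applied.

```python
from itertools import permutations


class DuplicateInBatchError(Exception):
    """Raised when one simulated request contains the same externalId twice."""


def upsert_batch(store, batch):
    # Mimics the reported behavior: duplicates inside a single request are rejected.
    ids = [row["externalId"] for row in batch]
    if len(ids) != len(set(ids)):
        raise DuplicateInBatchError(f"duplicate externalId in request: {ids}")
    for row in batch:
        store[row["externalId"]] = row  # plain upsert: a later write overwrites


def run(batches):
    # Applies the batches in the given order, standing in for worker completion order.
    store = {}
    for batch in batches:
        upsert_batch(store, batch)
    return store


# The same externalId ("pump-1") lands in two different batches.
rows_a = [{"externalId": "pump-1", "name": "old"}, {"externalId": "pump-2", "name": "x"}]
rows_b = [{"externalId": "pump-1", "name": "new"}]

# Every completion order "succeeds", but the stored value differs between orders.
outcomes = {run(order)["pump-1"]["name"] for order in permutations([rows_a, rows_b])}
print(outcomes)
```

Both orders finish without an error, yet the value stored for `pump-1` depends on which batch was applied last.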

Problem

The second case is the dangerous one:

  1. Silent and invisible — the run reports success, no alert is triggered, no engineer investigates. The data in the knowledge graph may be incomplete or wrong with no trace.
  2. Non-deterministic — which duplicate "wins" depends entirely on which CDF worker flushes first. Two identical runs on the same input can produce different results.
  3. No user control — there is no way to express intent: "fail if duplicates exist", "always keep the latest", "deduplicate before write". The behavior is undefined and undocumented at the transformation level.

Feature Request

Add a conflict resolution option at the transformation level for duplicate externalId values within a single run:

| Option | Behavior |
| --- | --- |
| fail | Abort the run and report an error if any duplicate externalId is detected across all batches |
| keep_last | Last row processed wins |
| deduplicate_before_write | CDF deduplicates globally before dispatching to the API |
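As a rough illustration of what these options could mean if applied globally before rows are split into batches, here is a minimal sketch. The function `resolve_duplicates` is hypothetical and not part of any CDF SDK. "Last row" is taken to mean input order, and since the request does not say which row `deduplicate_before_write` should keep, the sketch arbitrarily keeps the first occurrence to make the difference visible.

```python
from collections import Counter


def resolve_duplicates(rows, strategy):
    """Return at most one row per externalId, or raise when strategy is 'fail'."""
    if strategy == "fail":
        counts = Counter(row["externalId"] for row in rows)
        duplicated = sorted(ext_id for ext_id, n in counts.items() if n > 1)
        if duplicated:
            raise ValueError(f"duplicate externalId values: {duplicated}")
        return list(rows)

    if strategy == "keep_last":
        kept = {}
        for row in rows:
            kept[row["externalId"]] = row  # later rows overwrite earlier ones
        return list(kept.values())

    if strategy == "deduplicate_before_write":
        kept = {}
        for row in rows:
            # Assumption for this sketch only: the first occurrence wins.
            kept.setdefault(row["externalId"], row)
        return list(kept.values())

    raise ValueError(f"unknown strategy: {strategy!r}")


rows = [
    {"externalId": "pump-1", "name": "old"},
    {"externalId": "pump-2", "name": "x"},
    {"externalId": "pump-1", "name": "new"},
]
last = resolve_duplicates(rows, "keep_last")
first = resolve_duplicates(rows, "deduplicate_before_write")
```

Because the check runs over the full row set rather than per request, the result no longer depends on how the rows are later partitioned.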

1 reply

Peter Arwanitis
Practitioner ⭐️⭐️⭐️
  • May 18, 2026

Hi @Oussama ALLALI, thank you for sharing this feature request on Cognite Hub!

We have started an internal discussion about the three options. Can you help us detail the requirements?

  1. Is `keep_last` a special case of `deduplicate_before_write`?
  2. Both seem to require additional information to decide which duplicate to keep, such as a timestamp column or another sortable column with unique values. Is that right?
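To illustrate the second point, here is a small sketch of why an explicit ordering column matters. The column name `updated_at`, the helper `keep_latest`, and the sample rows are all hypothetical and only meant to frame the question.

```python
from collections import defaultdict


def keep_latest(rows, order_by):
    """Keep, per externalId, the row with the largest value in `order_by`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["externalId"]].append(row)

    result = []
    for ext_id in sorted(groups):
        top = max(row[order_by] for row in groups[ext_id])
        winners = [row for row in groups[ext_id] if row[order_by] == top]
        # Distinct rows sharing the top sort value would make the winner
        # depend on row order again, so treat that as an error.
        if any(w != winners[0] for w in winners):
            raise ValueError(f"tie on {order_by!r} for externalId {ext_id!r}")
        result.append(winners[0])
    return result


# Timestamps are ISO 8601 strings in one fixed format, so string comparison
# matches chronological order.
rows = [
    {"externalId": "pump-1", "name": "new", "updated_at": "2026-05-02T00:00:00Z"},
    {"externalId": "pump-1", "name": "old", "updated_at": "2026-05-01T00:00:00Z"},
]
forward = keep_latest(rows, "updated_at")
backward = keep_latest(list(reversed(rows)), "updated_at")
```

With a sortable column, the kept row is the same regardless of the order in which rows arrive. Without one, "last" only has meaning relative to processing order.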

Could you walk us through a real case, with some example data and which row you would expect to be kept?

That should make the request easier to understand and discuss.

Thank you

(=PA=)