New

Transformations — Non-deterministic silent upsert when duplicate externalIds are split across workers: request for configurable deduplication strategy

Related products: Transformations

May 15, 2026

Oussama ALLALI
Seasoned ⭐️⭐️

Observed behavior

When a CDF Transformation produces rows with duplicate externalId values, the behavior depends on how CDF partitions the data across workers:

Duplicates within the same API request (same batch) → the API raises an error → the transformation fails visibly ✅
Duplicates split across multiple API requests (different CDF partitions/workers) → each request succeeds individually → the transformation completes with status "success", but which version of the node was actually written to the knowledge graph is unknown and non-deterministic ❌
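
To make the two cases concrete, here is a minimal sketch in plain Python. It does not call CDF; the helpers `split_into_batches` and `duplicate_ids`, the batch size, and the sample rows are all hypothetical, and the per-batch check only stands in for whatever validation the API performs on a single request.

```python
from collections import Counter


def split_into_batches(rows, batch_size):
    """Stand-in for partitioning: cut the rows into consecutive batches."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def duplicate_ids(rows):
    """Return the externalId values that occur more than once in `rows`."""
    counts = Counter(row["externalId"] for row in rows)
    return sorted(eid for eid, n in counts.items() if n > 1)


# Hypothetical output of a transformation: "pump-1" appears twice.
rows = [
    {"externalId": "pump-1", "name": "A"},
    {"externalId": "pump-2", "name": "B"},
    {"externalId": "pump-1", "name": "C"},
    {"externalId": "pump-3", "name": "D"},
]

batches = split_into_batches(rows, batch_size=2)

# A check that only sees one batch at a time finds nothing to reject...
per_batch = [duplicate_ids(batch) for batch in batches]

# ...while a check over the whole run sees the duplicate.
global_dupes = duplicate_ids(rows)

print(per_batch)     # one (empty) list per batch
print(global_dupes)  # the duplicated externalId values across the run
```

With a batch size of 4 the same rows land in one batch and the per-batch check catches the duplicate, which is the first case above.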

Problem

The second case is the dangerous one:

Silent and invisible — the run reports success, no alert is triggered, no engineer investigates. The data in the knowledge graph may be incomplete or wrong with no trace.
Non-deterministic — which duplicate "wins" depends entirely on which CDF worker flushes first. Two identical runs on the same input can produce different results.
No user control — there is no way to express intent: "fail if duplicates exist", "always keep the latest", "deduplicate before write". The behavior is undefined and undocumented at the transformation level.
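
The non-determinism can be illustrated with a small simulation. This is only a sketch under the assumption that each request behaves as a last-writer-wins upsert keyed on externalId; `apply_upserts` and the sample batches are hypothetical and nothing here talks to CDF.

```python
from itertools import permutations


def apply_upserts(batches, flush_order):
    """Apply the batches in the given order; a later write overwrites an earlier one."""
    store = {}
    for index in flush_order:
        for row in batches[index]:
            store[row["externalId"]] = row
    return store


# Two batches, as if handled by two different workers, sharing one externalId.
batches = [
    [{"externalId": "pump-1", "name": "old"}],
    [{"externalId": "pump-1", "name": "new"}],
]

# Try every possible flush order and collect the value that ends up stored.
outcomes = {
    apply_upserts(batches, order)["pump-1"]["name"]
    for order in permutations(range(len(batches)))
}

print(sorted(outcomes))
```

Both values appear among the outcomes, so the stored result depends only on which batch is flushed last.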

Feature Request

Add a conflict resolution option at the transformation level for duplicate externalId values within a single run:

Option	Behavior
fail	Abort the run and report an error if any duplicate externalId is detected across all batches
keep_last	Last row processed wins
deduplicate_before_write	CDF deduplicates globally before dispatching to the API
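
As a rough illustration of the intended semantics, the sketch below applies `fail` and `keep_last` to the full row set before any batching. It is not an existing CDF option or API: `resolve_duplicates` and `DuplicateExternalIdError` are hypothetical names, and "last" is assumed to mean last in input order. `deduplicate_before_write` is left out because the rule for choosing the surviving row is not specified here.

```python
class DuplicateExternalIdError(ValueError):
    """Raised by the hypothetical `fail` strategy."""


def resolve_duplicates(rows, strategy):
    """Resolve duplicate externalId values over the whole run, before batching."""
    if strategy not in {"fail", "keep_last"}:
        raise ValueError(f"unknown strategy: {strategy!r}")

    last_by_id = {}      # externalId -> last row seen with that id
    duplicates = set()   # externalId values seen more than once
    for row in rows:
        eid = row["externalId"]
        if eid in last_by_id:
            duplicates.add(eid)
        last_by_id[eid] = row

    if strategy == "fail" and duplicates:
        raise DuplicateExternalIdError(
            f"duplicate externalId values: {sorted(duplicates)}"
        )
    return list(last_by_id.values())


rows = [
    {"externalId": "a", "value": 1},
    {"externalId": "b", "value": 1},
    {"externalId": "a", "value": 2},
]

print(resolve_duplicates(rows, "keep_last"))

try:
    resolve_duplicates(rows, "fail")
except DuplicateExternalIdError as exc:
    print("run aborted:", exc)
```

Because the check runs over all rows at once, the outcome no longer depends on how the rows are later split into requests.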

Peter Arwanitis
Practitioner ⭐️⭐️⭐️
May 18, 2026

Hi @Oussama ALLALI, thank you for sharing this feature request on Cognite Hub!

We have started an internal discussion on the three options. Can you help us detail the requirements?

`keep_last` seems to be a special case of `deduplicate_before_write`?
Both seem to require additional information to determine which duplicate to keep, for example a timestamp column or another sortable column with unique values?
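
One way to picture the ordering-column idea is the sketch below, which keeps the row with the greatest value in a sort column for each externalId. The function `keep_latest` and the `updatedAt` column are hypothetical, and the values in the sort column are assumed to be unique per externalId; ties would need an extra tie-breaker.

```python
def keep_latest(rows, order_key):
    """Keep, per externalId, the row with the greatest value in `order_key`."""
    best = {}
    for row in rows:
        eid = row["externalId"]
        if eid not in best or row[order_key] > best[eid][order_key]:
            best[eid] = row
    # Sort the result so it does not depend on the input order.
    return sorted(best.values(), key=lambda r: r["externalId"])


# Hypothetical rows; ISO 8601 timestamps in one format compare correctly as strings.
rows = [
    {"externalId": "pump-1", "name": "old", "updatedAt": "2026-05-01T08:00:00Z"},
    {"externalId": "pump-2", "name": "only", "updatedAt": "2026-05-02T08:00:00Z"},
    {"externalId": "pump-1", "name": "new", "updatedAt": "2026-05-10T08:00:00Z"},
]

forward = keep_latest(rows, "updatedAt")
backward = keep_latest(list(reversed(rows)), "updatedAt")

print(forward == backward)
```

With an explicit sort column the surviving row is the same whatever order the rows arrive in, which is what separates this from a "last row processed" rule.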

Can you explain, based on your real case and with some example data, which duplicate should be kept?

Hopefully that makes it easier to understand and discuss.

Thank you

(=PA=)
