Gathering Interest

Transformations — Non-deterministic silent upsert when duplicate externalIds are split across workers: request for configurable deduplication strategy

Related products: Transformations
  • May 15, 2026
  • 2 replies
  • 24 views

Oussama ALLALI
Seasoned ⭐️⭐️

Observed behavior

When a CDF Transformation produces rows with duplicate externalId values, the outcome depends on how CDF partitions the data across workers:

  • Duplicates within the same API request (same batch) → the API raises an error → the transformation fails visibly ✅
  • Duplicates split across multiple API requests (different CDF partitions/workers) → each request succeeds individually → the transformation completes with status "success", but which version of the node was actually written to the knowledge graph is unknown and non-deterministic ❌
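
To make the two cases concrete, here is a minimal sketch in Python. It does not call the real CDF API: `upsert_batch`, `run` and `DuplicateInBatchError` are hypothetical stand-ins that only mimic the behavior described above (reject duplicates inside one request, otherwise overwrite by externalId).

```python
from itertools import permutations


class DuplicateInBatchError(Exception):
    """Raised by the toy API when one request repeats an externalId."""


def upsert_batch(store, batch):
    """Toy upsert: reject a batch with repeated externalIds, else overwrite by externalId."""
    ids = [row["externalId"] for row in batch]
    if len(ids) != len(set(ids)):
        raise DuplicateInBatchError("duplicate externalId within one request")
    for row in batch:
        store[row["externalId"]] = row  # a later write silently replaces an earlier one


def run(batches):
    """Apply the batches in the given order and return the final store."""
    store = {}
    for batch in batches:
        upsert_batch(store, batch)
    return store


row_a = {"externalId": "pump-1", "name": "version A"}
row_b = {"externalId": "pump-1", "name": "version B"}

# Case 1: both duplicates land in the same request, so the failure is visible.
try:
    run([[row_a, row_b]])
    same_batch_failed = False
except DuplicateInBatchError:
    same_batch_failed = True

# Case 2: the duplicates land in different requests. Every request succeeds,
# and the surviving version depends only on which batch is flushed last.
outcomes = {
    run(list(order))["pump-1"]["name"]
    for order in permutations([[row_a], [row_b]])
}
```

With the same two input rows, `outcomes` contains both versions: the result depends on flush order alone, and no error is ever raised.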

Problem

The second case is the dangerous one:

  1. Silent and invisible — the run reports success, no alert is triggered, no engineer investigates. The data in the knowledge graph may be incomplete or wrong with no trace.
  2. Non-deterministic — which duplicate "wins" depends entirely on which CDF worker flushes first. Two identical runs on the same input can produce different results.
  3. No user control — there is no way to express intent: "fail if duplicates exist", "always keep the latest", "deduplicate before write". The behavior is undefined and undocumented at the transformation level.

Feature Request

Add a conflict resolution option at the transformation level for duplicate externalId values within a single run:

| Option | Behavior |
| --- | --- |
| fail | Abort the run and report an error if any duplicate externalId is detected across all batches |
| keep_last | Last row processed wins |
| deduplicate_before_write | CDF deduplicates globally before dispatching to the API |
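
As a rough illustration of what these options could mean, here is a sketch in Python. Everything in it is hypothetical: `resolve_duplicates` is not an existing CDF function, and because the request does not say which row `deduplicate_before_write` should keep, the sketch assumes a caller-supplied `order_by` function and keeps the row with the highest value.

```python
from collections import Counter


def resolve_duplicates(rows, strategy, key="externalId", order_by=None):
    """Return at most one row per key, according to a hypothetical strategy name."""
    if strategy == "fail":
        # Abort if any key occurs more than once anywhere in the input.
        counts = Counter(row[key] for row in rows)
        dupes = sorted(k for k, n in counts.items() if n > 1)
        if dupes:
            raise ValueError(f"duplicate {key} values: {dupes}")
        return list(rows)
    if strategy == "keep_last":
        # The last row in input order wins for each key.
        latest = {}
        for row in rows:
            latest[row[key]] = row
        return list(latest.values())
    if strategy == "deduplicate_before_write":
        # Assumption: an explicit ordering decides the winner, not arrival order.
        if order_by is None:
            raise ValueError("deduplicate_before_write needs an order_by function")
        best = {}
        for row in rows:
            k = row[key]
            if k not in best or order_by(row) > order_by(best[k]):
                best[k] = row
        return list(best.values())
    raise ValueError(f"unknown strategy: {strategy!r}")


rows = [
    {"externalId": "pump-1", "version": 3},
    {"externalId": "valve-7", "version": 1},
    {"externalId": "pump-1", "version": 2},
]
kept_last = resolve_duplicates(rows, "keep_last")
kept_highest = resolve_duplicates(
    rows, "deduplicate_before_write", order_by=lambda r: r["version"]
)
```

On this input the two strategies disagree for `pump-1`: `keep_last` keeps version 2 because it arrives last, while the ordered variant keeps version 3.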


Peter Arwanitis
Practitioner ⭐️⭐️⭐️
  • May 18, 2026

Hi @Oussama ALLALI, thank you for sharing this feature request on Cognite Hub!

We have started an internal discussion about the three options. Can you help us detail the requirements?

  1. `keep_last` seems to be a special case of `deduplicate_before_write`?
  2. Both require additional information to decide which duplicate to keep, such as a timestamp column or another sortable column with unique values?

Based on your real case, can you share some example data and explain which rows should be kept?

Hopefully that makes it easier to understand and discuss.

Thank you

(=PA=)

  • Practitioner ⭐️⭐️⭐️
  • May 19, 2026
Hi,

Unfortunately, adding a deduplication step has a significant performance cost, which would be a waste of resources in most cases.
It is already possible to deduplicate in SQL today, with something like:
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY key ORDER BY lastUpdatedTime DESC) AS rn
    FROM table
) t
WHERE rn = 1

Here `table` can be either a table directly or a CTE containing your result.
We encourage you to apply this kind of deduplication case by case, when it is needed, rather than systematically.
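
For anyone who wants to try the pattern locally, here is a small sketch that runs it from Python against an in-memory SQLite database. This is an assumption for illustration only: SQLite is not the engine Transformations uses, and the table `source_rows`, its columns and its data are made up. The sketch adds two things to the query above: a check that lists the duplicated externalIds, and a second sort column (`name`) so that rows with equal `lastUpdatedTime` still produce a stable winner, since `ROW_NUMBER()` is free to pick either row when the ordering has ties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_rows (externalId TEXT, name TEXT, lastUpdatedTime INTEGER)"
)
conn.executemany(
    "INSERT INTO source_rows VALUES (?, ?, ?)",
    [
        ("pump-1", "old", 100),
        ("pump-1", "new", 200),
        ("pump-1", "new-tie", 200),  # same timestamp as the row above
        ("valve-7", "only", 150),
    ],
)

# Check: which externalIds occur more than once, and how often?
duplicates = conn.execute(
    "SELECT externalId, COUNT(*) FROM source_rows "
    "GROUP BY externalId HAVING COUNT(*) > 1"
).fetchall()

# Keep one row per externalId. The extra `name DESC` term breaks timestamp ties.
deduplicated = conn.execute(
    """
    SELECT externalId, name, lastUpdatedTime
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY externalId
                   ORDER BY lastUpdatedTime DESC, name DESC
               ) AS rn
        FROM source_rows
    ) t
    WHERE rn = 1
    ORDER BY externalId
    """
).fetchall()
conn.close()
```

The `duplicates` query can also serve as a guard: if it returns any rows, you can stop and investigate rather than let the write proceed.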

We are working on updating the documentation around this.