Gathering Interest

Transformations — Non-deterministic silent upsert when duplicate externalIds are split across workers: request for configurable deduplication strategy

Related products: Transformations

May 15, 2026

Oussama ALLALI
Seasoned ⭐️⭐️

Observed behavior

When a CDF Transformation produces rows with duplicate externalId values, the behavior depends on how CDF partitions the data across workers:

Duplicates within the same API request (same batch) → the API raises an error → the transformation fails visibly ✅
Duplicates split across multiple API requests (different CDF partitions/workers) → each request succeeds individually → the transformation completes with status "success", but which version of the node was actually written to the knowledge graph is unknown and non-deterministic ❌
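The two cases can be reproduced with a toy model. This is only a sketch of the behavior described above, not CDF's actual implementation: `write_batch`, `run`, the batch sizes, and the sample rows are all hypothetical names and values chosen for illustration.

```python
import random


def write_batch(store, batch):
    """Toy model of one API request: reject in-batch duplicates, else upsert."""
    ids = [row["externalId"] for row in batch]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate externalId within one request")
    for row in batch:
        # Upsert: a later write silently overwrites an earlier one.
        store[row["externalId"]] = row


def run(rows, batch_size, flush_order=None):
    """Split rows into batches and write them in the given (or a random) order."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    if flush_order is None:
        # Assumption: workers flush in an unpredictable order.
        flush_order = random.sample(range(len(batches)), len(batches))
    store = {}
    for index in flush_order:
        write_batch(store, batches[index])
    return store


rows = [
    {"externalId": "pump-1", "name": "version A"},
    {"externalId": "valve-7", "name": "only version"},
    {"externalId": "pump-1", "name": "version B"},
]

# Case 1: both "pump-1" rows land in one batch, so the write fails visibly.
try:
    run(rows, batch_size=3)
except ValueError as error:
    print("failed:", error)

# Case 2: the duplicates are split across two batches; the winner depends on flush order.
print(run(rows, batch_size=2, flush_order=[0, 1])["pump-1"]["name"])
print(run(rows, batch_size=2, flush_order=[1, 0])["pump-1"]["name"])
```

With the same input, the two flush orders in case 2 leave different versions of `pump-1` in the store, and neither run reports an error.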

Problem

The second case is the dangerous one:

Silent and invisible — the run reports success, no alert is triggered, no engineer investigates. The data in the knowledge graph may be incomplete or wrong with no trace.
Non-deterministic — which duplicate "wins" depends entirely on which CDF worker flushes first. Two identical runs on the same input can produce different results.
No user control — there is no way to express intent: "fail if duplicates exist", "always keep the latest", "deduplicate before write". The behavior is undefined and undocumented at the transformation level.

Feature Request

Add a conflict resolution option at the transformation level for duplicate externalId values within a single run:

Option	Behavior
fail	Abort the run and report an error if any duplicate externalId is detected across all batches
keep_last	Last row processed wins
deduplicate_before_write	CDF deduplicates globally before dispatching to the API
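To make the intended semantics concrete, here is a rough sketch of the three options as a step applied to all rows before any batch is dispatched. It rests on assumptions: `resolve_duplicates` and `order_key` are hypothetical names, `keep_last` is read as "last row in input order", and since the table does not say which row `deduplicate_before_write` keeps, the sketch assumes the row with the highest value in a caller-supplied column.

```python
from collections import Counter


def resolve_duplicates(rows, strategy, order_key=None):
    """Apply one conflict resolution strategy to the full result set."""
    if strategy == "fail":
        counts = Counter(row["externalId"] for row in rows)
        duplicated = sorted(key for key, n in counts.items() if n > 1)
        if duplicated:
            raise ValueError(f"duplicate externalId values: {duplicated}")
        return list(rows)

    if strategy == "keep_last":
        # Assumption: "last" means last in input order.
        latest = {}
        for row in rows:
            latest[row["externalId"]] = row
        return list(latest.values())

    if strategy == "deduplicate_before_write":
        # Assumption: keep the row with the highest order_key value.
        if order_key is None:
            raise ValueError("order_key is required for this strategy")
        best = {}
        for row in rows:
            current = best.get(row["externalId"])
            if current is None or row[order_key] > current[order_key]:
                best[row["externalId"]] = row
        return list(best.values())

    raise ValueError(f"unknown strategy: {strategy}")


rows = [
    {"externalId": "a", "value": 1, "ts": 20},
    {"externalId": "b", "value": 2, "ts": 5},
    {"externalId": "a", "value": 3, "ts": 10},
]

print(resolve_duplicates(rows, "keep_last"))
print(resolve_duplicates(rows, "deduplicate_before_write", order_key="ts"))
```

In this sample the two strategies disagree on `a`: input order keeps `value` 3, while ordering by `ts` keeps `value` 1. That is why the choice needs to be explicit.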

Peter Arwanitis
Practitioner ⭐️⭐️⭐️
May 18, 2026

Hi @Oussama ALLALI, thank you for sharing the feature request on Cognite Hub!

We have started an internal discussion about the three options. Can you help us detail the requirements?

`keep_last` seems to be a special case of `deduplicate_before_write`?
Both require additional information to clarify which duplicate to keep, such as a timestamp column or another sortable column with unique values?

Based on your real case, can you share some example data and indicate which rows to keep?

Hopefully that makes it easier to understand and discuss.

Thank you

(=PA=)

Jacob Eliat-Eliat
Practitioner ⭐️⭐️⭐️
May 19, 2026

Hi,

Unfortunately, adding a deduplication step has a significant performance cost, which would be a waste of resources in most cases.
It is already possible to deduplicate in SQL today, for example:

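SELECT *
FROM (
    SELECT
        *,
        -- Number the rows within each key, newest first.
        -- Rows that tie on lastUpdatedTime are still ordered arbitrarily.
        ROW_NUMBER() OVER (PARTITION BY key ORDER BY lastUpdatedTime DESC) AS rn
    FROM table
) t
-- Keep only the newest row per key (the result also contains the rn column)
WHERE rn = 1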

Here, table can be either an actual table or a CTE containing your result, and key is the column that identifies duplicates (for example externalId).
We encourage you to apply this kind of deduplication on a case-by-case basis, when it is needed, rather than systematically.
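To try the pattern locally before putting it in a transformation, the sketch below runs it from Python using sqlite3 as a stand-in SQL engine (window functions need SQLite 3.25 or newer). The `incoming` table, its columns, and the sample rows are made up for illustration. The first query is an extra check that only lists duplicated keys, which may be useful when the goal is to detect duplicates rather than resolve them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incoming (externalId TEXT, value TEXT, lastUpdatedTime INTEGER)"
)
conn.executemany(
    "INSERT INTO incoming VALUES (?, ?, ?)",
    [
        ("pump-1", "version A", 100),
        ("valve-7", "only version", 150),
        ("pump-1", "version B", 200),
    ],
)

# Detection only: list every externalId that occurs more than once.
duplicates = conn.execute(
    """
    SELECT externalId, COUNT(*) AS n
    FROM incoming
    GROUP BY externalId
    HAVING COUNT(*) > 1
    """
).fetchall()

# Resolution: keep the newest row per externalId.
deduplicated = conn.execute(
    """
    SELECT externalId, value, lastUpdatedTime
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY externalId
                ORDER BY lastUpdatedTime DESC
            ) AS rn
        FROM incoming
    ) t
    WHERE rn = 1
    ORDER BY externalId
    """
).fetchall()

print(duplicates)
print(deduplicated)
```

Selecting the columns explicitly in the outer query, instead of `SELECT *`, keeps the helper column `rn` out of the result.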

We are working on updating the documentation around this.
