Transformations — Non-deterministic silent upsert when duplicate externalIds are split across workers: request for configurable deduplication strategy

Related products: Transformations
  • May 15, 2026

Observed behavior

When a CDF Transformation produces rows with duplicate externalId values, the behavior depends on how CDF partitions the data across workers:

  • Duplicates within the same API request (same batch) → the API raises an error → the transformation fails visibly ✅
  • Duplicates split across multiple API requests (different CDF partitions/workers) → each request succeeds individually → the transformation completes with status "success", but which version of the node was actually written to the knowledge graph is unknown and non-deterministic ❌
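The two cases above can be illustrated with a toy simulation. This is not CDF's actual batching code; the names (simulate_request, split_into_batches) and the batch size are invented for illustration, and the in-memory dict stands in for the knowledge graph's upsert semantics:

```python
def split_into_batches(rows, batch_size):
    """Chunk rows the way a transformation might split them across API requests."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def simulate_request(batch, store):
    """Mimic the observed API behavior: reject internal duplicates, upsert otherwise."""
    ids = [r["externalId"] for r in batch]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate externalId within one request")
    for r in batch:
        store[r["externalId"]] = r  # upsert: a later request silently overwrites

rows = [
    {"externalId": "pump-1", "value": 1},
    {"externalId": "pump-1", "value": 2},
]

# Case 1: both duplicates land in the same request -> visible failure.
store = {}
try:
    simulate_request(rows, store)
except ValueError as e:
    print("failed:", e)

# Case 2: duplicates split across two requests -> each succeeds, silent upsert.
store = {}
for batch in split_into_batches(rows, batch_size=1):
    simulate_request(batch, store)
print(store["pump-1"])  # whichever request landed last wins
```

In the real system the second case is worse than this sequential sketch suggests: the requests come from concurrent workers, so "last" depends on flush timing rather than row order.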

Problem

The second case is the dangerous one:

  1. Silent and invisible — the run reports success, no alert is triggered, no engineer investigates. The data in the knowledge graph may be incomplete or wrong with no trace.
  2. Non-deterministic — which duplicate "wins" depends entirely on which CDF worker flushes first. Two identical runs on the same input can produce different results.
  3. No user control — there is no way to express intent: "fail if duplicates exist", "always keep the latest", "deduplicate before write". The behavior is undefined and undocumented at the transformation level.

Feature Request

Add a conflict-resolution option at the transformation level for duplicate externalId values within a single run:

| Option | Behavior |
| --- | --- |
| fail | Abort the run and report an error if any duplicate externalId is detected across all batches |
| keep_last | Last row processed wins |
| deduplicate_before_write | CDF deduplicates globally before dispatching to the API |
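To make the requested semantics concrete, here is a minimal client-side sketch of the first two strategies, applied to the full row set before it is split into batches. The function name resolve_duplicates and the row shape are assumptions for illustration, not part of any existing CDF API; deduplicate_before_write would behave like keep_last here, with the difference that CDF would apply it globally before partitioning:

```python
def resolve_duplicates(rows, strategy):
    """Resolve duplicate externalId values in a row set before writing.

    strategy="fail":      raise on the first duplicate (run aborts visibly).
    strategy="keep_last": the last occurrence of each externalId wins.
    """
    if strategy == "fail":
        seen = set()
        for r in rows:
            eid = r["externalId"]
            if eid in seen:
                raise ValueError(f"duplicate externalId: {eid}")
            seen.add(eid)
        return list(rows)
    if strategy == "keep_last":
        latest = {}
        for r in rows:
            latest[r["externalId"]] = r  # later rows overwrite earlier ones
        return list(latest.values())
    raise ValueError(f"unknown strategy: {strategy}")
```

The key point of the request is that this resolution must happen globally, before CDF partitions rows across workers; per-batch deduplication cannot detect duplicates that land in different requests.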