
Test-Driven Extraction: My Journey into Building Custom CDF Extractors

  • January 30, 2026

Lars Skår
Practitioner

Wow—was I in for a challenge!

I just received my first proper assignment after joining Cognite: building a custom extractor to move documents and their metadata from a Document Management System (DMS) into Cognite Data Fusion (CDF).

Diving headfirst into this, I found myself at the intersection of a steep learning curve, the high-stakes data needs of our industrial customers, and my own belief in Agile engineering practices. Coming from a background where Test-Driven Development (TDD) is the heartbeat of quality, I realized I didn't just want to build a script; I wanted to build a process I could trust.

In this series, I want to share a practice that served me well while learning the ropes: The "Twin Auditor" Pattern.

The Challenge: Beyond Logic, It’s Integration

When you build an extractor, you aren't just writing code in a vacuum. You are building a bridge between two distinct worlds: your source system (the DMS) and your data platform (CDF).

Extraction is inherently an integration challenge. Your code must successfully:

  1. Connect to the DMS and handle its specific API/security requirements.

  2. Authenticate with CDF using OIDC.

  3. Stream heavy binary files without corruption.

  4. Map metadata into RAW tables that match the intended schema.

If the integration fails—even if your transformation logic is "perfect"—the data doesn't land. Simply seeing a "Success" log in your Python console isn't enough to prove that the operational handshake actually worked.
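To make those handshakes concrete, here is a minimal sketch of steps 2 and 4 using the Cognite Python SDK. The tenant, project, and table names are placeholders of my own; a real extractor would read them from its configuration:

```python
from cognite.client import CogniteClient, ClientConfig
from cognite.client.credentials import OAuthClientCredentials

# Step 2: authenticate with CDF via OIDC client credentials.
# Every identifier below is a placeholder for illustration.
credentials = OAuthClientCredentials(
    token_url="https://login.microsoftonline.com/<tenant>/oauth2/v2.0/token",
    client_id="<client-id>",
    client_secret="<client-secret>",
    scopes=["https://api.cognitedata.com/.default"],
)
client = CogniteClient(
    ClientConfig(
        client_name="dms-extractor",
        project="<cdf-project>",
        base_url="https://api.cognitedata.com",
        credentials=credentials,
    )
)

# Step 4: land one document's metadata as a row in a RAW table.
client.raw.rows.insert(
    db_name="dms_extractor",
    table_name="documents",
    row={"doc-001": {"title": "P&ID rev 3", "source": "DMS"}},
    ensure_parent=True,  # create the database/table if they don't exist yet
)
```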

The Strategy: Integration-First TDD

Instead of building a "Black Box" and hoping the bridge holds, I decided to build a Twin. Using the cognite-extractor-utils framework, I developed two distinct entry points that share the same DNA:

  • The Worker: The extractor logic that does the heavy lifting of pulling from the DMS and pushing to CDF.

  • The Auditor: A lightweight "Twin" that acts as a continuous Integration Test.

The cognite-extractor-utils framework provides a large share of the code you need for a robust extractor (handling auth, retries, and config). I realized I could use that same foundation to build the Auditor. Because it reads the same configuration YAML, the Auditor doesn't just test code; it tests the operational reality: "Can I actually see and verify the data that was just pushed to CDF?"
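Here is a minimal sketch of what one such Auditor check can look like, assuming the client and the placeholder RAW names from the sketch above:

```python
from cognite.client import CogniteClient

def audit_document(client: CogniteClient, key: str = "doc-001") -> None:
    # Verify what actually landed in CDF, not what the Worker's logs claim.
    row = client.raw.rows.retrieve(
        db_name="dms_extractor", table_name="documents", key=key
    )
    assert row is not None, f"Row {key!r} never landed in RAW"
    assert "title" in row.columns, "Expected metadata column is missing"

    # For the document binary, check that the upload actually completed.
    file_meta = client.files.retrieve(external_id=f"dms:{key}")
    assert file_meta is not None and file_meta.uploaded, "Binary upload never finished"
```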

Combating the Skepticism: "Isn't test automation too much work?"

I know the feeling—we’ve all been under pressure to "just ship it." Many believe TDD or automated integration testing takes too much time.

But here is what I discovered: You are going to test your integration anyway.

You can either test it manually (logging into CDF, hunting for RAW rows, checking file metadata) every time you make a change, or you can plan your test upfront. To me, the best way to plan an integration test is to write a piece of code that runs it for you. That code became a "Cognitive Rail" that kept me focused: instead of being overwhelmed by the entire pipeline, I only had to care about making one specific operational handshake turn green.
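Concretely, the "piece of code that runs it for you" can be as small as a pytest check. The environment variable names and the row key here are my own conventions, not something extractor-utils prescribes:

```python
import os

import pytest
from cognite.client import CogniteClient, ClientConfig
from cognite.client.credentials import OAuthClientCredentials

@pytest.fixture(scope="session")
def client() -> CogniteClient:
    # Build the client from environment variables so the same test
    # runs on a laptop and in CI without code changes.
    credentials = OAuthClientCredentials(
        token_url=os.environ["CDF_TOKEN_URL"],
        client_id=os.environ["CDF_CLIENT_ID"],
        client_secret=os.environ["CDF_CLIENT_SECRET"],
        scopes=[os.environ["CDF_SCOPES"]],
    )
    return CogniteClient(
        ClientConfig(
            client_name="dms-extractor-auditor",
            project=os.environ["CDF_PROJECT"],
            base_url=os.environ.get("CDF_BASE_URL", "https://api.cognitedata.com"),
            credentials=credentials,
        )
    )

def test_document_metadata_landed(client: CogniteClient) -> None:
    # One specific operational handshake to turn green.
    row = client.raw.rows.retrieve(
        db_name="dms_extractor", table_name="documents", key="doc-001"
    )
    assert row is not None, "Expected row 'doc-001' in RAW after extraction"
```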

Why this matters for Agile Teams

For Agile teams, this is about clarity at speed.

  • 🔴 The Red Phase: You define the successful integration state in CDF before you start.

  • 🟢 The Green Phase: You write the code to make the handshake happen.

  • 🔵 The Refactor Phase: You optimize your code, safe in the knowledge that your "Twin" Auditor is monitoring the bridge.

What’s Coming Next

In this series, I’m going to pull back the curtain on how I structured this project:

  • Part 2: The Blueprint – Leveraging extractor-utils and Pydantic to create a shared configuration that keeps the Worker and Auditor always in sync (a quick preview follows below).

  • Part 3: The Orchestration – Using GitHub Actions to automate this integration testing every time you push code.
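As that quick preview of Part 2, here is one possible shape for the shared configuration. The field names and YAML layout are illustrative assumptions of mine, not the extractor-utils schema:

```python
import yaml
from pydantic import BaseModel

# Both the Worker and the Auditor load the same model from the same
# YAML file, so a renamed field fails fast in both entry points.
class RawDestination(BaseModel):
    database: str
    table: str

class ExtractorConfig(BaseModel):
    cdf_project: str
    base_url: str = "https://api.cognitedata.com"
    raw: RawDestination

def load_config(path: str) -> ExtractorConfig:
    with open(path) as f:
        return ExtractorConfig(**yaml.safe_load(f))
```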

TDD made my breakthrough possible. It turned a complex integration task into a manageable learning experience. For me, using the extractor-utils framework to drive these tests was a game-changer.

Follow this series for more, and let me know in the comments if you’ve struggled with "Black Box" extractions!