Automating the annotation of critical industrial documents, such as Piping and Instrumentation Diagrams (P&IDs), is essential for effective data contextualization. To make this process more robust and scalable, we present a template for annotating files within Cognite Data Fusion (CDF). The template is already used by some of our biggest projects, where it cut annotation time from 16 days to less than half a day when processing ~66,500 files. Smaller Quickstart projects are adopting it as well, with no code changes.
This template uses a data model-centric approach that is designed to handle the evolving needs of complex projects. It provides a standardized and flexible way to manage the entire annotation lifecycle, from file selection to final reporting.
For the complete code, detailed deployment steps (including a video walkthrough), and advanced guides, please visit the official repository:
https://github.com/cognitedata/library/tree/main/modules/contextualization/cdf_file_annotation
Key Features
This annotation template is built to be both powerful out-of-the-box and easy to customize:
- Configuration-Driven: The entire workflow is controlled by a single config.yaml file, allowing you to adapt to different data models and requirements without changing any code.
- Large Document Support: Automatically handles files larger than 50 pages by breaking them into smaller chunks and processing them iteratively.
- Ready for Parallel Execution: A robust optimistic locking mechanism prevents race conditions, ensuring stability when running multiple functions concurrently.
- Detailed Reporting and Auditing: Annotation details are recorded in CDF RAW tables, function logs, and extraction pipeline runs, and are surfaced in the Streamlit dashboard included with the template, providing clear audit trails.
- Local Development and Debugging: The template includes a pre-configured setup for easy local testing and debugging within VS Code.
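To illustrate the large-document handling mentioned above, here is a minimal sketch of splitting a file into page ranges of at most 50 pages. The function name `page_chunks` and the 1-based inclusive ranges are assumptions made for illustration; the template's actual chunking code may differ.

```python
from typing import Iterator, Tuple

# The 50-page limit comes from the feature description; everything else is illustrative.
MAX_PAGES_PER_CHUNK = 50


def page_chunks(total_pages: int, chunk_size: int = MAX_PAGES_PER_CHUNK) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) ranges that cover the file."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


# A 120-page file is processed in three iterations.
print(list(page_chunks(120)))  # [(1, 50), (51, 100), (101, 120)]
```

Each range can then be submitted as its own detection request, so a single very large file never exceeds the per-request page limit.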
How to Set Up the Template
Getting the template running in your CDF project is a streamlined process using Cognite Toolkit.
- Integrate the Module: Add the module to your project and update the config.yaml file with your project-specific details, such as data model views and target entities.
- Create an Environment File: Create a .env file in the root directory to hold your CDF project credentials and connection information.
- Build and Deploy: Use the Cognite Toolkit to build and deploy the module to your CDF environment.
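Before building and deploying, it can help to confirm that the .env file contains everything you expect. The sketch below parses simple KEY=VALUE lines and reports missing entries. The variable names in `REQUIRED_KEYS` are assumptions for illustration only; the repository's README lists the ones the template actually needs.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Hypothetical variable names; replace with the ones listed in the README.
REQUIRED_KEYS = ("CDF_PROJECT", "CDF_CLUSTER", "IDP_CLIENT_ID", "IDP_CLIENT_SECRET")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str], required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env(env_path.read_text()) if env_path.exists() else {}
    print("Missing keys:", missing_keys(values))
```

Running this from the project root prints an empty list when every expected key is present and non-empty.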
How to Use the Template
Once deployed, the template can be used in two primary ways: as an automated workflow in CDF or run locally for development and debugging.
- Automated Workflow Execution: After deployment, the annotation process is managed by a workflow in CDF that orchestrates the Launch and Finalize functions. This workflow is automatically triggered based on the schedule defined in the Toolkit’s configuration file, continuously processing new files as they arrive. You can monitor the progress and logs of the functions in the annotation pipeline dashboard that comes with the template or in the function logs.
- Local Development and Debugging: The template is configured for easy local execution directly within Visual Studio Code. By using the pre-configured launch.json file, you can run and debug both the Launch and Finalize functions on your local machine. This allows you to set breakpoints, inspect variables, and test your configuration before deploying.
How the Workflow Operates
The template orchestrates the annotation process through a workflow that consists of three main phases:
- Prepare: This initial phase identifies new files that need to be annotated. It queries for files tagged for processing and, for each one, creates a corresponding AnnotationState instance in the data model to track the file's journey.
- Launch: The launch function queries for all files ready for processing. It efficiently groups them, fetches the necessary context from the data model, and calls the Cognite Diagram Detect API to begin the annotation job.
- Finalize: Once an annotation job is complete, the finalize function retrieves the results, applies the new annotations to the file, and updates the file's status to "Annotated" or "Failed". A summary report is then written to a CDF RAW table.
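The three phases above can be pictured as a small state machine. The sketch below is a simplified in-memory model, not the template's code: the "New" and "Processing" statuses, the `job_id` field, and the `start_job` / `job_succeeded` callables are hypothetical stand-ins for the data model queries and the Diagram Detect API calls.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Iterable, List, Optional


class Status(str, Enum):
    NEW = "New"                # hypothetical intermediate status
    PROCESSING = "Processing"  # hypothetical intermediate status
    ANNOTATED = "Annotated"    # final status named in the text
    FAILED = "Failed"          # final status named in the text


@dataclass
class AnnotationState:
    file_id: str
    status: Status = Status.NEW
    job_id: Optional[int] = None


def prepare(file_ids: Iterable[str]) -> List[AnnotationState]:
    """Create one tracking record per file that needs annotation."""
    return [AnnotationState(file_id) for file_id in file_ids]


def launch(states: List[AnnotationState], start_job: Callable[[str], int]) -> None:
    """Start a detection job for every file that is ready for processing."""
    for state in states:
        if state.status is Status.NEW:
            state.job_id = start_job(state.file_id)
            state.status = Status.PROCESSING


def finalize(states: List[AnnotationState], job_succeeded: Callable[[int], bool]) -> List[Dict[str, str]]:
    """Settle each running job and return summary rows for reporting."""
    report = []
    for state in states:
        if state.status is Status.PROCESSING:
            state.status = Status.ANNOTATED if job_succeeded(state.job_id) else Status.FAILED
            report.append({"file_id": state.file_id, "status": state.status.value})
    return report


# Stand-ins for the real job submission and result lookup.
job_ids = {"pid-001.pdf": 1, "pid-002.pdf": 2}
states = prepare(job_ids)
launch(states, start_job=job_ids.__getitem__)
report = finalize(states, job_succeeded=lambda job_id: job_id != 2)
print(report)
```

Keeping the status on a dedicated record, rather than on the file itself, is what lets each phase pick up exactly the files it is responsible for.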
Built for Scale and Customization
This template was designed with two core principles in mind: addressing evolving project needs and balancing simple configuration with deep customization.
For most use cases, editing the config.yaml file is all you need to get started. However, when projects demand unique logic or performance optimizations, its interface-based architecture provides an "escape hatch". Developers can implement their own custom Python classes for specialized logic without altering the core template code.
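As a rough illustration of what such an "escape hatch" can look like, the sketch below defines an abstract interface and swaps in a custom implementation without touching the calling code. All names here (`FileSelector`, `TagSelector`, `SiteTagSelector`, `run_prepare`) are hypothetical; the actual interfaces are described in DEVELOPING.md.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

FileRecord = Dict[str, Any]


class FileSelector(ABC):
    """Hypothetical interface: decides which files enter the annotation flow."""

    @abstractmethod
    def select(self, files: List[FileRecord]) -> List[FileRecord]:
        ...


class TagSelector(FileSelector):
    """Default-style behaviour: pick files that carry a given tag."""

    def __init__(self, tag: str) -> None:
        self.tag = tag

    def select(self, files: List[FileRecord]) -> List[FileRecord]:
        return [f for f in files if self.tag in f.get("tags", [])]


class SiteTagSelector(TagSelector):
    """Custom logic: additionally restrict the selection to one site."""

    def __init__(self, tag: str, site: str) -> None:
        super().__init__(tag)
        self.site = site

    def select(self, files: List[FileRecord]) -> List[FileRecord]:
        return [f for f in super().select(files) if f.get("site") == self.site]


def run_prepare(selector: FileSelector, files: List[FileRecord]) -> List[str]:
    """Core code depends only on the interface, not on a concrete class."""
    return [f["name"] for f in selector.select(files)]


files = [
    {"name": "pid-001.pdf", "tags": ["ToAnnotate"], "site": "A"},
    {"name": "pid-002.pdf", "tags": ["ToAnnotate"], "site": "B"},
    {"name": "notes.txt", "tags": [], "site": "A"},
]
print(run_prepare(TagSelector("ToAnnotate"), files))
print(run_prepare(SiteTagSelector("ToAnnotate", "A"), files))
```

Because `run_prepare` only knows about the interface, project-specific selection rules live in their own class and the shared code stays unchanged.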
This stateful, data model-driven approach ensures high reliability and performance, using built-in optimistic locking for concurrency and indexed queries to efficiently find files that need processing.
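For readers unfamiliar with optimistic locking, here is a minimal in-memory sketch of the general idea: every record carries a version, and an update succeeds only if the version is unchanged since it was read. The `InMemoryStore` class and the status strings are assumptions for illustration; this is not the template's implementation, which relies on the data model rather than a Python dictionary.

```python
from dataclasses import dataclass
from typing import Dict


class VersionConflict(Exception):
    """Raised when a record changed between read and update."""


@dataclass
class Record:
    status: str
    version: int = 1


class InMemoryStore:
    """Toy store; a real backend would perform the version check atomically."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def put(self, key: str, status: str) -> None:
        self._records[key] = Record(status)

    def read(self, key: str) -> Record:
        current = self._records[key]
        return Record(current.status, current.version)  # return a snapshot copy

    def update(self, key: str, new_status: str, expected_version: int) -> None:
        current = self._records[key]
        if current.version != expected_version:
            raise VersionConflict(f"{key}: expected v{expected_version}, found v{current.version}")
        self._records[key] = Record(new_status, current.version + 1)


def try_claim(store: InMemoryStore, key: str, worker: str, snapshot: Record) -> bool:
    """Claim a file for processing; return False if another worker got there first."""
    if snapshot.status != "New":
        return False
    try:
        store.update(key, f"Processing:{worker}", snapshot.version)
        return True
    except VersionConflict:
        return False


store = InMemoryStore()
store.put("pid-001.pdf", "New")

# Two workers read the same record before either one writes.
seen_by_a = store.read("pid-001.pdf")
seen_by_b = store.read("pid-001.pdf")

print(try_claim(store, "pid-001.pdf", "worker-a", seen_by_a))  # True
print(try_claim(store, "pid-001.pdf", "worker-b", seen_by_b))  # False
```

No lock is held while a worker reads or computes; the conflict is detected only at write time, which is why this approach suits many functions running concurrently over a shared set of files.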
Getting Started and Diving Deeper
The repository's README.md file provides a complete step-by-step guide to get you up and running.
To understand the full capabilities of this template, the repository includes several in-depth guides in the cdf_file_annotation/detailed_guides/ directory:
- CONFIG.md: A detailed outline of the configuration file.
- CONFIG_PATTERNS.md: Recipes for common operational tasks and performance tuning.
- DEVELOPING.md: A guide for developers who wish to extend the template's functionality.
We highly encourage anyone interested to explore the repository, read the detailed guides, and deploy the template in their own CDF project.