Strategic Objective: Establishing CDF as the Definitive Source of Truth
For Koch Ag & Energy Solutions (KAES), the primary goal is for all stakeholders to trust CDF as the most accurate and reliable resource for operational data. To achieve this, we must move away from "silent failures." We require proactive monitoring to ensure that if data is missing from CDF, the system—not the user—is the first to identify and report the gap.
New Requirement: Proactive Integrity & Trust Monitoring
In addition to standard execution logs, the Kafka Connector API should provide hooks for proactive health checks:
- Data Freshness Latency: Real-time reporting of the "age" of the last record written to CDF versus the timestamp of the event in Kafka.
- Source-to-Sink Parity: Automated counters to verify that the number of records consumed from Kafka equals the number of records successfully ingested into CDF.
- Proactive "No Data" Alerts: The ability to trigger a log event or status change if a high-priority Kafka topic produces zero records for longer than a defined threshold (e.g., 5 minutes).
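To make the three checks above concrete, here is a minimal sketch of the per-topic bookkeeping they imply. It is only an illustration: the class `TopicHealth` and all of its field and method names are hypothetical and do not correspond to any existing Cognite or Kafka API, and timestamps are passed in explicitly as seconds so the logic is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TopicHealth:
    """Hypothetical per-topic counters; not part of any real connector API."""
    no_data_threshold_s: float = 300.0       # assumed default: 5 minutes
    started_at: float = 0.0                  # when monitoring of the topic began
    consumed: int = 0                        # records read from Kafka
    ingested: int = 0                        # records confirmed written to CDF
    last_event_ts: Optional[float] = None    # Kafka event time of the last record written
    last_write_ts: Optional[float] = None    # time that record was written to CDF
    last_consume_ts: Optional[float] = None  # time the last record was consumed

    def record_consumed(self, now: float) -> None:
        self.consumed += 1
        self.last_consume_ts = now

    def record_ingested(self, event_ts: float, now: float) -> None:
        self.ingested += 1
        self.last_event_ts = event_ts
        self.last_write_ts = now

    def freshness_latency_s(self) -> Optional[float]:
        """Seconds between the Kafka event time and the CDF write of the last record."""
        if self.last_event_ts is None or self.last_write_ts is None:
            return None
        return self.last_write_ts - self.last_event_ts

    def parity_gap(self) -> int:
        """Records consumed but not (yet) ingested; 0 means source and sink agree."""
        return self.consumed - self.ingested

    def no_data(self, now: float) -> bool:
        """True if nothing has been consumed for longer than the threshold."""
        reference = self.last_consume_ts if self.last_consume_ts is not None else self.started_at
        return now - reference > self.no_data_threshold_s
```

A monitoring agent could poll values like these per topic and raise an alert when `parity_gap()` stays non-zero or `no_data()` turns true, rather than waiting for a user to notice missing data.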
Use Cases: The KAES RCA & Trust Agent
The internal KAES Monitoring Agents will use these API enhancements to fulfill two roles:
- The Fixer (RCA): When a pipeline breaks, an agent can use the logs to identify the issue and either attempt a fix or produce a detailed RCA, saving data engineers hours of manual tracing.
- The Guarantor (Trust): If there is a discrepancy between source systems and Cognite models, an agent can assist in reconciling the data. Additionally, if a data stream slows down or encounters frequent issues, the agent can proactively flag these problems and recommend a plan of action for the data engineer to address.
Business Value for KAES
- User Adoption: Increases our Operations partners' trust in the Connected Cognite Data Foundation.
- Operational Excellence: Transitions the engineering team from reactive troubleshooting to managing by exception. Today our team spends ~50% of its time on low-value data pipeline issues.
- Scalability: Provides a standardized way to monitor thousands of concurrent data streams across the KAES enterprise, and speeds up deployment of internal production-ready solutions because the underlying data is of higher quality.
Technical Specifics for Implementation
- Health Status API: A GET endpoint returning the current "Liveness" and "Readiness" of specific Kafka consumer groups.
- Structured Error Categorization: Distinct error codes for Transient (network), Permanent (schema/logic), and Source (empty topic) issues to allow the KAES agent to categorize the RCA automatically.
- OpenTelemetry Integration: Support for exporting these metrics to external observability stacks.
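As a rough illustration of what we have in mind for the health status response and the error categories, the sketch below shows one possible shape. Everything in it is an assumption made for discussion: the error codes (`NETWORK_TIMEOUT`, `SCHEMA_MISMATCH`, `EMPTY_TOPIC`), the payload fields, and the triage actions are hypothetical and not an existing Cognite API.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class ErrorCategory(str, Enum):
    TRANSIENT = "TRANSIENT"  # e.g. network issues
    PERMANENT = "PERMANENT"  # e.g. schema or logic errors
    SOURCE = "SOURCE"        # e.g. empty topic


# Hypothetical error codes mapped to the three categories requested above.
ERROR_CODES = {
    "NETWORK_TIMEOUT": ErrorCategory.TRANSIENT,
    "SCHEMA_MISMATCH": ErrorCategory.PERMANENT,
    "EMPTY_TOPIC": ErrorCategory.SOURCE,
}


def triage(code: str) -> str:
    """Map an error code to a first action an agent might take (illustrative only)."""
    category = ERROR_CODES.get(code)
    if category is ErrorCategory.TRANSIENT:
        return "retry"
    if category is ErrorCategory.PERMANENT:
        return "escalate"
    if category is ErrorCategory.SOURCE:
        return "check_source"
    return "unknown"


@dataclass
class ConsumerGroupHealth:
    """Hypothetical body of the GET health response for one consumer group."""
    consumer_group: str
    live: bool   # the connector process is running
    ready: bool  # the connector is consuming and writing to CDF
    last_error_code: Optional[str] = None


def health_payload(health: ConsumerGroupHealth) -> dict:
    """Build a JSON-serializable payload, adding the derived error category."""
    payload = asdict(health)
    category = ERROR_CODES.get(health.last_error_code or "")
    payload["last_error_category"] = category.value if category else None
    return payload
```

With distinct categories in the response, the KAES agent could decide between retrying, escalating to a data engineer, or checking the source system without parsing free-text log messages.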