Why
When starting out in a Cognite Data Fusion (CDF) project, it's natural to create data governance elements such as groups, datasets, and RAW databases from the CDF user interface. But as the solution scales, maintaining a detailed configuration by hand across multiple solutions, sources, roles, and other dimensions quickly becomes demanding. Scaling requires precise control over access management and data governance, and a way to enforce guidelines and rules across the solution.
The problem is not trivial, and a good way to solve it is by replacing the manual approach with a configuration-driven system, where the configuration language supports higher-level concepts for data lineage and access control. With configuration files as the foundation, you can set up an automated DevOps process and review and approve any changes to the structure before they are deployed. This approach also dramatically simplifies sharing the same configuration across multiple environments.
Furthermore, this method adds traceability, reproducibility, and transparency to your solution. If done correctly, it can minimize or even rule out manual configuration changes, which reduces risk to the governance-controlled process. Configuration files also make it easy to create new groups, datasets, and similar elements: duplicate the relevant part of the file and adapt what needs changing.
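To make the configuration-driven idea concrete, here is a minimal sketch in Python. It assumes a hypothetical configuration layout and naming scheme (the `CONFIG` structure, the `expand` function, and the `namespace:node` names are illustrative only and are not the format of any specific tool). It shows how a small, reviewable configuration could be expanded into a consistent set of datasets, RAW databases, and groups for any environment.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical configuration: each namespace lists the nodes it contains.
# "src" stands for data sources, "uc" for use cases (illustrative names only).
CONFIG = {
    "src": ["sap", "pi"],
    "uc": ["maintenance"],
}


@dataclass(frozen=True)
class GovernanceElements:
    """Names of the governance elements derived for one configuration node."""
    dataset: str
    raw_db: str
    owner_group: str
    read_group: str


def expand(config: dict[str, list[str]], env: str) -> list[GovernanceElements]:
    """Derive one dataset, one RAW database, and an owner/read group pair per node."""
    elements = []
    for namespace, nodes in config.items():
        for node in nodes:
            base = f"{namespace}:{node}"
            elements.append(
                GovernanceElements(
                    dataset=f"{base}:dataset",
                    raw_db=f"{base}:rawdb",
                    # Group names carry the environment so that the same
                    # configuration can be reused for dev, test, and prod.
                    owner_group=f"{env}:{base}:owner",
                    read_group=f"{env}:{base}:read",
                )
            )
    return elements


if __name__ == "__main__":
    for element in expand(CONFIG, env="dev"):
        print(element)
```

Adding a new source in this sketch means adding one entry to `CONFIG`; the derived names stay consistent, and the change can be reviewed like any other code change before it is deployed.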
When do I start?
The best time to start is at the beginning of the CDF setup process. The longer it goes before these elements are put into a system, the more challenging it becomes and the more work it takes to get things in order. Take the time to set things up correctly from the start.
Where do I start?
The CDF API, the Cognite Python SDK, and the Cognite extractor-utils allow you to build flexible tooling for handling the setup. Based on these libraries, a community tool is already available: cognitedata/inso-bootstrap-cli, a CLI and GitHub Action to configure and maintain CDF projects (groups, datasets, RAW databases).
The inso-bootstrap CLI significantly simplifies the access-control configuration and supports data lineage from raw data sources to final data products, using human-readable configuration files built for scale.