Manage data governance at scale in CDF



Why

When starting out in a Cognite Data Fusion (CDF) project, it's natural to create data governance elements like groups, datasets, and RAW databases from the CDF user interface. But as the solution begins to scale, you'll quickly find that a detailed configuration covering multiple solutions, sources, roles, and other dimensions is demanding to set up and maintain by hand. Scaling requires precise control over access management and data governance, and a way to enforce guidelines and rules across the solution.

 

The problem is not trivial, and a good way to solve it is by replacing the manual approach with a configuration-driven system, where the configuration language supports higher-level concepts for data lineage and access control. With configuration files as the foundation, you can set up an automated DevOps process and review and approve any changes to the structure before they are deployed. This approach also dramatically simplifies sharing the same configuration across multiple environments. 
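To make the idea concrete, here is a minimal sketch of a configuration-driven workflow: a declarative config is expanded into the resources an environment should contain, and then compared with what already exists to produce a reviewable change plan. The config layout, the naming scheme, and the functions `desired_resources` and `plan_changes` are assumptions made up for this illustration; they are not the schema or API of any Cognite tool, and no CDF calls are made.

```python
from dataclasses import dataclass

# Hypothetical declarative config; NOT the schema of any real tool.
CONFIG = {
    "namespaces": ["src:sap", "src:pi", "uc:maintenance"],
    "roles": ["owner", "read"],
}


def desired_resources(config: dict, env: str) -> dict[str, set[str]]:
    """Expand the config into the resource names one environment should contain."""
    datasets, raw_dbs, groups = set(), set(), set()
    for ns in config["namespaces"]:
        datasets.add(f"{ns}:dataset")
        raw_dbs.add(f"{ns}:rawdb")
        for role in config["roles"]:
            # Groups are environment-specific in this illustrative naming scheme.
            groups.add(f"{env}:{ns}:{role}")
    return {"datasets": datasets, "raw_dbs": raw_dbs, "groups": groups}


@dataclass
class Plan:
    create: dict[str, set[str]]
    delete: dict[str, set[str]]


def plan_changes(desired: dict[str, set[str]], existing: dict[str, set[str]]) -> Plan:
    """Compare desired and existing state; the result can be reviewed before deployment."""
    create = {kind: desired[kind] - existing.get(kind, set()) for kind in desired}
    delete = {kind: existing.get(kind, set()) - desired[kind] for kind in desired}
    return Plan(create=create, delete=delete)


if __name__ == "__main__":
    # The same config yields a plan per environment.
    for env in ("dev", "prod"):
        plan = plan_changes(desired_resources(CONFIG, env), existing={})
        print(env, "groups to create:", sorted(plan.create["groups"]))
```

Because the plan is computed from files under version control, a change to the structure can be reviewed as an ordinary diff before anything is applied.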

 

Furthermore, this method adds traceability, reproducibility, and transparency to your solution. If done correctly, it can minimize or even rule out manual configuration changes, which reduces risk to the governance-controlled process. Configuration files also make it easy to create new groups, datasets, etc., by duplicating parts of the file and then adapting the parts that need changing.

 

When do I start

The best time to start is at the beginning of the CDF setup process. The longer it goes before these elements are put into a system, the more challenging it is and the more work it takes to get things in order. Take the time to set things up correctly from the start.

 

Where do I start

The CDF API, the Cognite Python SDK, and the Cognite extractor-utils library allow you to build flexible tooling for handling the setup. Based on these libraries, a community tool is already available: cognitedata/inso-bootstrap-cli, a CLI and GitHub Action to configure and maintain CDF projects (groups, data sets, RAW DBs).

The inso-bootstrap CLI significantly simplifies the access-control configuration and supports data lineage from raw data sources to final data products, using human-readable configuration files built for scale.
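As a rough illustration of how lineage and access control can be derived from one configuration, the sketch below maps each data product to the sources it consumes and computes read and write scopes per group. The `LINEAGE` layout, the group names, and `access_scopes` are hypothetical and chosen only for this example; the actual configuration format of inso-bootstrap-cli is documented in its repository and may look quite different.

```python
# Hypothetical lineage config: each data product lists the raw sources it reads.
# Illustrative only; this is not the inso-bootstrap-cli configuration format.
LINEAGE = {
    "uc:maintenance": ["src:sap", "src:pi"],
    "uc:production": ["src:pi"],
}


def access_scopes(lineage: dict[str, list[str]]) -> dict[str, dict[str, list[str]]]:
    """Derive read/write scopes per group from the lineage declaration."""
    scopes: dict[str, dict[str, list[str]]] = {}

    # Each source gets an owner group that may read and write only that source.
    for source in sorted({s for sources in lineage.values() for s in sources}):
        scopes[f"{source}:owner"] = {"write": [source], "read": [source]}

    # Each product owner writes to the product and reads from its upstream sources;
    # the read group gets the same read scope without any write access.
    for product, sources in lineage.items():
        readable = sorted([product, *sources])
        scopes[f"{product}:owner"] = {"write": [product], "read": readable}
        scopes[f"{product}:read"] = {"write": [], "read": readable}
    return scopes


if __name__ == "__main__":
    for group, scope in access_scopes(LINEAGE).items():
        print(group, scope)
```

The point of the design is that access follows the declared data flow: adding a source to a product's list is the only change needed to extend that product's read scope.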


2 replies


Hi @Sverre Dorheim , thanks for writing and sharing!

It was a team effort over the last months, but it takes someone to step up and talk about it :)

As governance (and access-control & data-lineage as part of it) is a highly complex topic, I fully agree that we must have configuration-driven tooling to stay on top from day one and during the operational life-cycle of a project.

As Cognite Principal Solution Architect and author of the mentioned “inso-bootstrap-cli”, I’m looking forward to questions and issues about how to use this approach and make it part of every CDF project. Please check out the other tools available, or ask about them.

For issues (feature requests, bugs, improvements, clarifications, ...), please use the GitHub Issues tracker, which allows our team to follow up.

🤖let’s automate!


@Ben Brandt I am interested in your thoughts on this method. Does it alleviate some of the issues you described in handling access to multiple environments?
