The following reflects current thinking from the App Dev Journey team in Cognite and is a mental model that will likely develop over time as this topic matures.
A data model enables users to describe the expected shape and structure of their data. It plays a crucial part in building solutions (like data science models, and mobile and web apps). It is also the core of an ontology, a knowledge graph, or an industry standard.
There are several reasons why data models are effective in the industrial space: data modeling enables explicit language, flexible customization, governed iteration, and easier access to data. Let's dive further into each of these qualities.
Data Model is Explicit
A data model needs to be explicit because it provides a clear contract/interface between data providers and consumers. Those loading the data and those using it can then both understand the underlying data and use it correctly. Explicit data modeling establishes a shared context and vocabulary, which avoids misunderstandings and makes communication around data clearer.
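To make this concrete, here is a minimal sketch (plain Python, with hypothetical field names and units, not any specific Cognite API) of what an explicit contract between provider and consumer might look like: both sides share one typed definition instead of guessing at field names and units.

```python
from dataclasses import dataclass
from datetime import datetime

# A shared, explicit definition of what a "work order" means.
# Field names and units are hypothetical, but because they are
# spelled out once, providers and consumers cannot disagree on them.
@dataclass
class WorkOrder:
    external_id: str
    title: str
    priority: int          # 1 (highest) to 5 (lowest)
    created_time: datetime
    duration_hours: float  # the unit is part of the contract, not a guess

# A provider constructs instances against the contract...
order = WorkOrder(
    external_id="wo-1001",
    title="Replace pump seal",
    priority=2,
    created_time=datetime(2023, 1, 15),
    duration_hours=4.5,
)

# ...and a consumer can rely on the same fields existing.
print(f"{order.title}: priority {order.priority}, {order.duration_hours}h")
```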
Data Model is Flexible
An additional requirement is that data models need to be flexible and customizable. Each data model will have its own level of flexibility and iteration speed. When building solutions (like an application or a data science model), the complexity of the end solution changes over time, and the expected shape of the data must scale with it. Larger data models (like those for ontologies or information models) will exist in various governed versions (more on governance in a later section). In these cases, multiple modeling attempts from different perspectives (e.g. maintenance vs. operations) could be happening simultaneously. The level of customization must also satisfy the different ways of expressing the data: not only data types constructed from text, number, enum, and list fields, but also complex relationships between these data types, as sketched below.
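As an illustration, this sketch (plain Python with hypothetical types) shows the kinds of expressiveness this implies: fields built from text, numbers, enums, and lists, plus a relationship from one type to another.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class PumpStatus(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    MAINTENANCE = "maintenance"

@dataclass
class Site:
    name: str

@dataclass
class Pump:
    serial_number: str                 # text field
    status: PumpStatus                 # enum field
    flow_rates: List[float] = field(default_factory=list)  # list field
    site: Optional[Site] = None        # relationship to another data type

# A later iteration can add fields or relationships without rewriting
# everything; the model scales with the solution's complexity.
pump = Pump("SN-42", PumpStatus.RUNNING, [10.2, 11.0], Site("Oseberg"))
print(pump.site.name if pump.site else "unassigned")
```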
Data Model is Governed
While iterating on a data model, the underlying data may need to co-exist across multiple versions or data models. Destructive changes to the data therefore need to be governed so that, when many data models share access to the same data, users reading it through a different data model see no unexpected breakage. Data models themselves also require governance. Smaller models for applications would likely require testing a data model before publishing a new version. Larger data models mapping out ontologies could involve several layers of business process. For modeling industry standards in particular, there should be a significant review process before an iteration can be made. The data model must be able to support all of these levels of governance.
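One way to picture governed iteration (a conceptual sketch in plain Python, not any specific versioning API): two versions of a model can coexist over the same data as long as changes are additive, while destructive changes break consumers of the older version and so need review.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AssetV1:
    external_id: str
    name: str

# Non-destructive iteration: v2 only ADDS an optional field, so data
# written against v1 is still valid and v1 consumers keep working.
@dataclass
class AssetV2:
    external_id: str
    name: str
    description: Optional[str] = None  # additive, safe to publish

# A destructive change (e.g. removing `name` or changing its type)
# would break every consumer still reading through v1 -- exactly the
# kind of change governance should gate behind testing or review.
record = {"external_id": "a-1", "name": "Compressor"}
print(AssetV1(**record), AssetV2(**record))
```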
Data Model is Accessible
Modeling the data is not useful unless the data within the data model is easily accessible. This applies to both flows: ingestion and consumption. Consumers, in particular, need a powerful and intuitive way to access the data.
Powerful
Ingesting data into the data model must be smooth, with quality assurance via monitoring and alerts. The added explicitness should enable better error handling overall and a better experience working with ETL and moving data. Querying should also benefit from data modeling, with better filtering, search, and aggregation capabilities. This comes from the ability to fetch data with the full variety of complex filters: supporting queries across the full complexity of the data model, including relationships between different data types.
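The sketch below illustrates, against in-memory objects rather than any real query API, the kind of filtering this enables: combining conditions on a type's own fields with conditions that traverse its relationships, followed by an aggregation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Site:
    name: str
    country: str

@dataclass
class WorkOrder:
    title: str
    priority: int
    site: Optional[Site]  # relationship to another type

orders = [
    WorkOrder("Replace seal", 1, Site("Oseberg", "NO")),
    WorkOrder("Inspect valve", 3, Site("Teesside", "UK")),
    WorkOrder("Repaint tank", 4, None),
]

# A "complex filter": an own-field condition (priority) combined with a
# condition on a related type (site.country), plus a simple aggregation.
urgent_no = [
    o for o in orders
    if o.priority <= 2 and o.site is not None and o.site.country == "NO"
]
print([o.title for o in urgent_no], "count:", len(urgent_no))
```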
Intuitive
Consuming and providing data to the data models must not come at the cost of relearning all the tooling around CDF. That means minimizing the amount of new learning: a UI for validating the data model and the data within it, exploring by building queries visually, and connecting directly to your development environment, for example through an SDK with compile-time checks. All existing tooling should be revamped to fully leverage data modeling, and much more tooling can be provided out of the box to make working with data models efficient and easy. For Cognite, this means that ingestion should be at least as easy as Transformations, and consumption at least as easy as Fusion's Data Exploration and the various SDKs.
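As a sketch of what "compile-time checks" can mean in practice (here using plain Python type hints verified by a static checker such as mypy; the model and client class are hypothetical, not a real SDK), a typed client generated from the data model lets mistakes surface in the editor instead of at runtime:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Pump:
    serial_number: str
    flow_rate: float

# Imagine this client class being generated from the data model, so it
# always matches the published version of the schema.
class PumpClient:
    def list(self, limit: int = 10) -> List[Pump]:
        # Placeholder: a real client would call the API here.
        return [Pump("SN-1", 10.5)][:limit]

pumps = PumpClient().list(limit=5)
total = sum(p.flow_rate for p in pumps)

# A typo like `p.flowrate` or passing `limit="5"` would be flagged by a
# static type checker (e.g. mypy) before the code ever runs.
print(f"total flow: {total}")
```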
Different Types of Data Models
In the industrial world, there are three categories of data models, each serving a different purpose with a different level of complexity:
1. Source Data Models
A source data model reflects how the data was represented in the original application or source system it comes from. The explicitness and flexibility here are dictated by the source's own data model, if one exists. A strong level of governance is needed on the incoming data, with a higher need for monitoring and alerting. The data must also be accessible enough for validation, but the queries are typically not complex.
Examples of this type of data model could be any data warehouse or source (e.g. SAP Work Orders, DMC alerts).
2. Domain Data Models (ontologies)
This is captured by the Asset Hierarchy in CDF today. However, what we often see is the need for additional ontologies to be modeled. This is also sometimes expressed as a need to model a high-level knowledge graph or industry standards (like CFIHOS, CNC, ISA95, etc.). There will be multiple "domain" data models working closely with each other, contextualized to reference the same underlying data while providing different perspectives. Domain data models iterate by incorporating additional data models from source and solution data models, especially in the knowledge graph use case. For industry standards, these will likely be read-only data models into which data can be ingested and from which it can be consumed.
These data models also require explicitness, but the flexibility will differ depending on the type of domain data model. The data model can accommodate as much iteration as the business needs require. Governance here is largely determined by business processes, which dictate how rigorous the access control on the data model and the data within it should be. Accessing the data requires a higher level of query complexity, as there will be a large amount of data to filter through.
Examples of this type of data model could be any ontology or industry standard (e.g. OPCUA).
3. Solution Data Models
The most iterative of the three types, solution data models represent the expectations of data for a use case, application, or solution. These data models reference and connect to data from all other (source, domain, and solution) data models. In the iterative process of developing solutions, explicitness and flexibility are at their highest, especially when collaboration between stakeholders is intense (e.g. a software developer and a subject matter expert working together). Governance is the least complex of the three types of data modeling in early iteration phases, increasing in complexity over time. Accessibility also needs to be strongest for these data models, as they require easy integration into development environments and other low-code applications.
Examples of this type of data model could be any solution or application (e.g. my production dashboard, a maintenance app).
How different are they?
While the use cases for each type of data model differ considerably, the required features largely overlap. All data models need to be explicit, flexible, governed, and accessible. Data flows between them from source to domain to solution:
| | Source Data Model | Domain Data Model | Solution Data Model |
| --- | --- | --- | --- |
| Explicit | Must | Must | Must |
| Flexible | Moderately | Moderately | Extremely |
| Governance | Advanced | Very Advanced | Less Advanced |
| Accessible | SDK + API | SDK + API + UI | SDK + API + UI + Visual Components |