CDF Modelling Challenge - Open Lineage

  • 14 October 2021
  • 2 replies

Hi everyone,

We at C4IR Ocean are starting a series in which we challenge the community to model a specific problem in CDF. The aim of the series is to facilitate discussion and invite community members to share interesting solutions and techniques.

The first challenge focuses on OpenLineage.

OpenLineage is an open standard for metadata and lineage collection, designed to instrument jobs as they are running. It defines a generic model of run, job, and dataset entities identified using consistent naming strategies. The core lineage model is extensible by defining specific facets to enrich those entities.
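For anyone new to the spec, here is a minimal sketch of what a single run event could look like, built as a plain Python dict. The top-level keys follow the OpenLineage run event structure (run, job, inputs, outputs); the namespace, producer URL and dataset names are hypothetical placeholders, and the facets are left empty.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical namespace, used only for illustration.
NAMESPACE = "c4ir-ocean"


def make_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a minimal OpenLineage-style run event as a plain dict."""

    def dataset(name):
        # Datasets are identified by namespace + name; facets carry extra metadata.
        return {"namespace": NAMESPACE, "name": name, "facets": {}}

    return {
        "eventType": event_type,  # e.g. "START", "COMPLETE" or "FAIL"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4()), "facets": {}},
        "job": {"namespace": NAMESPACE, "name": job_name, "facets": {}},
        "inputs": [dataset(name) for name in inputs],
        "outputs": [dataset(name) for name in outputs],
        # Placeholder URL identifying the code that emitted the event.
        "producer": "https://example.com/hypothetical-producer",
    }


event = make_run_event("COMPLETE", "clean_ais", ["raw_ais"], ["clean_positions"])
```

One event like this is emitted per state change of a run, so a START and a COMPLETE event for the same job share the same `runId`.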

In C4IR Ocean, we are dealing with data from multiple providers. Some datasets are open and publicly available, others are closed. It is therefore important to keep track of where data is coming from, and what transformations have been applied to the data after it is read from the provider.

We are seeing a healthy landscape of data lineage solutions, both open and closed source. At the same time, the number of scheduling and compute frameworks is growing rapidly, and with it the need for standardization.

OpenLineage aims to provide this standardization for data lineage, similar to how OpenTelemetry attempts to provide a standardized way of handling metrics, logs and traces. Schedulers and compute frameworks need only worry about a single interface, and the connection to the lineage framework is handled on the backend. This means that different schedulers such as Airflow, Dagster and Prefect, as well as compute frameworks such as Dask and Spark, all speak the same lineage language. It also enables us to swap between lineage frameworks without having to make changes to every single pipeline.

Our challenge to the CDF community is simple: how would you model the OpenLineage specification in CDF? What APIs would you use, and why?

We invite anyone in the community to participate in the challenge. There is no limit on which APIs to use: both general availability (v1) APIs and playground APIs are welcome.

Looking forward to seeing what you come up with :)


2 replies

Hi.

Two ways you could go about this: 1) easy and 2) hard :). The community can probably come up with other ideas as well.

Starting with 1). The core types in OpenLineage (dataset, job and run) map very well to the CDF resource types/API endpoints “data set”, “extraction pipeline” and “extraction pipeline run”. The reason I list this as “easy” is that in this case you have extraction pipelines (jobs) feeding data to data sets (datasets); there is no multi-step DAG. This model will give you nice, basic lineage, and you can use CDF's UI and built-in monitoring and alerting. And it is all in v1. I recommend that you investigate whether this can fit your needs. More information here: https://docs.cognite.com/cdf/integration/guides/interfaces/about_integrations.html
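As a rough sketch of option 1), the function below translates an OpenLineage-style run event into a payload shaped like an extraction pipeline run. The field names (`extpipeExternalId`, `status`, `message`) and the status values are written from memory of the v1 API and should be treated as assumptions to verify against the docs above; the event-type-to-status mapping is my own choice, not something either spec prescribes.

```python
# Assumed mapping from OpenLineage event types to extraction pipeline run
# statuses; anything that is not a terminal state is reported as "seen".
STATUS_BY_EVENT_TYPE = {
    "COMPLETE": "success",
    "FAIL": "failure",
    "ABORT": "failure",
}


def to_extpipe_run(event):
    """Turn an OpenLineage-style run event into an extraction pipeline run payload."""
    status = STATUS_BY_EVENT_TYPE.get(event["eventType"], "seen")
    return {
        # Assumes the extraction pipeline's external ID equals the job name.
        "extpipeExternalId": event["job"]["name"],
        "status": status,
        "message": f"OpenLineage run {event['run']['runId']}: {event['eventType']}",
    }


# Hypothetical event, trimmed to the fields the function reads.
sample_event = {
    "eventType": "COMPLETE",
    "run": {"runId": "run-001"},
    "job": {"namespace": "c4ir-ocean", "name": "clean_ais"},
}
payload = to_extpipe_run(sample_event)
```

The output dataset of the job would correspond to the data set the extraction pipeline is attached to, which is why no dataset field appears in the run payload itself.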

 

The more complex alternative, 2), is to use CDF's data resource types to represent your DAG/network. OpenLineage's “dataset” can be represented as a CDF asset, and the “job” is also an asset, with CDF relationships linking them into a DAG. “Runs” can be modelled as events. This can give you a richer data model, but it also means more work on the client side, as you have to interpret the model yourself to get at the information.
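To make option 2) a bit more concrete, here is a minimal sketch that turns one OpenLineage-style run event into asset, relationship and event dicts, plus the kind of client-side traversal you would need to answer "what is upstream of this dataset?". Nothing here calls CDF; the camelCase field names mirror how I understand the asset, relationship and event resources, while the `job:`/`dataset:` ID prefixes, the `lineage_run` event type and the sample names are hypothetical conventions.

```python
def to_cdf_graph(event):
    """Translate one run event into (assets, relationships, events) as plain dicts."""
    job_id = f"job:{event['job']['name']}"
    assets = {
        job_id: {"externalId": job_id, "name": event["job"]["name"], "metadata": {"kind": "job"}}
    }
    relationships = []

    def add_dataset(name):
        ds_id = f"dataset:{name}"
        assets[ds_id] = {"externalId": ds_id, "name": name, "metadata": {"kind": "dataset"}}
        return ds_id

    def link(source, target):
        # One directed edge of the DAG: data flows from source to target.
        relationships.append({
            "externalId": f"{source}->{target}",
            "sourceExternalId": source,
            "sourceType": "asset",
            "targetExternalId": target,
            "targetType": "asset",
        })

    for ds in event["inputs"]:
        link(add_dataset(ds["name"]), job_id)
    for ds in event["outputs"]:
        link(job_id, add_dataset(ds["name"]))

    run_event = {
        "externalId": f"run:{event['run']['runId']}:{event['eventType']}",
        "type": "lineage_run",  # hypothetical event type
        "subtype": event["eventType"],
        "metadata": {"runId": event["run"]["runId"], "job": job_id},
    }
    return list(assets.values()), relationships, [run_event]


def upstream(asset_id, relationships):
    """Walk relationships backwards and collect everything asset_id depends on."""
    seen, stack = set(), [asset_id]
    while stack:
        current = stack.pop()
        for rel in relationships:
            source = rel["sourceExternalId"]
            if rel["targetExternalId"] == current and source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


# Hypothetical event, trimmed to the fields the functions read.
sample_event = {
    "eventType": "COMPLETE",
    "run": {"runId": "run-001"},
    "job": {"name": "clean_ais"},
    "inputs": [{"name": "raw_ais"}],
    "outputs": [{"name": "clean_positions"}],
}
assets, relationships, events = to_cdf_graph(sample_event)
lineage = upstream("dataset:clean_positions", relationships)
```

The `upstream` helper is the "more work on the client side" part: CDF stores the edges, but walking the DAG is up to you.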

Best,

Kjetil

Hi @Kjetil Halvorsen,

Thank you so much for your response.

I really like how you approach this problem in two ways. The easy solution seems particularly relevant to teams who have not set up the necessary infrastructure for data processing.

The “hard” solution is, however, more relevant to OpenLineage. It makes perfect sense to model runs as events the way you describe, with the other object types as assets.

I wonder if templates could fit this scheme? OpenLineage essentially has a GraphQL definition already (link). From what I understand, relationships are not yet supported by CDF Templates, but templates could allow us to create job views and dataset views?