Hi everyone,
We at C4IR Ocean are starting a series in which we challenge the community to model a specific problem in CDF. The aim of this series is to facilitate discussion and invite community members to share interesting solutions and techniques.
The first challenge focuses on OpenLineage.
OpenLineage is an open standard for metadata and lineage collection designed to instrument jobs as they are running. It defines a generic model of run, job, and dataset entities identified using consistent naming strategies. The core lineage model is extensible by defining specific facets to enrich those entities.
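To make the run, job, and dataset entities concrete, here is a minimal sketch of what a single run event could look like as a plain Python dict. The top-level field names follow the OpenLineage run event layout as we understand it; the helper `make_run_event`, the producer URL, the namespaces, and the dataset names are hypothetical, and the `schemaURL` value is a placeholder that should be checked against the spec version you target.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_namespace, job_name, inputs, outputs, run_id=None):
    """Build a dict shaped like an OpenLineage run event (hypothetical helper)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Placeholder values: replace with your own producer and spec version.
        "producer": "https://example.com/hypothetical-pipeline",
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        # A run is one execution of a job, identified by a UUID.
        "run": {"runId": run_id or str(uuid.uuid4()), "facets": {}},
        # Jobs and datasets are identified by a namespace plus a name.
        "job": {"namespace": job_namespace, "name": job_name, "facets": {}},
        "inputs": [{"namespace": ns, "name": n, "facets": {}} for ns, n in inputs],
        "outputs": [{"namespace": ns, "name": n, "facets": {}} for ns, n in outputs],
    }


event = make_run_event(
    "COMPLETE",
    job_namespace="example-namespace",
    job_name="clean_positions",
    inputs=[("provider-a", "raw_positions")],
    outputs=[("example-namespace", "clean_positions")],
)
print(json.dumps(event, indent=2))
```

The empty `facets` dicts are where the extensions mentioned above would go, for example a facet describing the schema or the license of a dataset.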
In C4IR Ocean, we are dealing with data from multiple providers. Some datasets are open and publicly available, others are closed. It is therefore important to keep track of where data is coming from, and what transformations have been applied to the data after it is read from the provider.
There is now a healthy landscape of data lineage solutions, both open and closed source. At the same time, the number of scheduling and compute frameworks is growing rapidly, and with it the need for standardization.
OpenLineage aims to provide this standardization for data lineage, similar to how OpenTelemetry is attempting to provide a standardized way of handling metrics, logs and traces. Schedulers and compute frameworks need only worry about a single interface, and connections to the lineage framework are handled on the backend. This means that different schedulers such as Airflow, Dagster and Prefect, as well as compute frameworks such as Dask and Spark, all speak the same lineage language. It also enables us to swap between lineage frameworks without having to make changes to every single pipeline.
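As a rough illustration of the "single interface" idea, the sketch below has pipeline code talk to one small interface while the backend behind it is swapped freely. Everything here is hypothetical (the `LineageBackend` protocol, the two backends, and `run_pipeline_step`); it is not the OpenLineage client API, only a toy model of the pattern.

```python
from typing import Protocol


class LineageBackend(Protocol):
    """Hypothetical interface: the only thing a pipeline needs to know about."""

    def emit(self, event: dict) -> None: ...


class InMemoryBackend:
    """Keeps events in a list, e.g. for tests."""

    def __init__(self) -> None:
        self.events: list[dict] = []

    def emit(self, event: dict) -> None:
        self.events.append(event)


class PrintBackend:
    """Writes a one-line summary of each event to stdout."""

    def emit(self, event: dict) -> None:
        print(event["eventType"], event["job"]["name"])


def run_pipeline_step(backend: LineageBackend, job_name: str) -> None:
    # The pipeline emits lineage events without knowing where they end up.
    backend.emit({"eventType": "START", "job": {"name": job_name}})
    # ... the actual transformation would run here ...
    backend.emit({"eventType": "COMPLETE", "job": {"name": job_name}})


memory = InMemoryBackend()
run_pipeline_step(memory, "clean_positions")
run_pipeline_step(PrintBackend(), "clean_positions")
```

Swapping `InMemoryBackend` for `PrintBackend` requires no change to `run_pipeline_step`, which is the property that lets a lineage framework be replaced without touching every pipeline.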
Our challenge to the CDF community is simple: how would you model the OpenLineage specification in CDF? What APIs would you use and why?
We invite anyone in the community to participate in the challenge. There is no limit on which APIs you use. Both general availability APIs (V1) and playground APIs are welcome.
Looking forward to seeing what you come up with :)