Hi,
I am not sure if this is already on your roadmap, but I wanted to share an idea that could improve the current data management processes in CDF. For some data integrations, our team uses a single extractor to fetch data that multiple datasets consume. We do this to minimize the load on our source enterprise databases and to reduce maintenance by running a single larger extraction rather than many smaller ones. However, this approach presents a challenge for monitoring and lineage visibility, because each extraction pipeline can be linked to only one dataset.
The core of the proposal is to support a "many-to-many" relationship between Extraction Pipelines and Datasets. This would let us maintain our integrations more easily while significantly improving lineage visibility for all related datasets. Implementing such a feature would give a clearer overview of data dependencies, which I believe firms other than AkerBP would also find useful.
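To make the proposal concrete, here is a minimal sketch of the idea (all identifiers are made up for illustration, not real CDF external IDs): replacing the current one-to-one pipeline-to-dataset field with a link table that can be queried in both directions for lineage.

```python
# Today: one extraction pipeline maps to exactly one dataset.
pipeline_to_dataset = {"erp-extractor": "ds-maintenance"}

# Proposed: a many-to-many link table, so a single pipeline can feed
# several datasets and lineage can be resolved in both directions.
pipeline_links = [
    ("erp-extractor", "ds-maintenance"),
    ("erp-extractor", "ds-production"),
    ("erp-extractor", "ds-finance"),
]

def datasets_for_pipeline(links, pipeline):
    """All datasets fed by one extraction pipeline (monitoring view)."""
    return sorted(ds for p, ds in links if p == pipeline)

def pipelines_for_dataset(links, dataset):
    """All pipelines a dataset depends on (lineage view)."""
    return sorted(p for p, ds in links if ds == dataset)

print(datasets_for_pipeline(pipeline_links, "erp-extractor"))
# → ['ds-finance', 'ds-maintenance', 'ds-production']
print(pipelines_for_dataset(pipeline_links, "ds-production"))
# → ['erp-extractor']
```

With a model like this, a failed run of "erp-extractor" could surface an alert on all three datasets rather than just one.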
Thanks @MatiasRamsland and @Øystein Aspøy.
Just for clarification: is the extractor in this example writing to multiple datasets directly, or to a single dataset that is subsequently processed/moved into a number of other datasets?
In general I think this sounds like a good idea, and we’ll consider it as part of the improvements we are planning for monitoring.
For clarification @Jørgen Lund: The extractor is writing to multiple datasets directly. Hence multiple datasets rely on one extraction pipeline.