"Many-to-Many" relationship between "Extraction Pipelines" and datasets

Related products: Extractors

Hi, 

I am not sure if this is already on your roadmap, but I wanted to share an idea that could improve the current data management processes in CDF. For some of our data integrations, our team uses a single extractor to fetch data that multiple datasets use. We do this to minimize the load on our source enterprise databases and to reduce maintenance, since one larger extraction is easier to maintain than many smaller ones. However, this approach presents a challenge for monitoring and lineage visibility, as each extraction pipeline can only be linked to one dataset.

The core of the proposal is to consider a "many-to-many" relationship between Extraction Pipelines and Datasets. This would allow us to maintain our integrations more easily while significantly improving lineage visibility for all related datasets. Implementing such a feature could provide a clearer overview of data dependencies, which I believe firms other than AkerBP would also find useful.
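To make the limitation concrete, here is a minimal sketch using the Cognite Python SDK. The client configuration and identifiers are illustrative, and the data_set_ids field mentioned in the last comment is a hypothetical extension, not part of the current API.

```python
# Minimal sketch using the Cognite Python SDK (cognite-sdk-python).
# Assumes an already-configured CogniteClient; all identifiers are illustrative.
from cognite.client import CogniteClient
from cognite.client.data_classes import ExtractionPipeline

client = CogniteClient()  # credentials/config assumed to be set up elsewhere

# Today an extraction pipeline references a single dataset via data_set_id,
# even when the extractor behind it writes to several datasets.
pipeline = client.extraction_pipelines.create(
    ExtractionPipeline(
        external_id="sap-equipment-extractor",  # illustrative external ID
        name="SAP equipment extractor",
        data_set_id=1234567890123,              # only one dataset can be linked
    )
)

# The proposal would allow something like a list of dataset references instead,
# e.g. data_set_ids=[...], so that lineage for every consuming dataset points
# back to the same pipeline. (Hypothetical field, not in the current API.)
```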

@Jørgen Lund Maybe something for you to look into as part of the ongoing monitoring improvements.


Thanks @MatiasRamsland and @Øystein Aspøy.

Just for clarification: is the extractor in this example writing to multiple datasets directly, or to a single dataset that is subsequently processed/moved into a number of other datasets?

In general I think this sounds like a good idea, and we’ll consider it as part of the improvements we are planning for monitoring.

 


For clarification @Jørgen Lund: the extractor is writing to multiple datasets directly. Hence, multiple datasets rely on one extraction pipeline.