Best practices for handling large legacy data

October 17, 2025

Oussama ALLALI

Hi,

Context:

We currently organize our data in CDF using a per-country partitioning strategy, where each country has its own space.
This approach was chosen primarily to restrict data access by country in a fine-grained manner.

On top of that, we expose data models grouped by Business Object, such as Well Architecture, Cost Model, etc.
Each Business Object model aggregates several data objects under a common business theme, which also allows us to control access by business domain in addition to country-based access.

We are now planning to integrate a large amount of historical data, which will likely increase our model size by around 4x.
These historical datasets are rarely queried, but we want to make sure their addition does not degrade performance for operational data — both in query latency and data ingestion throughput.

We are evaluating two potential strategies:

1. Keep everything in the same spaces, adding an indexed attribute (e.g., is_legacy = true) to distinguish legacy records.
2. Create dedicated legacy spaces per country (e.g., fr_legacy, dz_legacy) to isolate historical data.

Could you please advise whether separating the legacy data into dedicated spaces would actually provide a measurable performance improvement (in terms of query scope or ingestion throughput)?

In other words:

- Does space-level separation help reduce query latency compared to filtering on an indexed is_legacy attribute?
- Are there any best practices or known trade-offs in CDF regarding space-based vs. attribute-based partitioning, especially for large-scale data models?

Thank you for your insights and any recommendations you can share on this topic.

Arild Eide
October 20, 2025

Hi @Oussama ALLALI

Thanks for the thorough question and description.

Having discussed your post with the team, the sentiment seems to be that there should not be any big performance differences between adding a boolean attribute is_legacy and using two different spaces per country.

Using the is_legacy attribute would require you to add an index for this property, as you point out. With a 20%-80% split between true and false, this index would not be very selective, possibly making PostgreSQL resort to a sequential scan despite the presence of an index.
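To make the selectivity point concrete, here is a minimal sketch that computes the fraction of rows each value of is_legacy would match. The row counts are hypothetical, and which side of the split is legacy is an assumption. It does not model the query planner, whose actual choice depends on table statistics and cost settings.

```python
def selectivity(matching_rows: int, total_rows: int) -> float:
    """Fraction of rows a predicate matches; a lower value is more selective."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return matching_rows / total_rows


# Hypothetical row counts illustrating a 20%-80% split (assumed: legacy is the 80%).
total_rows = 1_000_000
operational_rows = 200_000
legacy_rows = total_rows - operational_rows

legacy_fraction = selectivity(legacy_rows, total_rows)            # is_legacy = true
operational_fraction = selectivity(operational_rows, total_rows)  # is_legacy = false

print(f"is_legacy = true matches {legacy_fraction:.0%} of rows")
print(f"is_legacy = false matches {operational_fraction:.0%} of rows")
```

A boolean column can only ever split the table into two groups, so neither predicate narrows the search the way a high-cardinality key would.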

So the recommendation is probably that you should model this using a separate legacy space per country. It would eliminate the need to create and maintain an additional index, and it would also allow you to grant access to legacy records separately. The Space concept is also the key mechanism for partitioning data.
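As a rough illustration of the two options, the sketch below builds the filter each one would need as a plain Python dictionary. The filter shape follows my understanding of the data modeling filter syntax and should be checked against the API reference. The view name WellArchitecture, its version v1, and the helper function names are hypothetical.

```python
from typing import Any, Dict

Filter = Dict[str, Any]


def space_filter(space: str) -> Filter:
    """Option 2: select nodes by the space they live in (no extra index needed)."""
    return {"equals": {"property": ["node", "space"], "value": space}}


def legacy_flag_filter(
    view_space: str, view_external_id: str, view_version: str, is_legacy: bool
) -> Filter:
    """Option 1: select nodes by an indexed boolean property on a view."""
    # Assumed property reference shape: [space, "<view externalId>/<version>", property].
    view_ref = f"{view_external_id}/{view_version}"
    return {
        "equals": {
            "property": [view_space, view_ref, "is_legacy"],
            "value": is_legacy,
        }
    }


# Legacy data for France, isolated by space.
legacy_fr = space_filter("fr_legacy")

# Operational data for France, isolated by flag inside the shared space.
operational_fr = legacy_flag_filter("fr", "WellArchitecture", "v1", False)

print(legacy_fr)
print(operational_fr)
```

With the space-based option, operational queries against fr never need to mention legacy data at all, whereas the flag-based option requires every operational query to carry the extra predicate.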

In general, there is this article on performance considerations and another one on debug notices for query optimizations.

Regards,

Arild Eide