Question

Disaster Recovery for CDF

  • August 29, 2025
  • 1 reply
  • 65 views

Hi Team,

We are planning to create a Business Continuity and Disaster Recovery document for CDF in our environment, and we are using https://docs.cognite.com/cdf/trust/security/reliability_intro/ as a reference.

However, we are missing some information regarding the RTO and RPO for the following scenarios:

  1. Full cluster restore 
  2. CDF project restore

Could you please help us with this?

Thanks

Abhra

1 reply

joar.saether
Practitioner
  • Director Site Reliability Cognite
  • September 10, 2025

Hello Abhra,

The document you refer to has a sibling page that also contains some relevant information about the Disaster Recovery offering. See this page: https://docs.cognite.com/cdf/trust/security/availability_continuity/

As for the Recovery Time Objective (RTO), you can see that our backup schedule and restore times differ between resource types. In the worst case, restoring backups of data points in time series and sequences might take 5 business days. When we restore data from backups, we typically lock the environment undergoing the restore until the entire restore job is finished. Hence, if time series data points take 5 days to restore, all the other resource types will be unavailable for just as long. For most clusters, the data point volumes are small enough to complete the restore within a day or so. For your planning, however, the correct number to use is 5 business days.
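To make the planning rule above concrete, here is a minimal Python sketch, not an official Cognite tool. It assumes that the effective RTO for a locked environment is simply the longest restore time of any resource type involved. The 5 business days for data points comes from the explanation above; the second entry and all names in the code are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestoreEstimate:
    """Estimated worst-case restore time for one resource type."""
    resource_type: str
    business_days: float


def effective_rto_business_days(estimates):
    """Return the planning RTO for an environment that stays locked
    until every resource type has been restored.

    Because the lock is released only when the whole job is done,
    the slowest resource type determines the RTO for all of them.
    """
    if not estimates:
        raise ValueError("at least one restore estimate is required")
    return max(e.business_days for e in estimates)


estimates = [
    # Worst case stated in the reply above.
    RestoreEstimate("time series and sequences data points", 5.0),
    # Hypothetical placeholder; look up real values in the documentation.
    RestoreEstimate("other resource types (hypothetical)", 1.0),
]

print(effective_rto_business_days(estimates))
```

With these inputs, the planning RTO is driven entirely by the data point restore.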

RPO - The Recovery Point Objective for most resource types can be selected with one-minute resolution, as the data stores we use offer point-in-time restore. For a Disaster Recovery where multiple resource types are involved, the RPO will be set to a time to which all resource types can be restored, because we want to provide a consistent data model when your access to the CDF project is resumed.
Time series and sequences data points are backed up once per week. This makes our worst-case RPO 7 days back in time, if a DR job needs to start while the latest backup is still running.
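The consistency rule above can be sketched as follows. This is an illustrative Python example under the assumption that each resource type has a "latest restorable point" and that the common restore point is the earliest of these. The function names, the dictionary keys, and the timestamps are hypothetical; only the one-minute resolution and the 7-day worst case are taken from the text.

```python
from datetime import datetime, timedelta


def consistent_restore_point(latest_restorable):
    """Return the latest time to which ALL resource types can be restored.

    Each value is the most recent restorable point for one resource type,
    so the common point is the earliest of them.
    """
    if not latest_restorable:
        raise ValueError("at least one resource type is required")
    return min(latest_restorable.values())


def data_loss_window(incident_time, latest_restorable):
    """Return how far back in time the consistent restore point lies."""
    return incident_time - consistent_restore_point(latest_restorable)


# Hypothetical incident time, chosen only for illustration.
incident = datetime(2025, 9, 10, 12, 0)

latest = {
    # Point-in-time restore with one-minute resolution.
    "point_in_time_resources": incident - timedelta(minutes=1),
    # Worst case for the weekly data point backup described above.
    "datapoints_weekly_backup": incident - timedelta(days=7),
}

loss = data_loss_window(incident, latest)
print(loss.days)
```

In this worst-case scenario, the weekly data point backup alone sets the RPO for the whole project restore.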

The weekly backup frequency for data points has made it necessary to introduce additional measures for restoring data points to times between the starts of two backup jobs. Data points are stored in FoundationDB. The data is replicated to 3 different instances, and the database keeps versions of all data. For a Disaster Recovery of the type “project restore”, Cognite is able to roll back your data point store to the state it had at any time, with one-minute resolution, between today and the last backup that was run. If you are able to keep the traffic towards the time series and sequences services to a minimum, the version rollback operation typically takes hours rather than many days.

Let me know if you find the above helpful.
If you have additional details to share about your DR plan and BCP, I can help guide you further.