Hi there,
Before I begin, some terminology to make sure we are on the same page.
CDF tenant = CDF project (these terms have been used interchangeably over time; I am not sure which is current, so I will refer to it as tenant)
CDF cluster = The cloud cluster where the CDF tenants/projects/instances are hosted
Now, let’s begin.
At Aker BP we have been migrating to a dedicated CDF cluster on Azure, and we have experienced some pain points caused by the compute resources being shared between all tenants on the cluster. Previously our CDF tenants were hosted on GCP, on a cluster shared with other companies. That cluster is scaled to handle multiple customers, so the base scaling of its compute resources is much higher than it would be on a dedicated cluster, which makes these problems less likely to be encountered there (not impossible, just not as likely).
The problem we have seen is that if we overload the API from our dev tenant, the API becomes unavailable for all our tenants. In other words, we are able to bring down our production tenant from our development and test tenants, which is not good. I want to mention that this has so far been limited to one resource/endpoint at a time, so each outage is limited to the resource that has been overloaded and not the whole CDF API.
We have primarily seen this issue with diagram parsing and Cognite transformations, but shared compute is common practice across the backend resources of CDF. The symptoms we see when it occurs are generally "service unavailable", "too many requests", or simply excessively long response times from the API. If we send enough requests from dev to reach the cluster limits, then a single request from production can get the "too many requests" error. For Cognite transformations, one of the "solutions" we have been presented with is to schedule them at different times in different environments. This would work as a temporary mitigation until a real solution is implemented, but it does not seem right that we have to take what happens in dev into consideration when we deploy something to production. In our opinion, production should be unaffected by dev.
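To illustrate what that mitigation means in practice, here is a minimal sketch of staggering schedules per environment. It assumes the Cognite Python SDK's transformation schedule API; the transformation external ID and the cron offsets are hypothetical.

```python
# Sketch of the suggested mitigation: stagger transformation schedules per
# environment so dev/test/prod do not hit the shared compute at the same time.
# Assumes the Cognite Python SDK; external IDs and cron offsets are hypothetical.
from cognite.client import CogniteClient
from cognite.client.data_classes import TransformationSchedule

# Hypothetical offsets: the same hourly transformation runs at a different
# minute in each environment to avoid overlapping with the other tenants.
CRON_PER_ENV = {
    "dev": "0 * * * *",    # minute 0
    "test": "15 * * * *",  # minute 15
    "prod": "30 * * * *",  # minute 30
}

def schedule_for_env(client: CogniteClient, env: str) -> None:
    """Create the schedule for this environment's copy of the transformation."""
    client.transformations.schedules.create(
        TransformationSchedule(
            external_id=f"tr-asset-hierarchy-{env}",  # hypothetical transformation
            interval=CRON_PER_ENV[env],
            is_paused=False,
        )
    )
```

Even in this sketch the problem is visible: the production schedule has to be chosen with the dev and test schedules in mind, which is exactly the coupling we want to avoid.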
Let me give an example where I have been able to make production unavailable at will just by running a relatively small load in dev.
We have a job that uses the "diagram.detect" endpoint to parse/annotate P&ID documents. The theoretical limit of this endpoint is 100 batches in parallel with 50 files per batch, per tenant. Yet running 10 batches with 50 files each in dev has proven to be enough to saturate the compute, making it unavailable for the rest of the cluster. During the test I had a small job running in prod, 1 batch with 1 file, with each batch normally taking 5-10 seconds; when the job in dev started, the response time in prod went up to 1-2 hours. We have four tenants: sandbox, development, test, and production. The theoretical limit is 100 in parallel per tenant, and with four tenants that is 400, so if 10 in parallel in one tenant is enough to saturate the whole cluster, we are at 2.5% of the theoretical load. I call that relatively small.
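For reference, this is roughly what the dev-side load looked like, as a minimal sketch assuming the Cognite Python SDK's diagrams.detect; the tag entities and file IDs are hypothetical placeholders.

```python
# Sketch of the dev-side load test: 10 parallel detect batches with 50 files
# each, i.e. 10% of the documented limit of 100 parallel batches per tenant.
# Assumes the Cognite Python SDK; entities and file IDs are hypothetical.
from cognite.client import CogniteClient

client = CogniteClient()  # configured against the dev tenant

entities = [{"name": tag} for tag in ["21-PT-1019", "21-LT-1023"]]  # hypothetical tags
file_ids = list(range(1000, 1500))  # 500 hypothetical P&ID file IDs

batches = [file_ids[i:i + 50] for i in range(0, len(file_ids), 50)]  # 10 batches of 50

# Each call starts an asynchronous diagram-parsing job on the shared compute.
jobs = [
    client.diagrams.detect(entities=entities, search_field="name", file_ids=batch)
    for batch in batches
]
```

This is an ordinary use of the endpoint from dev, and it was enough to push prod response times from seconds to hours.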
My concern is not the performance, but the integrity of our production tenant. From our perspective, nothing we do in dev or test should be able to bring production to its knees. But at the moment, the people working in dev have to take into account that running too big a load at the wrong time can negatively affect what we have in production.
It is one thing that I do this in a deliberate and controlled manner in dev, but we also have a sandbox tenant that is open for anyone in our company to use and experiment with. It has no real data, and nothing done there should be able to affect anything else, so what could possibly go wrong? Me breaking production from dev on purpose, to demonstrate a point, is acceptable (bad, but acceptable) and temporary; anyone being able to do this accidentally from our sandbox tenant would be a disaster.
Some of these issues are due to us migrating to a dedicated cluster whose base scaling probably isn't enough to handle the load we put on it, but the issues existed on the shared cluster as well, just to a different degree. On the dedicated cluster the only ones we affect are ourselves, while on the shared cluster there have been occasions where incidents affected all customers sharing it. So on a shared cluster this is less likely to happen, but the impact is potentially much larger.
I understand some of the reasoning behind having shared compute across the cluster:
- Easier management of resources and subsystems
- Dynamic scaling beyond the initial limits of each tenant
Even though I understand these considerations, what I need to understand is the risk assessment that has been done regarding one tenant having the capability to bring down all tenants on the cluster. Scaling the cluster mitigates the problem, but it does not eliminate the risk. I do not know what a perfect solution would be, but we need some form of guarantee that our production tenant is safe from whatever is done in the other tenants.
For us, this is no longer a hypothetical issue; it is now blocking some of our integrations from going to production with their intended SLAs.
Regards,
Markus Pettersen
Aker BP - CDF Data Delivery - Tech Lead