
Newly encountered Read Timeout error

  • May 23, 2025
  • 11 replies
  • 106 views


Hi Hub,

 

We have some backend code running as a GitHub Action that fetches events using `clients.events.list` over a period of 1 year. The fetching is of course chunked into smaller time frames to avoid memory overload. This has worked just fine until recently, when we started frequently encountering a

CogniteReadTimeout

error, which can be traced back to the following HTTP request error:

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='az-ams-sp-002.cognitedata.com', port=443): Read timed out. (read timeout=30)

The error occurs inconsistently, and not necessarily at the same point in the workflow run.

 

Have there been any recent changes/restrictions in the Cognite SDK that could explain the cause of the issue? As mentioned, some weeks ago the workflow succeeded without problems.

 

For additional context, we are using multiple threads to fetch events in parallel over metadata filters, for events within the same time interval. We also use multiple cores to parallelize fetching and subsequent processing over the different time intervals of the 1-year time frame.
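Roughly, the per-window fetch looks like the sketch below (the metadata filters and worker counts are invented placeholders; the fan-out of windows across cores is omitted):

```python
from concurrent.futures import ThreadPoolExecutor

# Invented placeholder filters; the real job uses its own metadata keys/values
METADATA_FILTERS = [{"source": "plant-a"}, {"source": "plant-b"}]

def fetch_window(client, start_ms: int, end_ms: int) -> list:
    """Fetch all events in one time window, one thread per metadata filter."""
    def fetch(meta: dict) -> list:
        return client.events.list(
            metadata=meta,
            start_time={"min": start_ms, "max": end_ms},
            limit=None,  # exhaust the window; chunking is done by time, not by limit
        )
    with ThreadPoolExecutor(max_workers=len(METADATA_FILTERS)) as pool:
        return [event for events in pool.map(fetch, METADATA_FILTERS) for event in events]
```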

 

Thanks!

 

Best regards,

Vetle.

Best answer by Sergei Voronichev

@Vetle Nevland Is it possible to slow down the writes with some controlled rate or reduce the number of parallel jobs?

11 replies

Mithila Jayalath
Seasoned Practitioner

@Vetle Nevland in order to further troubleshoot your issue, can you please provide the Cluster, Project name, and Request IDs (or some time periods where this occurs, if no request IDs are available)?

 


  • Author
  • Committed
  • May 26, 2025

@Mithila Jayalath Cluster name: az-ams-sp-002. Project name: abp-dev. I didn’t store any Request ID, but the workflow started 2025-05-23 13:24:35, and failed at 2025-05-23 13:43:14.


@Vetle Nevland For further investigation: could you provide an example of the list request (payload) the jobs use?

To temporarily mitigate the issue: the SDK timeout can be raised to 60 s by setting the `timeout` parameter in the `ClientConfig` object: https://cognite-sdk-python.readthedocs-hosted.com/en/latest/settings.html#client-configuration
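For reference, a minimal sketch of raising the timeout (cluster and project taken from this thread; the client name and credential values are placeholders):

```python
from cognite.client import ClientConfig, CogniteClient
from cognite.client.credentials import OAuthClientCredentials

# Placeholder credentials; substitute your real IdP values
credentials = OAuthClientCredentials(
    token_url="https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token",
    client_id="<client-id>",
    client_secret="<client-secret>",
    scopes=["https://az-ams-sp-002.cognitedata.com/.default"],
)

config = ClientConfig(
    client_name="events-backfill-job",  # placeholder name
    project="abp-dev",
    base_url="https://az-ams-sp-002.cognitedata.com",
    credentials=credentials,
    timeout=60,  # seconds; raises the default read timeout of 30 s
)
client = CogniteClient(config)
```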

To give the storage more “oxygen”: is it possible to delete unused events for the project? I see the project currently has 350M events in storage.


  • Author
  • Committed
  • May 27, 2025

@Sergei Voronichev The payload the jobs use that triggers the timeout error is: 

client.events.upsert(item=chunk, mode="patch")

where chunk is a list of `EventWrite` objects scoped to our dataset (1000 events per chunk).
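As a sketch, the write side then looks something like this (the chunking helper is illustrative, not our exact code):

```python
from cognite.client.data_classes import EventWrite

CHUNK_SIZE = 1000  # events per upsert call (later reduced; see below)

def upsert_in_chunks(client, events: list[EventWrite], chunk_size: int = CHUNK_SIZE) -> None:
    """Upsert events in fixed-size chunks; mode="patch" updates existing events in place."""
    for i in range(0, len(events), chunk_size):
        chunk = events[i : i + chunk_size]
        client.events.upsert(item=chunk, mode="patch")
```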

We have tried increasing the timeout to 60 seconds, but we experience the same error.


@Vetle Nevland 
>but we experience the same error.
do you mean this one with `timeout=60`?

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='az-ams-sp-002.cognitedata.com', port=443): Read timed out. (read timeout=30)

@Vetle Nevland In the ticket description it’s stated your app reads events via `clients.events.list`, but now you’re writing that the errors are on `client.events.upsert`. I don’t fully get the flow. Do you have a request ID for a failed request?


Forum|alt.badge.img
  • Author
  • Committed
  • May 27, 2025

@Sergei Voronichev Correct, same error with timeout=60.

 

Sorry for the confusion. We indeed read events using `client.events.list`, but after a closer look at the log, the error traces back to `client.events.upsert`. The flow is:

  • Read events from CDF (a dedicated read dataset) using `client.events.list`
  • Process and filter the events using Cognite Python SDK
  • Write events back to CDF (in a dedicated write dataset) using `client.events.upsert`
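Put together, one window of the flow looks roughly like the sketch below (the dataset identifiers and the filter predicate are invented placeholders):

```python
from cognite.client.data_classes import EventWrite

READ_DATASET = "events-read"   # placeholder external ID of the read dataset
WRITE_DATASET_ID = 123456789   # placeholder internal ID of the write dataset

def process_window(client, start_ms: int, end_ms: int) -> None:
    # 1. Read events from the dedicated read dataset for one time window
    events = client.events.list(
        data_set_external_ids=[READ_DATASET],
        start_time={"min": start_ms, "max": end_ms},
        limit=None,
    )
    # 2. Process and filter (placeholder predicate)
    filtered = [e for e in events if e.metadata]
    # 3. Write the results back to the dedicated write dataset
    chunk = [
        EventWrite(
            external_id=f"processed-{e.external_id}",
            start_time=e.start_time,
            end_time=e.end_time,
            metadata=e.metadata,
            data_set_id=WRITE_DATASET_ID,
        )
        for e in filtered
    ]
    client.events.upsert(item=chunk, mode="patch")
```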

Where can I find / get the request ID of a failed request?

 

Error: [screenshot]

Traceback: [screenshot]


@Vetle Nevland Is it possible to slow down the writes with some controlled rate or reduce the number of parallel jobs?
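One way to do that, as a sketch (the worker count and pause are illustrative values, not recommendations):

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_WRITERS = 2    # illustrative: fewer concurrent jobs than before
PAUSE_BETWEEN_WRITES = 1.0  # seconds; a crude per-worker rate limit

def throttled_upsert(client, chunks: list) -> None:
    """Write chunks with bounded parallelism and a pause between calls."""
    def write(chunk):
        client.events.upsert(item=chunk, mode="patch")
        time.sleep(PAUSE_BETWEEN_WRITES)  # back off between writes
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_WRITERS) as pool:
        list(pool.map(write, chunks))
```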


  • Author
  • Committed
  • May 27, 2025

@Sergei Voronichev Ok, we will try to adjust the write rate and/or the number of jobs, and see if we can observe any bottlenecks.

 

Will keep you updated.

 


@Vetle Nevland have you managed to experiment with the load? Have you considered the option of deleting old test data?


  • Author
  • Committed
  • June 2, 2025

@Sergei Voronichev Yes, have done some experimentation now.

 

Bad news:

  • We are only able to delete events from our own dataset, so unfortunately we can’t do much about the 340M events from other datasets.

Good news:

  • Increased the read timeout to 90 seconds, and reduced the number of events upserted in one go from 1000 to 100 (final settings sketched below). Have done multiple deployment tests, and they now succeed :)

Conclusion:

  • I **assume** the root of the problem was rate limitations in combination with an overload of CDF resources. We can only affect the former, but at least this seems to mitigate the error.
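For reference, the working configuration amounts to two changes relative to the earlier sketches (timeout and chunk size are the values from this thread; the client name and credentials remain placeholders):

```python
from cognite.client import ClientConfig, CogniteClient
from cognite.client.credentials import OAuthClientCredentials

CHUNK_SIZE = 100  # events per upsert call, reduced from 1000

def make_client(credentials: OAuthClientCredentials) -> CogniteClient:
    """Client with the raised read timeout that made the deployments succeed."""
    config = ClientConfig(
        client_name="events-backfill-job",  # placeholder, as in the earlier sketch
        project="abp-dev",
        base_url="https://az-ams-sp-002.cognitedata.com",
        credentials=credentials,
        timeout=90,  # seconds; up from the 30 s default
    )
    return CogniteClient(config)
```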