Question

Best approach for bulk downloading large number of files from CDF

  • May 6, 2026

  • Committed ⭐️

Hi Everyone

We're looking for guidance on the best approach to bulk-download file contents from CDF. We have approximately 295,000 files across several datasets that we need to download. We've tried the Cognite Toolkit CLI's cdf data download files --include-file-contents, which works but has a hard limit of 100 files per invocation and requires manual filtering to stay under that cap, which is not feasible at our scale.

Questions for the community:
1. Has anyone done bulk file downloads at this scale? What approach worked best?
2. Is cognite-cdffs suitable for ~295k files, or is the SDK with manual batching more appropriate?
3. Are there any rate limits or throttling considerations at this volume?

Thank you in advance!
David Shaji George

2 replies

Mithila Jayalath
  • Expert ⭐️⭐️⭐️⭐️
  • May 7, 2026

@dgeo I’ll check on this with the engineering team and get back to you with an update.


  • Practitioner ⭐️⭐️⭐️
  • May 7, 2026

Hello, and thank you for your question.

Currently, we do not offer a native bulk-archive export mechanism at that scale. However, you can achieve a full export by writing a custom script with one of our supported SDKs or by calling the REST endpoints directly.

The CDF Files REST API includes an endpoint capable of generating up to 100 download URLs in a single request. You can build your export process around the following workflow:

  1. Paginate through your files: Use the List Files endpoint with a cursor and a limit of 100.

  2. Process each batch: For a given batch of 100 files:

    • Generate the download URLs.

    • Download the file blobs.

    • Archive the blobs.
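
If you go the SDK route, a minimal sketch of this loop using the Cognite Python SDK (cognite-sdk) could look like the following. The client configuration, the "my_dataset" data set external ID, and the directory layout are placeholder assumptions to adapt to your project; the SDK resolves the download URLs and fetches the blobs for each batch for you.

```python
# Minimal sketch, assuming an already-authenticated CogniteClient and a placeholder
# data set external ID ("my_dataset"); adapt filters and paths to your own project.
from pathlib import Path

from cognite.client import CogniteClient

client = CogniteClient()  # assumes credentials/config are already set up for your project

out_root = Path("cdf_export")
out_root.mkdir(exist_ok=True)

# Iterate over file metadata in chunks of 100; the SDK handles the cursor internally.
for i, batch in enumerate(client.files(chunk_size=100, data_set_external_ids=["my_dataset"])):
    # Write each batch into its own folder so files with identical names don't overwrite each other.
    batch_dir = out_root / f"batch_{i:05d}"
    batch_dir.mkdir(exist_ok=True)
    # files.download resolves the download URLs and fetches the blobs for the whole batch.
    client.files.download(directory=batch_dir, id=[f.id for f in batch])
```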

Best Practices & Considerations

  • State Management: We highly recommend storing the pagination cursor persistently (e.g., on local disk or in a database). This lets your script resume from where it left off after a network or other intermittent error, so you only ever need to track the current batch and the cursor.

  • Performance & Parallelization: The actual downloading of the files will be the most time-consuming step. Fortunately, this process can be parallelized to improve speed.

  • Cloud Limits: Please keep in mind that bandwidth and rate limits vary between cloud providers. You may need to experiment to find the optimal balance between download speed and script complexity, especially since file sizes can be massive: in theory, a single batch of 100 files could amount to as much as 100 TB of data.
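
If you work directly against the REST endpoints instead, here is a hedged sketch combining the cursor-persistence and parallelization suggestions above. The base URL, project name, bearer token, and output layout are placeholders, and the worker count is something to tune against your cluster's rate limits.

```python
# Sketch only: list files in pages of 100, generate download URLs per batch,
# download the blobs in parallel, and persist the cursor so the script can resume.
# BASE and the bearer token below are placeholders for your project and credentials.
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests

BASE = "https://api.cognitedata.com/api/v1/projects/my-project"  # placeholder cluster/project
HEADERS = {"Authorization": "Bearer <TOKEN>"}                     # placeholder token
CURSOR_FILE = Path("cursor.json")                                 # resume point between runs


def download_one(item: dict, out_dir: Path) -> None:
    """Stream one blob to disk; `item` is an entry returned by /files/downloadlink."""
    with requests.get(item["downloadUrl"], stream=True, timeout=300) as resp:
        resp.raise_for_status()
        with open(out_dir / str(item["id"]), "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)


def export_all(out_dir: Path = Path("cdf_export")) -> None:
    out_dir.mkdir(exist_ok=True)
    cursor = json.loads(CURSOR_FILE.read_text())["cursor"] if CURSOR_FILE.exists() else None
    while True:
        # 1. List the next batch of (up to) 100 files.
        body = {"limit": 100, **({"cursor": cursor} if cursor else {})}
        page = requests.post(f"{BASE}/files/list", json=body, headers=HEADERS)
        page.raise_for_status()
        page = page.json()
        if not page["items"]:
            break

        # 2. Generate download URLs for the whole batch in one request.
        ids = [{"id": f["id"]} for f in page["items"]]
        links = requests.post(f"{BASE}/files/downloadlink", json={"items": ids}, headers=HEADERS)
        links.raise_for_status()

        # 3. Download the blobs in parallel (tune max_workers against rate limits).
        with ThreadPoolExecutor(max_workers=8) as pool:
            list(pool.map(lambda it: download_one(it, out_dir), links.json()["items"]))

        # 4. Only persist the cursor once the batch has fully downloaded.
        cursor = page.get("nextCursor")
        if not cursor:
            break
        CURSOR_FILE.write_text(json.dumps({"cursor": cursor}))
```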

I hope this helps point you in the right direction for building your export script!