Gathering Interest

Support for Cron-Based Scheduling in Hosted Extractors

Related products: Extractors
  • September 2, 2025
  • 6 replies
  • 89 views


Today, hosted extractors in Cognite Data Fusion only allow users to define a frequency for execution. However, the actual execution time within that interval is randomly assigned. While this may be sufficient for some use cases, it falls short for many real-world scenarios where precise control over execution timing is critical.

For example, in cases where:

  • Extractors must run after upstream systems complete their data dumps (e.g., at 03:15 every night),
  • Data pipelines are chained or dependent on each other and require strict sequencing,
  • Load balancing is needed across systems to avoid peak-hour congestion,
  • Or compliance and audit requirements demand predictable and traceable execution times,

the current interval-based scheduling introduces uncertainty and operational risk.

We need support for cron-based scheduling in hosted extractors so that we can trust them and use them in production.
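To make the request concrete, here is a minimal sketch of the semantics a five-field cron expression would give (e.g. "15 3 * * *" for 03:15 every night, as in the upstream-dump example above). This is purely illustrative; the function name and the simplified matcher (no ranges, lists, or steps) are mine and not part of any CDF API:

```python
from datetime import datetime

def cron_matches(expr: str, dt: datetime) -> bool:
    """Return True if dt matches a five-field cron expression.

    Supports only '*' and plain integers; a full implementation
    would also handle ranges, lists, and step values.
    """
    minute, hour, dom, month, dow = expr.split()
    fields = [
        (minute, dt.minute),
        (hour, dt.hour),
        (dom, dt.day),
        (month, dt.month),
        (dow, dt.isoweekday() % 7),  # cron convention: 0 = Sunday
    ]
    return all(f == "*" or int(f) == value for f, value in fields)

# "Run at 03:15 every night":
print(cron_matches("15 3 * * *", datetime(2025, 9, 2, 3, 15)))  # True
print(cron_matches("15 3 * * *", datetime(2025, 9, 2, 4, 15)))  # False
```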

6 replies


Hi @Haakon, and thanks for this product idea.

I get that using a schedule aligns "better" with the cases you're describing, but frankly, as I read your proposal, a scheduled job would not actually solve the underlying problem.

My read: You’re looking to execute the job whenever the source system has generated data.

AFAIK, for the Kafka and MQTT client (and the upcoming MQTT broker) features of the Hosted Extractors, this is what happens automatically: they read data as it becomes available from the source system.

Load balancing is a _server_ responsibility, not a client one. A client typically connects to an alias from which the server load balances the client connections over time. All of the hosted extractors should be considered clients.

Once Workflows and Hosted Extractors are better integrated, chaining becomes a default behavior. This capability is currently "a development sequencing problem" (in other words, solving it is unfortunately a lower priority than what we have the capacity to deliver over the next 6-9 months).

Sorry, I’m unclear as to how a protocol like MQTT or Kafka can ever deliver “predictable execution times” (I read “traceable” as “logged start/end/volume” - is that an incorrect interpretation?).

Total execution time depends on the volume of data and is normally limited purely by the capacity of the source system(s), so "predictability" is a tricky expectation from a CDF perspective.


  • Author
  • Committed
  • September 4, 2025

Hi Thomas, for the Event Hub, MQTT, and Kafka extractors you are absolutely right, no need for this there :)

This applies specifically to the hosted REST extractor, I should have specified.

Best,

Haakon


  • Author
  • Committed
  • September 25, 2025

Hi @Thomas Sjølshagen, any thoughts on this? :)



We are intentionally avoiding human-settable schedules for the hosted extractors because we've seen so many "thundering herd" problems in other manually scheduled services.

What would be some (much more flexible) alternative ways of indicating frequency that preserve the system's freedom to schedule work when there's idle capacity, and that could possibly meet your needs?
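For example (a hypothetical sketch, not an existing CDF feature): the user could declare an allowed execution window, and the service would pick a randomized time inside it, keeping its freedom to spread jobs out and avoid a thundering herd:

```python
import random
from datetime import datetime, timedelta

def pick_run_time(window_start: datetime, window_end: datetime,
                  rng: random.Random) -> datetime:
    """Pick a randomized execution time inside a user-defined window.

    The user constrains *when* the job may run; the service keeps
    the freedom to spread jobs out within that window.
    """
    span = (window_end - window_start).total_seconds()
    return window_start + timedelta(seconds=rng.uniform(0, span))

# Hypothetical user intent: "run once per night, some time between
# 03:00 and 04:30, after the upstream dump has finished".
start = datetime(2025, 9, 2, 3, 0)
end = datetime(2025, 9, 2, 4, 30)
run_at = pick_run_time(start, end, random.Random(42))
assert start <= run_at <= end
```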


Markus Pettersen
MVP

We have several integrations that run only once a day, and we have requirements either for when we are allowed to read from the source or for when the job needs to finish. Some critical systems only allow reads at certain times because other integrations/jobs/processes take priority. We need a way to accurately plan when something runs. Not everything can be event-driven, even if that would be nice.

Unfortunately this will likely cause a "thundering herd" problem, because a lot of this is REQUIRED to run outside normal business hours (at night or during weekends). We can spread it out to some degree, but these are not requirements that are easy to circumvent. This problem will not go away, and it applies to transformations as well. We will have a lot running at the same time; this is just the reality.



So, one of the roadmap items we do have (on an internal-only roadmap, as it's still being evaluated) is adding integration between Workflows and Hosted Extractors. It has a few "interesting" aspects, for the reasons described above, for the "continuous flow" extractors (Kafka, MQTT, and Azure Event Hub).

It's a lot easier from a REST extractor perspective, where my hypothesis is that the completion of the read operation by the REST extractor can act as a trigger for a Workflow, for instance.
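A rough sketch of that hypothesis (all names and shapes here are made up for illustration, not a real API): the extraction run itself fires the downstream trigger when it completes, so sequencing comes from the data flow rather than from a wall-clock schedule.

```python
from typing import Callable

def run_rest_extraction(extract: Callable[[], int],
                        on_complete: Callable[[int], None]) -> None:
    """Run a REST read and, on completion, fire a downstream trigger.

    Sketch of the hypothesis above: completion of the extraction is
    the event that starts the dependent Workflow, instead of a cron.
    """
    rows = extract()    # pull from the source system
    on_complete(rows)   # e.g. start the dependent Workflow with the row count

# Usage: a fake extraction that "reads" 128 rows, and a trigger that
# records what it was called with.
triggered = []
run_rest_extraction(lambda: 128, lambda n: triggered.append(n))
assert triggered == [128]
```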