Question

cdf_nodes, indexes and broadcast joins in transformations

  • November 5, 2025
  • 2 replies
  • 29 views

Marwen TALEB
MVP

Hello, 

I have two questions:

Thank you!

2 replies

Mithila Jayalath
Seasoned Practitioner

@Marwen TALEB I’ll check on this with the engineering team and get back to you with an update.


 * When using cdf_nodes in a transformation, does the query optimizer take advantage of the container indexes? From what we have observed, it does not.

Behind the scenes, cdf_nodes uses the endpoint https://api-docs.cognite.com/20230101/tag/Instances/operation/advancedListInstance, and https://api-docs.cognite.com/20230101/tag/Instances/operation/syncContent when is_new is used. Spark is indeed not aware of the indexes, and joins are not pushed down, but filters are. So if you have a filter backed by an index, you'll see a performance boost from it. (Side note: some filters are not pushed down when using is_new, but that generally doesn't cause issues, as the load from that endpoint is lower.)
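To make the filter pushdown concrete, here is a minimal sketch of the kind of request body an equality filter could turn into on the list endpoint linked above. The field names reflect my reading of that API reference and should be checked against it. The helper name, the space, the view and the property are hypothetical, and the exact translation that transformations performs is not described in this thread.

```python
import json


def build_list_nodes_body(space, view_external_id, view_version, prop, value, limit=1000):
    """Build a request body that lists nodes matching one equality filter.

    Hypothetical helper: the body layout is an assumption based on the
    public instances list API reference, not on transformations internals.
    """
    view_ref = {
        "type": "view",
        "space": space,
        "externalId": view_external_id,
        "version": view_version,
    }
    return {
        "instanceType": "node",
        "sources": [{"source": view_ref}],
        # The filter is evaluated by the service, so an index on the
        # underlying container property can be used to answer it.
        "filter": {
            "equals": {
                "property": [space, f"{view_external_id}/{view_version}", prop],
                "value": value,
            }
        },
        "limit": limit,
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    body = build_list_nodes_body("my_space", "Pump", "v1", "serialNumber", "SN-42")
    print(json.dumps(body, indent=2))
```

A join has no equivalent in this request, which is consistent with joins being executed in Spark after the rows have been fetched.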

 * How can we run broadcast joins in a transformation?

You are right that hints are disabled. Most clusters are shared, so we have to put strong limits on what can be broadcast and on how much control users have over it; otherwise it would be too easy to bring down the cluster by accidentally broadcasting too much. Unfortunately, you won't have direct control over broadcasting in transformations. I believe the Spark optimizer will still broadcast when it detects a situation where it makes sense, but I couldn't tell you exactly how it decides.
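For readers unfamiliar with the technique, the following is a plain-Python sketch of what a broadcast hash join does. It is not Spark code and not how transformations implements it, and the function and variable names are hypothetical. In open-source Spark, the automatic choice depends on the estimated size of the smaller side compared with spark.sql.autoBroadcastJoinThreshold; whether transformations keeps that default is not stated in this thread.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join partitioned rows against a small table on `key`.

    The small side is turned into one in-memory lookup table. In a real
    cluster that table is copied to every executor, which is why a side
    that is too large can exhaust memory on a shared cluster.
    """
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    # Each partition is probed independently; the large side is never shuffled.
    for partition in large_partitions:
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


if __name__ == "__main__":
    # Toy data, for illustration only.
    large = [[{"id": 1, "v": "a"}], [{"id": 2, "v": "b"}, {"id": 3, "v": "c"}]]
    small = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
    for joined_row in broadcast_hash_join(large, small, "id"):
        print(joined_row)
```

The memory cost grows with the size of the lookup table multiplied by the number of executors holding a copy, which is the risk the limits on shared clusters are meant to contain.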

Transformations has to support a lot of load from customers with very different workflows, so unfortunately we had to limit what can be done with Spark there to reduce noisy-neighbor issues and self-inflicted cluster crashes.