cdf_nodes, indexes and broadcast joins in transformations

Question

Hello, I have two questions:

* When using cdf_nodes in a transformation, does the query optimizer take advantage of the container indexes? What we notice is that it does not.
* How can we run broadcast joins in a transformation? Hints do not seem to be working: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html

Thank you!

Jacob Eliat-Eliat · Answer

* When using cdf_nodes in transformation, does query optimizer take advantage of the container indexes? What we notice is it is not the case.

Behind the scenes, it uses the endpoint https://api-docs.cognite.com/20230101/tag/Instances/operation/advancedListInstance, and https://api-docs.cognite.com/20230101/tag/Instances/operation/syncContent when using is_new. Spark is indeed not aware of the indexes, and joins are not pushed down. Filters are pushed down, though, so if you have a filter backed by an index, you'll see a performance boost from that. (Side note: some filters are not pushed down when using is_new, but generally that doesn't cause issues, as the load from that endpoint is lower.)

* How can we run broadcast joins in transformation.

You are right that hints are disabled. As most clusters are shared clusters, we have to put strong limitations on what can be broadcast and on how much control users have over it; otherwise it would be too easy to accidentally bring down the cluster by mistakenly broadcasting too much. Unfortunately, you won't have direct control over broadcasting in transformations. I believe the Spark optimizer will still broadcast if it detects situations where it makes sense, but I couldn't tell you exactly how it decides.

Transformations has to support a lot of load coming from customers with very different workflows, so unfortunately we had to put limitations on what can be done with Spark there to limit noisy-neighbor issues and self-inflicted cluster crashes.
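To make the filter pushdown point a bit more concrete, here is a rough sketch of what a filtered request body for the instances list endpoint linked above could look like. This is only an illustration: the space, view, version, and property names (`my_space`, `Pump`, `v1`, `site`) are hypothetical, `build_list_nodes_body` is a helper made up for this example, and I am assuming (not confirming) that a pushed-down `WHERE` clause ends up as a comparable `filter` in the request that transformations send.

```python
import json


def build_list_nodes_body(space, view_external_id, view_version, prop, value, limit=1000):
    """Build a request body that lists nodes of one view, filtered on one property.

    Hypothetical helper for illustration; not part of any Cognite SDK.
    """
    view_ref = {
        "type": "view",
        "space": space,
        "externalId": view_external_id,
        "version": view_version,
    }
    return {
        "instanceType": "node",
        "sources": [{"source": view_ref}],
        # An equality filter on a view property. If the underlying container
        # property has an index, this is the kind of filter that can benefit.
        "filter": {
            "equals": {
                "property": [space, f"{view_external_id}/{view_version}", prop],
                "value": value,
            }
        },
        "limit": limit,
    }


# Hypothetical names, only to show the shape of the body.
body = build_list_nodes_body("my_space", "Pump", "v1", "site", "Oslo")
print(json.dumps(body, indent=2))
```

The practical takeaway is the same as in the answer: since joins are evaluated in Spark rather than by the API, it helps to put selective, index-backed conditions on the cdf_nodes side as plain filters so that less data has to be fetched before the join.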
