
I am doing some ETL jobs in Azure Databricks and have successfully managed to use Cognite’s Spark Data Source to read and write time series, datapoints, etc. from and to CDF. I know that Databricks itself is a cloud platform; however, it would be useful for me to run some or all of the jobs locally during the development phase. Is it still possible to somehow test-run these Spark jobs locally? The configuration does not seem to be trivial.

I played a little bit with PySpark and was able to run it on my Mac, but I could not create a connection to “cognite.spark.v1” to read or write data.

Do you know if it is possible to perform such an operation? If not, what would you suggest?


 

Hi!

Running our Spark Datasource with Spark set up locally should be fine, and if you’re able to run PySpark you should also have access to the spark-shell command! Our docs give you a helping hand here: https://github.com/cognitedata/cdp-spark-datasource/#quickstart, but the command is simply this:

./bin/pyspark --packages com.cognite.spark.datasource:cdf-spark-datasource_2.12:2.0.10

Make sure to run it from the right directory (the command here should work if run from the folder where you have Spark installed), and please check the Scala version of your local installation and match it (I suggested 2.12 here, which is likely what you have)


Thanks for the reply @Håkon Trømborg. I got a dependency error when I ran this command. My Scala version is 2.13.8. I can install Scala 2.12, but where can I get a list of all com.cognite.spark.datasource releases?

 


Releases are here: https://github.com/cognitedata/cdp-spark-datasource/releases, but I see we haven’t been keeping them fully up to date. 2.13 should also be fine; do you get an error when you use the following?

./bin/pyspark --packages com.cognite.spark.datasource:cdf-spark-datasource_2.13:2.0.10


Yes, it does not find the dependency. 

 

:: com.cognite.spark.datasource#cdf-spark-datasource_2.13;2.0.10: not found

 is the top error message. 


Does this help? 

File "/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/python/pyspark/java_gateway.py", line 108, in launch_gateway
raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number

I can otherwise use Python 3, Scala, and Java on my Mac (M1).


Seems I’m also a little bit out of date here; we made a breaking change some time back and changed the name of the artifact. Does this work?
com.cognite.spark.datasource:cdf-spark-datasource-fat_2.13


Some progress here, but still an error. It finds the package now, but there are new complaints. First, some warnings that look suspicious:

 

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

 

and then this error:

 

:: loading settings :: url = jar:file:/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: com.cognite.spark.datasource:cdf-spark-datasource-fat_2.13

maybe it just wants the version appended?


Ah right, yes, it wants the version (I didn’t paste the full coordinate in my latest comment), so please append :2.0.10.
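That is, the full command should look something like this:

./bin/pyspark --packages com.cognite.spark.datasource:cdf-spark-datasource-fat_2.13:2.0.10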

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)

This warning mentions spark-unsafe_2.12; could it also be that your local Spark installation happens to be Spark 3.2.1 built for Scala 2.12?
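One way to check which Scala version the Spark installation itself was built with (as opposed to your system Scala) is from inside a running pyspark shell; roughly something like this should work (a sketch, off the top of my head):

# run inside the pyspark shell; should print the Scala version bundled with Spark, e.g. 2.12.15
spark.sparkContext._jvm.scala.util.Properties.versionNumberString()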


Some more progress :) 

Initially, I get no error if I start PySpark with this command:

pyspark --packages com.cognite.spark.datasource:cdf-spark-datasource-fat_2.13:2.0.10

and no, I have 2.13.8:

> scala --version
Scala code runner version 2.13.8 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc

I appreciate the support so far @Håkon Trømborg 

But I still have problems loading the data. I am trying to load some datapoints with this command:

spark.read.format("cognite.spark.v1").option("type", "datapoints").option("apiKey", MY_API_KEY).option("project", MY_PROJECT).load() 

where MY_API_KEY and MY_PROJECT are actually clear-text values in my real command, to rule out any bugs.

Is this command correct?

The same command works in an Azure Databricks notebook, though, with the same values for apiKey and project.

The error message is very long, but here is a summary:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/python/pyspark/sql/readwriter.py", line 164, in load
return self._df(self._jreader.load())
File "/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o52.load.
: java.lang.NoClassDefFoundError: Could not initialize class cognite.spark.v1.Constants$

 

Any ideas? 

 


From the earlier warning it looks like your Spark installation is Spark 3.2.1 on Scala 2.12, and I suspect this might cause that type of issue, so I’d suggest downloading a Spark release built for Scala 2.13 as a start!


OK, it seems I needed to start PySpark with this package instead:

pyspark --packages com.cognite.spark.datasource:cdf-spark-datasource-fat_2.12:2.0.10

My system Scala installation is 2.13, but the apache-spark package installed via Homebrew bundles its own Scala (well, that’s what Brew does, I guess). After this, everything else works and I can download datapoints 🙂 Thanks again @Håkon Trømborg
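For anyone finding this later, the full working setup looks roughly like this (a sketch; MY_API_KEY and MY_PROJECT are placeholders for the clear-text values):

# started with: pyspark --packages com.cognite.spark.datasource:cdf-spark-datasource-fat_2.12:2.0.10
df = (spark.read.format("cognite.spark.v1")
    .option("type", "datapoints")
    .option("apiKey", MY_API_KEY)    # placeholder for the clear-text API key
    .option("project", MY_PROJECT)   # placeholder for the CDF project name
    .load())
df.printSchema()  # confirms the data source class loads and shows the datapoints schema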


Great! Thanks for asking the question too; it helped us discover some problems in our docs 🙂 Have fun Sparking!

