Hi!
Running our Spark data source with Spark set up locally should be fine, and if you’re able to run PySpark you should have access to the pyspark command! Our docs give you a helping hand here https://github.com/cognitedata/cdp-spark-datasource/#quickstart, but the command is simply this
./bin/pyspark --packages com.cognite.spark.datasource:cdf-spark-datasource_2.12:2.0.10
Make sure to run it from the right directory (the command here should work if run from the folder where you have Spark installed), and please check the Scala version of your local Spark installation and match it (I suggested 2.12 here, which is likely what you have)
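If you’re not sure which Scala version your local Spark was built with, a quick way to check (assuming spark-submit is on your PATH) is
spark-submit --version
which prints a "Using Scala version ..." line alongside the Spark version.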
Thanks for the reply @Håkon Trømborg . I got a dependency error when I ran this command. My Scala version is 2.13.8. I can install Scala 2.12, but where can I get a list of all com.cognite.spark.datasource releases?
https://github.com/cognitedata/cdp-spark-datasource/releases
Releases are here, but I see we haven’t been keeping that page fully up to date. Scala 2.13 should also be fine. Do you get an error when you use
./bin/pyspark --packages com.cognite.spark.datasource:cdf-spark-datasource_2.13:2.0.10
?
Yes, it does not find the dependency.
:: com.cognite.spark.datasource#cdf-spark-datasource_2.13;2.0.10: not found
is the top error message.
Does this help?
File "/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/python/pyspark/java_gateway.py", line 108, in launch_gateway
raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
I can otherwise use Python3, Scala, and Java on my Mac (M1)
Seems I’m also a little out of date here: we made a breaking change some time back and renamed the artifact. Does this work?
com.cognite.spark.datasource:cdf-spark-datasource-fat_2.13
Some progress here, but still an error. It finds the package now, but there are new complaints. First, some warnings that look suspicious
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
and then this error
:: loading settings :: url = jar:file:/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. The coordinate provided is: com.cognite.spark.datasource:cdf-spark-datasource-fat_2.13
maybe it wants ::Version?
Ah right, yes it wants the version (I didn’t paste the full coordinate in my latest comment), so please add :2.0.10.
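So the full coordinate becomes
com.cognite.spark.datasource:cdf-spark-datasource-fat_2.13:2.0.10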
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
This warning mentions spark-unsafe_2.12-3.2.1.jar; could it be that your local Spark installation is Spark 3.2.1 built for Scala 2.12?
Some more progress :)
Initially there’s no error if I start PySpark with this command
pyspark --packages com.cognite.spark.datasource:cdf-spark-datasource-fat_2.13:2.0.10
and no, my Scala version is 2.13.8:
>> scala --version
Scala code runner version 2.13.8 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc
I appreciate the support so far @Håkon Trømborg
But I still have problems loading the data. I am trying to load some datapoints with this command
spark.read.format("cognite.spark.v1").option("type", "datapoints").option("apiKey", MY_API_KEY).option("project", MY_PROJECT).load()
where MY_API_KEY and MY_PROJECT are placeholders; in the actual command I have them in clear text, to avoid any bugs.
Is this command correct?
The same command does work in an Azure Databricks notebook, though, with the same values for apiKey and project
The error message is very long, but here is a summary:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/python/pyspark/sql/readwriter.py", line 164, in load
return self._df(self._jreader.load())
File "/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/opt/homebrew/Cellar/apache-spark/3.2.1/libexec/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o52.load.
: java.lang.NoClassDefFoundError: Could not initialize class cognite.spark.v1.Constants$
Any ideas?
From the previous error it looks like your Spark installation is Spark 3.2.1 on Scala 2.12, and I suspect this might cause that type of issue, so as a start I’d suggest downloading a Spark release built for Scala 2.13!
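You can also verify this from inside a running pyspark session. Something like the snippet below should print the Scala version Spark itself was built with (it reaches into the JVM via py4j internals, so treat it as a quick diagnostic rather than a stable API):
# spark is the SparkSession that pyspark creates for you
print(spark.sparkContext._jvm.scala.util.Properties.versionString())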
Ok, it seemed like I needed to start PySpark with this package
pyspark --packages com.cognite.spark.datasource:cdf-spark-datasource-fat_2.12:2.0.10
My system Scala installation is 2.13, but the apache-spark package from Homebrew bundles its own Scala (2.12, judging by the jar names in the warnings above). After this everything else works and I can download datapoints. Thanks again @Håkon Trømborg
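For reference, the full sequence that works for me (credentials as placeholders here) is
pyspark --packages com.cognite.spark.datasource:cdf-spark-datasource-fat_2.12:2.0.10
and then, inside the session,
df = spark.read.format("cognite.spark.v1").option("type", "datapoints").option("apiKey", MY_API_KEY).option("project", MY_PROJECT).load()
df.printSchema()
where df.printSchema() is just a quick sanity check that the schema resolves.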
Great! Thanks for asking the question too; it helped us discover some problems in our docs. Have fun Sparking!