I was playing around with Apache Spark a couple of weeks back. For those of you who are not familiar with Apache Spark, it has really good documentation, which you should read here. You should also check out UC Berkeley’s paper on RDDs, which will help you gain a deeper understanding of how Spark works. Don’t be daunted by the paper; it’s actually a good read.
Here are a few things I jotted down while working on it.
- To assign a hostname to the master, you will need to create
conf/spark-env.sh
and set SPARK_MASTER_IP there. Basically, you just need to add this line (there’s a quick sketch of connecting to that master after these setup notes):
SPARK_MASTER_IP=local.paolo.com
- You’ll be able to view the status of your cluster at
http://localhost:8080/
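To tie those two notes together, here’s a minimal Scala sketch of an app connecting to that master. It assumes the local.paolo.com hostname from the config above and the default standalone master port (7077); the app name is made up. Once it connects, it shows up as a running application in the UI at http://localhost:8080/.
import org.apache.spark.{SparkConf, SparkContext}
// Sketch only: the app name is arbitrary, and the master URL uses the
// hostname set in conf/spark-env.sh plus the default standalone port 7077.
val conf = new SparkConf()
  .setAppName("MySparkTest")
  .setMaster("spark://local.paolo.com:7077")
val sc = new SparkContext(conf)
// Quick sanity check that the cluster is actually doing work.
println(sc.parallelize(1 to 100).sum())
sc.stop()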
I also encountered a few errors when I tried running my application. Here’s a list of the errors I stumbled on, together with the steps I took to fix them.
- Initial job has not accepted any resources.
- Check your cluster UI to ensure that workers are registered and have sufficient memory. I realized that I only had two cores I could connect to, and Spark Shell was already using them up. You can either update your config so your application asks for fewer resources (see the sketch below), or shut down the other application using the resources. Source
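If you’d rather not shut down the Spark shell, one option is to cap what your application requests. Here’s a sketch; spark.cores.max and spark.executor.memory are real Spark settings, but the values, app name, and master URL are just examples.
import org.apache.spark.SparkConf
// Sketch: cap this app's resource requests so it can coexist with spark-shell.
val conf = new SparkConf()
  .setAppName("SmallApp")
  .setMaster("spark://local.paolo.com:7077")
  .set("spark.cores.max", "1")          // don't claim every core in the cluster
  .set("spark.executor.memory", "512m") // keep executor memory requests modest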
- java.lang.IllegalStateException: unread block data.
- This is most likely caused by a versioning issue. In my case, my Mac was running a different version of Scala (v2.11) compared to the one my Spark download was built with (Scala v2.10). To make this work, I just changed my Mac’s Scala version (there’s also a build-side note after this fix). I did this through:
brew info scala
brew search scala
- to view the available taps
brew install homebrew/versions/scala210
- If you encounter this:
bash: scala: command not found
try running
brew link homebrew/versions/scala210
scala -version
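Side note: the same mismatch can also come from your build file rather than the installed Scala. If you happen to build with sbt (I used Maven for the fat jar below, so this is just an alternative sketch), pinning the Scala version to the 2.10.x line your Spark download was built for looks roughly like this; the exact version numbers are examples, not the ones from this post.
// build.sbt sketch: keep scalaVersion on the same line as the Spark binaries.
scalaVersion := "2.10.4"
// %% resolves to spark-core_2.10 when scalaVersion is 2.10.x.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"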
- org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass spark “Class not found”.
- I encountered this when I was trying to run my test application via Eclipse. I had a master running in the background, so I wanted my test to connect to it instead of spawning its own. To make this work, I had to create a fat jar containing all the dependencies. Using the Shade plugin from Maven will help fix the Akka issue too (see the sketch below). Source
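One way to point the already-running master’s workers at that fat jar from code looks roughly like this; setJars is a standard SparkConf call, but the master URL, app name, and jar path here are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
// Sketch: run from Eclipse against an already-running standalone master,
// shipping the shaded (fat) jar so workers can resolve the application classes.
val conf = new SparkConf()
  .setAppName("EclipseTest")
  .setMaster("spark://local.paolo.com:7077")
  .setJars(Seq("target/my-app-1.0-SNAPSHOT-shaded.jar"))
val sc = new SparkContext(conf)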
- ERROR ContextCleaner: Error in cleaning thread java.lang.InterruptedException. This can be safely ignored. Source
Hopefully, this will help someone out there who is also looking into Apache Spark.