I was playing around with Apache Spark a couple of weeks back. For those of you who are not familiar with Apache Spark, it has really good documentation, which you should read here. You should also check out UC Berkeley’s paper on RDDs, which will help you gain a deeper understanding of how Spark works. Don’t be daunted by the paper; it’s actually a good read.
Here are a few things I jotted down while working on it.
- To assign a hostname to the master, you will need to create
conf/spark-env.sh
and set SPARK_MASTER_IP there. Basically, you just need to add this line (there’s a quick sketch of connecting to that master after these setup notes):
SPARK_MASTER_IP=local.paolo.com
- You’ll be able to view the status of your cluster at
http://localhost:8080/
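To tie those two notes together, here’s a minimal Scala sketch of an app connecting to that master. It assumes the local.paolo.com hostname from the config above and the default standalone master port (7077); the app name is made up. Once it connects, it shows up as a running application in the UI at http://localhost:8080/.
import org.apache.spark.{SparkConf, SparkContext}
// Sketch only: the app name is arbitrary, and the master URL uses the
// hostname set in conf/spark-env.sh plus the default standalone port 7077.
val conf = new SparkConf()
  .setAppName("MySparkTest")
  .setMaster("spark://local.paolo.com:7077")
val sc = new SparkContext(conf)
// Quick sanity check that the cluster is actually doing work.
println(sc.parallelize(1 to 100).sum())
sc.stop()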
I also encountered a few errors when I tried running my application. Here’s a list of the errors I stumbled on, together with the steps I took to fix them.
- Initial job has not accepted any resources.
- Check your cluster UI to ensure that workers are registered and have sufficient memory. I realized that I only had two cores I could connect to, and Spark Shell was already using them up. You can either update your config so your application asks for fewer resources (see the sketch below), or shut down the other application using the resources. Source
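If you’d rather not shut down the Spark shell, one option is to cap what your application requests. Here’s a sketch; spark.cores.max and spark.executor.memory are real Spark settings, but the values, app name, and master URL are just examples.
import org.apache.spark.SparkConf
// Sketch: cap this app's resource requests so it can coexist with spark-shell.
val conf = new SparkConf()
  .setAppName("SmallApp")
  .setMaster("spark://local.paolo.com:7077")
  .set("spark.cores.max", "1")          // don't claim every core in the cluster
  .set("spark.executor.memory", "512m") // keep executor memory requests modest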
- java.lang.IllegalStateException: unread block data.
- This is most likely caused by a versioning issue. In my case, my Mac was running a different version of Scala (v2.11) compared to the one my Spark download was built with (Scala v2.10). To make this work, I just changed my Mac’s Scala version (there’s also a build-side note after this fix). I did this through:
brew info scala
brew search scala
- to view the available taps
brew install homebrew/versions/scala210
- If you encounter this:
bash: scala: command not found
try running
brew link homebrew/versions/scala210
scala -version
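Side note: the same mismatch can also come from your build file rather than the installed Scala. If you happen to build with sbt (I used Maven for the fat jar below, so this is just an alternative sketch), pinning the Scala version to the 2.10.x line your Spark download was built for looks roughly like this; the exact version numbers are examples, not the ones from this post.
// build.sbt sketch: keep scalaVersion on the same line as the Spark binaries.
scalaVersion := "2.10.4"
// %% resolves to spark-core_2.10 when scalaVersion is 2.10.x.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"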
- org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass spark “Class not found”.
- I encountered this when I was trying to run my test application via Eclipse. I had a master running in the background, so I wanted my test to connect to it instead of spawning its own. To make this work, I had to create a fat jar containing all the dependencies. Using the Shade plugin from Maven will help fix the Akka issue too (see the sketch below). Source
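One way to point the already-running master’s workers at that fat jar from code looks roughly like this; setJars is a standard SparkConf call, but the master URL, app name, and jar path here are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
// Sketch: run from Eclipse against an already-running standalone master,
// shipping the shaded (fat) jar so workers can resolve the application classes.
val conf = new SparkConf()
  .setAppName("EclipseTest")
  .setMaster("spark://local.paolo.com:7077")
  .setJars(Seq("target/my-app-1.0-SNAPSHOT-shaded.jar"))
val sc = new SparkContext(conf)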
- ERROR ContextCleaner: Error in cleaning thread java.lang.InterruptedException. This can be safely ignored. Source
Hopefully, this will help someone out there who is also looking into Apache Spark.