Spark is an open-source cluster computing system that provides primitives for in-memory computing and thus for certain tasks may be superior to a system like Hadoop, which has to keep going back and forth to disk. Spark is written in Scala and is, at the time of this writing, in Apache incubation. The documentation is comprehensive and the getting started instructions make it sound like the basics should be up and running in a few minutes without any effort. Of course it’s never quite that easy. Here is a detailed account of the problems I encountered getting the Spark demo to run and how I worked around them.
According to the instructions, you download the source and build it with the Simple Build Tool (sbt). This worked as advertised for me. The next step is to run an example program that estimates the value of π. The command line is ./run spark.examples.SparkPi. When I copied this into my terminal I got the following.

./run spark.examples.SparkPi
SCALA_HOME is not set
Fair enough: the documentation says that SCALA_HOME is required; it is not sufficient to have scala on your path. I do not have this environment variable set on my machine. Digging into the run script, there is a SPARK_LAUNCH_WITH_SCALA option which will infer the Scala home directory from the executable. Here is what happens if I set it.
SPARK_LAUNCH_WITH_SCALA=1 ./run spark.examples.SparkPi local
java.lang.ClassNotFoundException: scala.reflect.ClassManifest
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
...
At the time of this writing, Spark requires Scala version 2.9.3. It is not compatible with later versions of Scala. If you have Scala 2.9.3 on your machine, the above command should work. I, however, am using Homebrew to manage the installation of Scala on my Mac, and it has installed the latest version, which is 2.10.2, hence the error. The same thing happens if you point SCALA_HOME to a non-2.9.3 directory.
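If you are not sure which version a given Scala installation provides, you can also ask it from the REPL or a script rather than the command line; this one-liner is just a convenience check on my part, not something the Spark instructions call for.

// Prints the version of the Scala library on the classpath, e.g. "version 2.10.2".
// scala.util.Properties.versionString is available in both 2.9.x and 2.10.x.
println(scala.util.Properties.versionString)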
(By the way, if you point SCALA_HOME at a directory that doesn’t contain a Scala installation you see the following error.

SCALA_HOME=. ./run spark.examples.SparkPi local
Exception in thread "main" java.lang.NoClassDefFoundError: scala/ScalaObject
        at java.lang.ClassLoader.defineClass1(Native Method)
...
To recap, “java.lang.ClassNotFoundException: scala.reflect.ClassManifest” means incorrect Scala version and “java.lang.NoClassDefFoundError: scala/ScalaObject” means not a Scala directory.)
The solution is to install Scala 2.9.3. Multiple version support in Homebrew is tricky, so I decided the easiest thing to do was download 2.9.3 directly from the Scala website. I unzipped the tarball into a scala-2.9.3 directory alongside my Spark install.
SCALA_HOME=../scala-2.9.3/ ./run spark.examples.SparkPi
Usage: SparkPi <master> [<slices>]
That looks much better. The final step is to give it the command line option local, which runs things on the local machine.
SCALA_HOME=../scala-2.9.3/ ./run spark.examples.SparkPi local
...
Pi is roughly 3.13908
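For context on why the answer is only roughly 3.14: SparkPi estimates π by a Monte Carlo method, scattering random points over a square and counting how many land inside the inscribed circle, whose area is π/4 of the square’s. Here is a minimal sketch of that idea against the Scala API of this era; the object name PiSketch and the sample count are my own choices, and the bundled example differs in its details.

import spark.SparkContext

// Sketch of the Monte Carlo estimate: sample n random points in the square
// [-1, 1] x [-1, 1], count the fraction that fall inside the unit circle,
// and multiply that fraction by 4 to approximate pi.
object PiSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "PiSketch") // "local" = run on this machine
    val n = 100000                                 // number of random samples
    val inside = sc.parallelize(1 to n).map { _ =>
      val x = math.random * 2 - 1
      val y = math.random * 2 - 1
      if (x * x + y * y < 1) 1 else 0              // 1 if the point landed in the circle
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * inside / n)
    sc.stop()
  }
}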
Once Spark supports more recent versions of Scala, this trickiness should go away. Until then, these are snags to be aware of.