Spark FAQ

Is Spark a modified version of Hadoop?

No. Spark is a completely separate codebase optimized for low latency, although it can load data from any Hadoop input source (InputFormat).
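
For example, here is a minimal sketch of reading Hadoop-backed data, assuming the 0.7-era Scala API (where SparkContext lives in the spark package) and a placeholder HDFS path:

    import spark.SparkContext

    // "local[2]" runs Spark in-process with two threads; any master URL works.
    val sc = new SparkContext("local[2]", "HadoopInputExample")

    // textFile reads through Hadoop's TextInputFormat, so any URI Hadoop
    // understands (hdfs://, s3n://, or local paths) can be used here.
    val lines = sc.textFile("hdfs://namenode:9000/path/to/input")
    println(lines.count())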

Which languages does Spark support?

Starting in version 0.7, Spark supports Scala, Java and Python.

Does Spark require modified versions of Scala or Python?

No. Spark requires no changes to Scala and no compiler plugins. The Python API uses the standard CPython implementation and can call into existing C libraries for Python, such as NumPy.

What happens when a cached dataset does not fit in memory?

Spark can either spill the dataset to disk or recompute the partitions that don't fit in RAM each time they are requested. By default it recomputes them, but you can set a dataset's storage level to MEMORY_AND_DISK to spill to disk instead.
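
As a sketch (again assuming the 0.7-era Scala API, with StorageLevel in the spark.storage package and a placeholder input path):

    import spark.SparkContext
    import spark.storage.StorageLevel

    val sc = new SparkContext("local[2]", "StorageLevelExample")

    // MEMORY_AND_DISK keeps partitions in RAM when possible and spills the
    // rest to disk, rather than recomputing them on each access.
    val data = sc.textFile("input.txt")
    data.persist(StorageLevel.MEMORY_AND_DISK)

    println(data.count())  // first action materializes the cache
    println(data.count())  // served from memory and disk, not recomputed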

How can I run Spark on a cluster?

You can use either the standalone deploy mode, which depends only on Java, or the Apache Mesos cluster manager.

Note that you can also run Spark locally (possibly on multiple cores) without any special setup by just passing local[N] as the master URL, where N is the number of parallel threads you want.
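
For instance, the following sketch (assuming the 0.7-era Scala API) runs a small job locally on four threads:

    import spark.SparkContext

    // Passing local[4] as the master URL runs Spark in-process with four
    // parallel threads; no cluster setup is needed.
    val sc = new SparkContext("local[4]", "LocalExample")

    val squares = sc.parallelize(1 to 1000).map(x => x * x)
    println(squares.reduce(_ + _))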

I don't know Scala; how hard is it to pick it up to use Spark?

Scala itself is pretty easy to pick up if you have Java experience. Check out First Steps to Scala for a quick introduction, the Scala tutorial for Java programmers, or the free online book Programming in Scala.

If you'd rather not learn Scala, Spark 0.6 added a Java API and Spark 0.7 added a Python API, so you can use Spark from those languages instead.

What license is Spark under?

Starting in version 0.8, Spark will be under the Apache 2.0 license. Previous versions used the BSD license.

How can I contribute to Spark?

Contact the mailing list or send us a pull request on GitHub. We're glad to hear about your experience using Spark and to accept patches.

If you would like to report an issue, post it to the Spark issue tracker.

Where can I get more help?

Please post on the spark-users mailing list. We'll be glad to help!