---
layout: global
title: Examples
type: "page singular"
navigation:
  weight: 4
  show: true
---
Spark programs are built from two kinds of operations: transformations, which create a new dataset from an existing one (e.g., `map`, `filter`, and `join`), and actions, which force the computation of a dataset and return a result (e.g., `count`). The following examples show off some of the available operations and features.
The red code fragments are Scala function literals (closures) that get passed automatically to the cluster. The blue ones are Spark operations.
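A first example is a simple text search over log data. The sketch below runs locally over a Scala `Seq` standing in for a distributed dataset (the sample log lines are made up for illustration); on a real cluster the lines would come from `spark.textFile("hdfs://...")`, and Spark's RDD offers the same `filter` API, with `count()` as the action:

```scala
// Text search over log lines -- a local sketch of Spark's canonical first example.
// The filter closure is exactly the kind of function literal Spark ships to the cluster.
val lines = Seq(
  "INFO starting service",
  "ERROR disk failure",
  "WARN low memory",
  "ERROR network timeout"
)
val errors = lines.filter(line => line.contains("ERROR"))
val numErrors = errors.size // on an RDD, this action would be errors.count()
```

Further filters compose the same way, e.g. `errors.filter(line => line.contains("timeout"))` to narrow the search.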
Spark can cache datasets in memory to speed up reuse. In the example above, we can keep just the error messages in RAM by calling `errors.cache()`. After the first action that uses `errors`, later ones will be much faster.
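The effect of caching can be sketched locally with Scala's lazy views (an analogy, not Spark's implementation): a view re-runs the filter on every traversal, the way an uncached dataset is recomputed by every action, while a materialized collection, like a cached dataset, is computed once:

```scala
// Count how many times the filter predicate actually runs.
var filterRuns = 0
val lines = Seq("ERROR a", "INFO b", "ERROR c")
val lazyErrors = lines.view.filter { line => filterRuns += 1; line.contains("ERROR") }

lazyErrors.size                 // traverses: 3 predicate calls
lazyErrors.size                 // traverses again: 3 more calls
val cached = lazyErrors.toList  // like cache(): materialize once (3 more calls)
cached.size                     // free: no further predicate calls
cached.size                     // still free
```

After this runs, the predicate has executed 9 times; every use of `cached` afterwards costs nothing, which is the payoff the text describes.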
In this example, we use a few more transformations to build a dataset of (String, Int) pairs called `counts` and then save it to a file.
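This is the classic word-count pipeline. A local sketch of the same shape (sample lines invented for illustration): Spark would run it over an RDD, using `reduceByKey(_ + _)` where the local version groups and sums, and `counts.saveAsTextFile("hdfs://...")` as the final action:

```scala
// Word count: split lines into words, pair each word with 1, sum per word.
val lines = Seq("to be or not", "to be")
val counts = lines
  .flatMap(line => line.split(" "))  // one entry per word
  .map(word => (word, 1))            // (String, Int) pairs
  .groupBy(_._1)                     // Spark would use reduceByKey(_ + _) here
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// on an RDD: counts.saveAsTextFile("hdfs://...")
```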
Spark can also be used for compute-intensive tasks. This code estimates π by "throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1, 1)) and see how many fall in the unit circle. The fraction should be π / 4, so we use this to get our estimate.
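A local sketch of the dart-throwing estimate (sample count and seed are arbitrary choices): on a cluster, Spark would distribute the samples with `spark.parallelize(1 to numSamples)` and combine the hits with a `reduce`, but the per-sample logic is the same:

```scala
import scala.util.Random

// Monte Carlo estimate of pi: sample points in the unit square and count
// how many land inside the circle of radius 1. That fraction approaches pi/4.
val numSamples = 100000
val rng = new Random(42) // fixed seed so the sketch is repeatable
val inside = (1 to numSamples).count { _ =>
  val x = rng.nextDouble() // random point in the unit square
  val y = rng.nextDouble()
  x * x + y * y < 1        // inside the circle?
}
val piEstimate = 4.0 * inside / numSamples
```

With 100,000 samples the estimate typically lands within a few hundredths of π; more samples (or more workers, in the distributed version) tighten it further.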
This is an iterative machine learning algorithm that seeks to find the best hyperplane that separates two sets of points in a multi-dimensional feature space. It can be used to classify messages into spam vs non-spam, for example. Because the algorithm applies the same MapReduce operation repeatedly to the same dataset, it benefits greatly from caching the input data in RAM across iterations.
Note that `w` gets shipped automatically to the cluster with every `map` call.
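The iteration can be sketched locally as batch gradient descent on the logistic loss (the tiny hand-made dataset below is an assumption for illustration). On a cluster, `points` would be a cached RDD and each iteration one map/reduce job; the map closure captures the current `w`, which is how it gets shipped to the workers:

```scala
// Logistic regression by batch gradient descent -- a local sketch.
case class Point(x: Array[Double], y: Double) // label y is +1 or -1

// Two linearly separable clusters (illustrative data).
val points = Seq(
  Point(Array(1.0, 2.0), 1.0),
  Point(Array(2.0, 1.5), 1.0),
  Point(Array(-1.0, -2.0), -1.0),
  Point(Array(-2.0, -1.0), -1.0)
)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

var w = Array(0.0, 0.0) // initial separating plane
for (_ <- 1 to 100) {
  // Gradient of the logistic loss over all points.
  // In Spark: points.map(...).reduce(...) over the cached RDD, with w in the closure.
  val gradient = points
    .map { p =>
      val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      p.x.map(_ * scale)
    }
    .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}
// w now separates the clusters: dot(w, p.x) has the sign of p.y
```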
The graph below compares the performance of this Spark program against a Hadoop implementation on 30 GB of data on an 80-core cluster, showing the benefit of in-memory caching: