From 7bcb08cef5e6438ce8c8efa3da3a8f94f2a1fbf9 Mon Sep 17 00:00:00 2001
From: Matei Zaharia
Date: Thu, 27 Sep 2012 17:50:59 -0700
Subject: Renamed storage levels to something cleaner; fixes #223.

---
 docs/scala-programming-guide.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index 28e7bdd4c9..a370bf3ddc 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -231,30 +231,30 @@ One of the most important capabilities in Spark is *persisting* (or *caching*) a
 You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
 it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant --
 if any partition of an RDD is lost, it will automatically be recomputed using the transformations
 that originally created it.
 
-In addition, each RDD can be stored using a different *storage level*, allowing you, for example, to persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space), or even replicate it across nodes. These levels are chosen by passing a [`spark.storage.StorageLevel`]({{HOME_PATH}}api/core/index.html#spark.storage.StorageLevel) object to `persist()`. The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY_DESER` (store deserialized objects in memory). The complete set of available storage levels is:
+In addition, each RDD can be stored using a different *storage level*, allowing you, for example, to persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space), or even replicate it across nodes. These levels are chosen by passing a [`spark.storage.StorageLevel`]({{HOME_PATH}}api/core/index.html#spark.storage.StorageLevel) object to `persist()`. The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The complete set of available storage levels is:
 
 <table class="table">
-<tr><th style="width:23%">Storage Level</th><th>Meaning</th></tr>
+<tr><th style="width:23%">Storage Level</th><th>Meaning</th></tr>
 <tr>
-  <td> MEMORY_ONLY_DESER </td>
+  <td> MEMORY_ONLY </td>
   <td> Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will
     not be cached and will be recomputed on the fly each time they're needed. This is the default level. </td>
 </tr>
 <tr>
-  <td> DISK_AND_MEMORY_DESER </td>
+  <td> MEMORY_AND_DISK </td>
   <td> Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions
     that don't fit on disk, and read them from there when they're needed. </td>
 </tr>
 <tr>
-  <td> MEMORY_ONLY </td>
-  <td> Store RDD as serialized Java objects (that is, one byte array per partition).
+  <td> MEMORY_ONLY_SER </td>
+  <td> Store RDD as serialized Java objects (one byte array per partition).
     This is generally more space-efficient than deserialized objects, especially when using a
     fast serializer, but more CPU-intensive to read.
   </td>
 </tr>
 <tr>
-  <td> DISK_AND_MEMORY </td>
-  <td> Similar to MEMORY_ONLY, but spill partitions that don't fit in memory to disk instead of recomputing them
+  <td> MEMORY_AND_DISK_SER </td>
+  <td> Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them
     on the fly each time they're needed. </td>
 </tr>
 <tr>
@@ -262,8 +262,8 @@ In addition, each RDD can be stored using a different *storage level*, allowing
   <td> Store the RDD partitions only on disk. </td>
 </tr>
 <tr>
-  <td> MEMORY_ONLY_DESER_2 / DISK_AND_MEMORY_DESER_2 / MEMORY_ONLY_2 / DISK_ONLY_2 / DISK_AND_MEMORY_2 </td>
-  <td> Same as the levels above, but replicate each partition on two nodes. </td>
+  <td> MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. </td>
+  <td> Same as the levels above, but replicate each partition on two cluster nodes. </td>
 </tr>
 </table>
 
@@ -272,9 +272,9 @@ In addition, each RDD can be stored using a different *storage level*, allowing
 Spark's storage levels are meant to provide different tradeoffs between memory usage and CPU
 efficiency. We recommend going through the following process to select one:
 
-* If your RDDs fit comfortably with the default storage level (`MEMORY_ONLY_DESER`), leave them that way. This is the most
+* If your RDDs fit comfortably with the default storage level (`MEMORY_ONLY`), leave them that way. This is the most
   CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
-* If not, try using `MEMORY_ONLY` and [selecting a fast serialization library]({{HOME_PATH}}tuning.html) to make the objects
+* If not, try using `MEMORY_ONLY_SER` and [selecting a fast serialization library]({{HOME_PATH}}tuning.html) to make the objects
   much more space-efficient, but still reasonably fast to access.
 * Don't spill to disk unless the functions that computed your datasets are expensive, or they filter
   a large amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
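For context, a minimal sketch of how these storage levels are used from the Scala API this guide documents (0.6-era classes in the `spark` package; the local master, app name, and input path are illustrative assumptions, not part of the patch):

```scala
import spark.SparkContext
import spark.storage.StorageLevel

// Placeholder setup: a local master and a small input file.
val sc = new SparkContext("local", "storage-levels-example")
val words = sc.textFile("README.md").flatMap(_.split(" "))

// cache() is shorthand for persist() at the default storage level,
// which this patch renames from MEMORY_ONLY_DESER to MEMORY_ONLY.
words.cache()

// To pick a non-default level, pass a StorageLevel to persist() --
// here the serialized, disk-spilling level (MEMORY_AND_DISK_SER under
// the new names; DISK_AND_MEMORY before this patch).
val pairs = words.map(w => (w, 1))
pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)

println(pairs.count()) // the first action computes and persists the RDDs
```

Spark does not allow re-assigning an RDD's storage level once it has been set, which is why the sketch chooses a level per RDD up front rather than calling both `cache()` and `persist(...)` on the same RDD.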