author     Reynold Xin <rxin@databricks.com>   2016-02-26 22:35:12 -0800
committer  Reynold Xin <rxin@databricks.com>   2016-02-26 22:35:12 -0800
commit     59e3e10be2f9a1c53979ca72c038adb4fa17ca64
tree       3d6b2246738484273d36d0ccbec66b733930a3e0 /docs/programming-guide.md
parent     f77dc4e1e202942aa8393fb5d8f492863973fe17
[SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts
## What changes were proposed in this pull request?
Spark ships a very limited set of cluster management scripts for Tachyon, even though Tachyon itself provides a much better version of them. Since Spark users can now simply use Tachyon as a normal file system, without extensive configuration, we can remove these management capabilities to simplify Spark's bash scripts.
Note that this also reduces coupling between a third-party external system and Spark's release scripts, and eliminates the possibility of failures caused by, for example, Tachyon being renamed or its tarballs being relocated.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes #11400 from rxin/release-script.
Diffstat (limited to 'docs/programming-guide.md')
-rw-r--r--  docs/programming-guide.md  22
1 file changed, 2 insertions(+), 20 deletions(-)
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 5ebafa40b0..2f0ed5eca2 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1177,7 +1177,7 @@ that originally created it.
 In addition, each persisted RDD can be stored using a different *storage level*, allowing
 you, for example, to persist the dataset on disk, persist it in memory but as serialized
 Java objects (to save space),
-replicate it across nodes, or store it off-heap in [Tachyon](http://tachyon-project.org/).
+replicate it across nodes.
 These levels are set by passing a
 `StorageLevel` object ([Scala](api/scala/index.html#org.apache.spark.storage.StorageLevel),
 [Java](api/java/index.html?org/apache/spark/storage/StorageLevel.html),
@@ -1218,24 +1218,11 @@ storage levels is:
   <td> MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. </td>
   <td> Same as the levels above, but replicate each partition on two cluster nodes. </td>
 </tr>
-<tr>
-  <td> OFF_HEAP (experimental) </td>
-  <td> Store RDD in serialized format in <a href="http://tachyon-project.org">Tachyon</a>.
-  Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors
-  to be smaller and to share a pool of memory, making it attractive in environments with
-  large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon,
-  the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory
-  in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts
-  from memory. If you plan to use Tachyon as the off heap store, Spark is compatible with Tachyon
-  out-of-the-box. Please refer to this <a href="http://tachyon-project.org/master/Running-Spark-on-Tachyon.html">page</a>
-  for the suggested version pairings.
-  </td>
-</tr>
 </table>
 
 **Note:** *In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html)
 library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`,
-`MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.*
+`MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, and `DISK_ONLY_2`.*
 
 Spark also automatically persists some intermediate data in shuffle operations (e.g. `reduceByKey`), even without users calling `persist`. This is done
 to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call `persist` on the resulting RDD if they plan to reuse it.
@@ -1259,11 +1246,6 @@ requests from a web application). *All* the storage levels provide full fault
 tolerance by recomputing lost data, but the replicated ones let you continue
 running tasks on the RDD without waiting to recompute a lost partition.
 
-* In environments with high amounts of memory or multiple applications, the experimental `OFF_HEAP`
-mode has several advantages:
-  * It allows multiple executors to share the same pool of memory in Tachyon.
-  * It significantly reduces garbage collection costs.
-  * Cached data is not lost if individual executors crash.
-
 ### Removing Data
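With the Tachyon-backed `OFF_HEAP` level gone, the remaining storage levels in the table above are all chosen the same way: by passing a `StorageLevel` to `persist`. A minimal sketch in Scala, assuming a running Spark environment with a `SparkContext` named `sc` in scope (e.g. the `spark-shell`) and a placeholder input path `data.txt`:

```scala
import org.apache.spark.storage.StorageLevel

// `data.txt` is a hypothetical input file used only for illustration.
val lines = sc.textFile("data.txt")
val lengths = lines.map(_.length)

// Keep a serialized in-memory copy, spill to disk when it does not fit,
// and replicate each partition on two nodes (the "_2" suffix from the table).
lengths.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

lengths.count()     // the first action materializes and caches the RDD
lengths.unpersist() // drop the cached copies when no longer needed
```

This sketch is not runnable standalone; it needs a Spark cluster or shell session to supply `sc`.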