author    Haoyuan Li <haoyuan@cs.berkeley.edu>  2014-04-04 20:36:24 -0700
committer Patrick Wendell <pwendell@gmail.com>  2014-04-04 20:38:20 -0700
commit    b50ddfde0342990979979e58348f54c10b500c90 (patch)
tree      cc7fa4d089375cded5056d9a93079e0b23a32ae7 /docs/configuration.md
parent    1347ebd4b52ffb9197fc4137a55dff6badb149ba (diff)
SPARK-1305: Support persisting RDDs directly to Tachyon

Moves PR #468 from apache-incubator-spark to apache-spark: "Adding an option to persist Spark RDD blocks into Tachyon."

Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
Author: RongGu <gurongwalker@gmail.com>

Closes #158 from RongGu/master and squashes the following commits:

72b7768 [Haoyuan Li] merge master
9f7fa1b [Haoyuan Li] fix code style
ae7834b [Haoyuan Li] minor cleanup
a8b3ec6 [Haoyuan Li] merge master branch
e0f4891 [Haoyuan Li] better check offheap.
55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
7cd4600 [RongGu] remove some logic code for tachyonstore's replication
51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
8adfcfa [RongGu] address aaron's comment on inTachyonSize
120e48a [RongGu] changed the root-level dir name in Tachyon
5cc041c [Haoyuan Li] address aaron's comments
9b97935 [Haoyuan Li] address aaron's comments
d9a6438 [Haoyuan Li] fix for pspark
77d2703 [Haoyuan Li] change python api.git status
3dcace4 [Haoyuan Li] address matei's comments
91fa09d [Haoyuan Li] address patrick's comments
589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
64348b2 [Haoyuan Li] update conf docs.
ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
be79d77 [RongGu] find a way to clean up some unnecessary methods and classes to make the code simpler
49cc724 [Haoyuan Li] update docs with off_heap option
4572f9f [RongGu] reserving the old apply function API of StorageLevel
04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
c9aeabf [RongGu] rename the StorageLevel.TACHYON as StorageLevel.OFF_HEAP
76805aa [RongGu] unify the config property name prefix; add the configs into docs/configuration.md
e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon; minor code style fix
939e467 [Haoyuan Li] 0.4.1-thrift from maven central
86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo, but jenkins failed to download it. temporarily revert it back to 0.4.1
16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
d827250 [RongGu] fix JsonProtocolSuite test failure
716e93b [Haoyuan Li] revert the version
ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
2825a13 [RongGu] up-merging to the current master branch of the apache spark
6a22c1a [Haoyuan Li] fix scalastyle
8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
1dcadf9 [Haoyuan Li] typo
bf278fa [Haoyuan Li] fix python tests
e82909c [Haoyuan Li] minor cleanup
776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
8859371 [Haoyuan Li] various minor fixes and clean up
e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
fcaeab2 [Haoyuan Li] address Aaron's comment
e554b1e [Haoyuan Li] add python code
47304b3 [Haoyuan Li] make tachyonStore in BlockManager lazy val; add more comments on StorageLevels.
dc8ef24 [Haoyuan Li] add old storelevel constructor
e01a271 [Haoyuan Li] update tachyon 0.4.1
8011a96 [RongGu] fix a brought-in mistake in StorageLevel
70ca182 [RongGu] a bit change in comment
556978b [RongGu] fix the scalastyle errors
791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
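A rough sketch of what this patch enables, e.g. from spark-shell. The app name and master here are placeholders; `StorageLevel.OFF_HEAP` is the storage level this PR introduces (blocks are written to the TachyonStore with TRY_CACHE semantics, per the commit log):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Placeholder app name and master for illustration only.
val conf = new SparkConf().setAppName("TachyonExample").setMaster("local[2]")
val sc = new SparkContext(conf)

// With this patch, OFF_HEAP blocks go to the Tachyon-backed store
// instead of the JVM heap.
val rdd = sc.parallelize(1 to 1000).persist(StorageLevel.OFF_HEAP)
println(rdd.count())
```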
Diffstat (limited to 'docs/configuration.md')
-rw-r--r--  docs/configuration.md  39
1 file changed, 27 insertions, 12 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index 1ff0150567..b6005acac8 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -123,6 +123,21 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td>spark.tachyonStore.baseDir</td>
+ <td>System.getProperty("java.io.tmpdir")</td>
+ <td>
+    Base directory in the Tachyon file system where RDD blocks are stored. The Tachyon file system's URL is set by <code>spark.tachyonStore.url</code>.
+    This can also be a comma-separated list of multiple directories on the Tachyon file system.
+ </td>
+</tr>
+<tr>
+ <td>spark.tachyonStore.url</td>
+ <td>tachyon://localhost:19998</td>
+ <td>
+    The URL of the underlying Tachyon file system used by the TachyonStore.
+ </td>
+</tr>
+<tr>
<td>spark.mesos.coarse</td>
<td>false</td>
<td>
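The two Tachyon properties added above would typically be set programmatically. A minimal sketch, in which the Tachyon master host and directory names are placeholders:

```scala
import org.apache.spark.SparkConf

// Point the TachyonStore at a non-default Tachyon master and base directory.
// The property names come from the table rows added above.
val conf = new SparkConf()
  .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998")
  .set("spark.tachyonStore.baseDir", "/spark/rdds1,/spark/rdds2") // comma-separated list allowed
```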
@@ -161,13 +176,13 @@ Apart from these, the following properties are also available, and may be useful
<td>spark.ui.acls.enable</td>
<td>false</td>
<td>
- Whether spark web ui acls should are enabled. If enabled, this checks to see if the user has
+  Whether Spark web UI ACLs are enabled. If enabled, this checks to see if the user has
access permissions to view the web ui. See <code>spark.ui.view.acls</code> for more details.
Also note this requires the user to be known, if the user comes across as null no checks
are done. Filters can be used to authenticate and set the user.
</td>
</tr>
-<tr>
+<tr>
<td>spark.ui.view.acls</td>
<td>Empty</td>
<td>
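A sketch of how these two ACL properties fit together (the user names are hypothetical):

```scala
import org.apache.spark.SparkConf

// Enable web UI ACLs and whitelist the users allowed to view the UI.
val conf = new SparkConf()
  .set("spark.ui.acls.enable", "true")
  .set("spark.ui.view.acls", "alice,bob") // comma-separated user names
```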
@@ -276,10 +291,10 @@ Apart from these, the following properties are also available, and may be useful
<td>spark.serializer.objectStreamReset</td>
<td>10000</td>
<td>
- When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
- objects to prevent writing redundant data, however that stops garbage collection of those
- objects. By calling 'reset' you flush that info from the serializer, and allow old
- objects to be collected. To turn off this periodic reset set it to a value of <= 0.
+  When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
+  objects to prevent writing redundant data; however, that stops garbage collection of those
+  objects. By calling 'reset' you flush that info from the serializer, allowing old
+  objects to be collected. To turn off this periodic reset, set it to a value of <= 0.
By default it will reset the serializer every 10,000 objects.
</td>
</tr>
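As a sketch of the trade-off described above, resetting more often than the 10,000-object default trades some redundant bytes on the wire for earlier garbage collection of cached objects (the value 1000 is illustrative):

```scala
import org.apache.spark.SparkConf

// Reset the JavaSerializer's object cache every 1,000 objects
// instead of the default 10,000.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  .set("spark.serializer.objectStreamReset", "1000")
```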
@@ -375,7 +390,7 @@ Apart from these, the following properties are also available, and may be useful
<td>spark.akka.heartbeat.interval</td>
<td>1000</td>
<td>
- This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using failure detector can be, a sensistive failure detector can help evict rogue executors really quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the network with those.
+    This is set to a larger value to disable the failure detector built into Akka. It can be enabled again if you plan to use this feature (not recommended). A larger interval value in seconds reduces network overhead, while a smaller value (~1 s) might be more informative for Akka's failure detector. Tune this in combination with `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. The only positive use case for the failure detector is that a sensitive failure detector can help evict rogue executors quickly; however, this is usually not the case, as GC pauses and network lags are expected in a real Spark cluster. Apart from that, enabling it leads to a large number of heartbeat exchanges between nodes, flooding the network.
</td>
</tr>
<tr>
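A sketch of re-enabling the failure detector as discussed above; the values are illustrative only, not recommended settings, and most clusters should leave these properties alone:

```scala
import org.apache.spark.SparkConf

// A short interval makes Akka's failure detector responsive; per the docs
// above, tune it together with the pause tolerance.
val conf = new SparkConf()
  .set("spark.akka.heartbeat.interval", "1") // seconds; illustrative
  .set("spark.akka.heartbeat.pauses", "60")  // seconds; illustrative
```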
@@ -430,7 +445,7 @@ Apart from these, the following properties are also available, and may be useful
<td>spark.broadcast.blockSize</td>
<td>4096</td>
<td>
- Size of each piece of a block in kilobytes for <code>TorrentBroadcastFactory</code>.
+ Size of each piece of a block in kilobytes for <code>TorrentBroadcastFactory</code>.
Too large a value decreases parallelism during broadcast (makes it slower); however, if it is too small, <code>BlockManager</code> might take a performance hit.
</td>
</tr>
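To make the parallelism trade-off above concrete, a sketch that halves the block size from the 4096 KB default (the factory class name is the one the table references):

```scala
import org.apache.spark.SparkConf

// Smaller pieces increase parallelism during broadcast, at the cost of
// more BlockManager overhead.
val conf = new SparkConf()
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")
  .set("spark.broadcast.blockSize", "2048") // kilobytes
```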
@@ -555,7 +570,7 @@ Apart from these, the following properties are also available, and may be useful
the driver.
</td>
</tr>
-<tr>
+<tr>
<td>spark.authenticate</td>
<td>false</td>
<td>
@@ -563,7 +578,7 @@ Apart from these, the following properties are also available, and may be useful
running on Yarn.
</td>
</tr>
-<tr>
+<tr>
<td>spark.authenticate.secret</td>
<td>None</td>
<td>
@@ -571,12 +586,12 @@ Apart from these, the following properties are also available, and may be useful
not running on Yarn and authentication is enabled.
</td>
</tr>
-<tr>
+<tr>
<td>spark.core.connection.auth.wait.timeout</td>
<td>30</td>
<td>
Number of seconds for the connection to wait for authentication to occur before timing
- out and giving up.
+ out and giving up.
</td>
</tr>
<tr>
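Finally, a sketch tying the three authentication properties in this hunk together for a non-YARN deployment, where the shared secret must be set explicitly (the secret value and timeout are placeholders):

```scala
import org.apache.spark.SparkConf

// Enable shared-secret authentication between Spark components.
// On YARN the secret is generated automatically; elsewhere it must be set.
val conf = new SparkConf()
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "change-me")              // placeholder secret
  .set("spark.core.connection.auth.wait.timeout", "60")       // seconds; default is 30
```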