author     gatorsmile <gatorsmile@gmail.com>           2015-12-18 20:06:05 -0800
committer  Davies Liu <davies.liu@gmail.com>           2015-12-18 20:06:05 -0800
commit     499ac3e69a102f9b10a1d7e14382fa191516f7b5 (patch)
tree       dcb84cda0ddb75094a39946dfd09f0ed29dd058c /docs
parent     a78a91f4d7239c14bd5d0b18cdc87d55594a8d8a (diff)
[SPARK-12091] [PYSPARK] Deprecate the JAVA-specific deserialized storage levels
The current default storage level of the Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official documentation and the RDD APIs.
@davies Is this inconsistency intentional? Thanks!
Updates: Since the data is always serialized on the Python side, the Java-specific deserialized storage levels, such as MEMORY_ONLY, are not removed.
Updates (based on the reviewers' feedback): In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
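For illustration, here is a minimal PySpark sketch (not part of this patch; the local master, app name, and toy data are placeholders) showing how one of these Python storage levels is passed to `persist`:

```python
# Minimal sketch: persisting an RDD with one of the Python storage levels.
# Assumes a local pyspark installation; app name and data are placeholders.
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "persist-example")
rdd = sc.parallelize(range(1000))

# In Python the data is pickled (serialized) regardless of the level chosen,
# so MEMORY_AND_DISK stores serialized partitions and spills them to disk
# when they do not fit in memory.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.count())  # the first action materializes and caches the partitions
rdd.unpersist()
sc.stop()
```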
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10092 from gatorsmile/persistStorageLevel.
Diffstat (limited to 'docs')
-rw-r--r--  docs/configuration.md      |  7
-rw-r--r--  docs/programming-guide.md  | 10
2 files changed, 10 insertions, 7 deletions
```diff
diff --git a/docs/configuration.md b/docs/configuration.md
index 38d3d059f9..85e7d1202d 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -687,9 +687,10 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.rdd.compress</code></td>
   <td>false</td>
   <td>
-    Whether to compress serialized RDD partitions (e.g. for
-    <code>StorageLevel.MEMORY_ONLY_SER</code>). Can save substantial space at the cost of some
-    extra CPU time.
+    Whether to compress serialized RDD partitions (e.g. for
+    <code>StorageLevel.MEMORY_ONLY_SER</code> in Java
+    and Scala or <code>StorageLevel.MEMORY_ONLY</code> in Python).
+    Can save substantial space at the cost of some extra CPU time.
   </td>
 </tr>
 <tr>
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index f823b89a4b..c5e2a1cd7b 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1196,14 +1196,14 @@ storage levels is:
     partitions that don't fit on disk, and read them from there when they're needed. </td>
 </tr>
 <tr>
-  <td> MEMORY_ONLY_SER </td>
+  <td> MEMORY_ONLY_SER <br /> (Java and Scala) </td>
   <td> Store RDD as <i>serialized</i> Java objects (one byte array per partition).
     This is generally more space-efficient than deserialized objects, especially when using a
     <a href="tuning.html">fast serializer</a>, but more CPU-intensive to read.
   </td>
 </tr>
 <tr>
-  <td> MEMORY_AND_DISK_SER </td>
+  <td> MEMORY_AND_DISK_SER <br /> (Java and Scala) </td>
   <td> Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to
     disk instead of recomputing them on the fly each time they're needed. </td>
 </tr>
@@ -1230,7 +1230,9 @@ storage levels is:
 </tr>
 </table>
 
-**Note:** *In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level.*
+**Note:** *In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library,
+so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`,
+`MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.*
 
 Spark also automatically persists some intermediate data in shuffle operations (e.g. `reduceByKey`), even without
 users calling `persist`. This is done to avoid recomputing the entire input if a node fails during the
 shuffle. We still recommend users call `persist` on the resulting RDD if they plan to reuse it.
@@ -1243,7 +1245,7 @@ efficiency. We recommend going through the following process to select one:
 This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
 
 * If not, try using `MEMORY_ONLY_SER` and [selecting a fast serialization library](tuning.html) to
-make the objects much more space-efficient, but still reasonably fast to access.
+make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
 
 * Don't spill to disk unless the functions that computed your datasets are expensive, or they filter
 a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from
```
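As a companion to the `spark.rdd.compress` wording changed above, here is a minimal hedged sketch (master, app name, and data are placeholders, not part of the patch) of enabling the flag from Python, where the serialized level is `MEMORY_ONLY`:

```python
# Minimal sketch of the spark.rdd.compress setting discussed in the diff.
# Master, app name, and data are placeholders, not part of the patch.
from pyspark import SparkConf, SparkContext, StorageLevel

conf = (SparkConf()
        .setMaster("local")
        .setAppName("rdd-compress-example")
        .set("spark.rdd.compress", "true"))  # default is false
sc = SparkContext(conf=conf)

# With compression on, the pickled MEMORY_ONLY partitions are also
# compressed, saving space at the cost of some extra CPU time.
rdd = sc.parallelize(range(100000)).persist(StorageLevel.MEMORY_ONLY)
print(rdd.count())
sc.stop()
```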