diff options
| author | Davies Liu <davies@databricks.com> | 2014-11-20 16:40:25 -0800 |
|---|---|---|
| committer | Xiangrui Meng <meng@databricks.com> | 2014-11-20 16:40:25 -0800 |
| commit | d39f2e9c683a4ab78b29eb3c5668325bf8568e8c (patch) | |
| tree | 7ee1e4035b25c5f9daf69a72641e512ace12afcd /python/pyspark/rdd.py | |
| parent | ad5f1f3ca240473261162c06ffc5aa70d15a5991 (diff) | |
| download | spark-d39f2e9c683a4ab78b29eb3c5668325bf8568e8c.tar.gz spark-d39f2e9c683a4ab78b29eb3c5668325bf8568e8c.tar.bz2 spark-d39f2e9c683a4ab78b29eb3c5668325bf8568e8c.zip |
[SPARK-4477] [PySpark] remove numpy from RDDSampler
In RDDSampler, numpy is used to try to get better performance for poisson(), but the number of calls to random() is only (1 + fraction) * N in a pure-Python implementation of poisson(), so there is not much performance gain from numpy.
numpy is not a dependency of PySpark, so relying on it can introduce problems, such as numpy being installed on the master but not on the slaves, as reported in SPARK-927.
It also complicates the code a lot, so we should remove numpy from RDDSampler.
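The call-count argument above can be illustrated with a pure-Python Poisson sampler based on Knuth's algorithm (a minimal sketch with illustrative names, not the actual RDDSampler code): for a sampling fraction `f`, each element draws `poisson(f)` copies, and the expected number of `random()` calls per element is `f + 1`, hence roughly `(1 + fraction) * N` in total.

```python
import math
import random

def poisson(mean):
    # Knuth's algorithm: multiply uniform draws until the running
    # product falls below exp(-mean); the number of extra draws
    # needed is the Poisson sample. Expected draws = mean + 1.
    threshold = math.exp(-mean)
    k = 0
    p = 1.0
    while True:
        p *= random.random()
        if p < threshold:
            return k
        k += 1

random.seed(42)
samples = [poisson(0.9) for _ in range(100000)]
print(sum(samples) / len(samples))  # close to the mean, 0.9
```

Since each `poisson(f)` call costs only about `f + 1` calls to `random()`, vectorizing it with `numpy.random` saves little per-element work, which matches the benchmark below.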
I also ran a benchmark to verify this:
```
>>> from pyspark.mllib.random import RandomRDDs
>>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
>>> rdd.count() # cache it
>>> rdd.sample(True, 0.9).count() # measure this line
```
the results:

| withReplacement | random | numpy.random |
|---|---|---|
| True | 1.5 s | 1.4 s |
| False | 0.6 s | 0.8 s |
closes #2313
Note: this patch includes some commits that are not yet mirrored to GitHub; it will be OK after the mirror catches up.
Author: Davies Liu <davies@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #3351 from davies/numpy and squashes the following commits:
5c438d7 [Davies Liu] fix comment
c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
98eb31b [Xiangrui Meng] make poisson sampling slightly faster
ee17d78 [Davies Liu] remove = for float
13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
f583023 [Davies Liu] fix tests
51649f5 [Davies Liu] remove numpy in RDDSampler
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
Diffstat (limited to 'python/pyspark/rdd.py')
-rw-r--r-- | python/pyspark/rdd.py | 10 |
1 file changed, 6 insertions, 4 deletions
```diff
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 50535d2711..57754776fa 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -310,8 +310,11 @@ class RDD(object):
     def sample(self, withReplacement, fraction, seed=None):
         """
-        Return a sampled subset of this RDD (relies on numpy and falls back
-        on default random generator if numpy is unavailable).
+        Return a sampled subset of this RDD.
+
+        >>> rdd = sc.parallelize(range(100), 4)
+        >>> rdd.sample(False, 0.1, 81).count()
+        10
         """
         assert fraction >= 0.0, "Negative fraction value: %s" % fraction
         return self.mapPartitionsWithIndex(RDDSampler(withReplacement, fraction, seed).func, True)
@@ -343,8 +346,7 @@ class RDD(object):
     # this is ported from scala/spark/RDD.scala
     def takeSample(self, withReplacement, num, seed=None):
         """
-        Return a fixed-size sampled subset of this RDD (currently requires
-        numpy).
+        Return a fixed-size sampled subset of this RDD.

         >>> rdd = sc.parallelize(range(0, 10))
         >>> len(rdd.takeSample(True, 20, 1))
```