author | Davies Liu <davies@databricks.com> | 2014-11-20 16:40:25 -0800 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2014-11-20 16:40:25 -0800 |
commit | d39f2e9c683a4ab78b29eb3c5668325bf8568e8c (patch) | |
tree | 7ee1e4035b25c5f9daf69a72641e512ace12afcd /sql/hive/v0.13.1/src/main | |
parent | ad5f1f3ca240473261162c06ffc5aa70d15a5991 (diff) | |
download | spark-d39f2e9c683a4ab78b29eb3c5668325bf8568e8c.tar.gz spark-d39f2e9c683a4ab78b29eb3c5668325bf8568e8c.tar.bz2 spark-d39f2e9c683a4ab78b29eb3c5668325bf8568e8c.zip |
[SPARK-4477] [PySpark] remove numpy from RDDSampler
RDDSampler tries to use numpy to get better performance for poisson(), but the pure Python implementation of poisson() calls random() only (1 + fraction) * N times, so numpy brings little performance gain.
numpy is not a dependency of pyspark, so relying on it may introduce problems, for example when numpy is installed on the master but not on the slaves, as reported in SPARK-927.
It also complicates the code a lot, so we should remove numpy from RDDSampler.
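To illustrate the (1 + fraction) * N call count, here is a minimal sketch of Poisson sampling in pure Python using Knuth's multiplication method. It is not the code from this patch: `poisson_sample` and `sample_with_replacement` are hypothetical names chosen for illustration, and the call counter exists only to make the cost visible.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw one Poisson(mean) variate with Knuth's multiplication method.

    Returns (k, calls), where calls is the number of rng.random() calls used.
    Each draw costs exactly k + 1 calls, so the expected cost is 1 + mean.
    """
    limit = math.exp(-mean)
    k = 0
    p = rng.random()
    calls = 1
    while p > limit:
        k += 1
        p *= rng.random()
        calls += 1
    return k, calls


def sample_with_replacement(items, fraction, seed=42):
    """Repeat each item a Poisson(fraction) number of times (hypothetical helper)."""
    rng = random.Random(seed)
    sampled = []
    total_calls = 0
    for item in items:
        k, calls = poisson_sample(fraction, rng)
        total_calls += calls
        sampled.extend([item] * k)
    return sampled, total_calls


if __name__ == "__main__":
    n = 10000
    sampled, total_calls = sample_with_replacement(range(n), 0.9)
    # total_calls equals n + len(sampled), which is about (1 + 0.9) * n
    print(len(sampled), total_calls)
```

Because every element needs one call plus one per emitted copy, the total is N plus the sample size, which is about (1 + fraction) * N. That leaves little for a vectorized numpy call to save.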
I also ran a benchmark to verify this:
```
>>> from pyspark.mllib.random import RandomRDDs
>>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
>>> rdd.count() # cache it
>>> rdd.sample(True, 0.9).count() # measure this line
```
The results (time for the measured line):
| withReplacement | random | numpy.random |
| --- | --- | --- |
| True | 1.5 s | 1.4 s |
| False | 0.6 s | 0.8 s |
closes #2313
Note: this patch includes some commits that are not yet mirrored to GitHub; it will be OK once the mirror catches up.
Author: Davies Liu <davies@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #3351 from davies/numpy and squashes the following commits:
5c438d7 [Davies Liu] fix comment
c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
98eb31b [Xiangrui Meng] make poisson sampling slightly faster
ee17d78 [Davies Liu] remove = for float
13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
f583023 [Davies Liu] fix tests
51649f5 [Davies Liu] remove numpy in RDDSampler
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
Diffstat (limited to 'sql/hive/v0.13.1/src/main')
0 files changed, 0 insertions, 0 deletions