author | Doris Xin <doris.s.xin@gmail.com> | 2014-06-12 19:44:27 -0700 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2014-06-12 19:44:27 -0700 |
commit | 1de1d703bf6b7ca14f7b40bbefe9bf6fd6c8ce47 (patch) | |
tree | f99459c7412db3dd9479037c41e5a4055853ae09 /project | |
parent | 0154587ab71d1b864f97497dbb38bc52b87675be (diff) | |
SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes a sampling rate greater than sample_size/total, so that the sample is large enough with success rate >= 0.9999. Added a unit test for the private method to validate the choice of sampling rate.
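The idea of choosing a sampling rate slightly above sample_size/total can be sketched as follows, in Python since the commit also updates pyspark's takeSample. This is a minimal illustration under stated assumptions, not the code in this commit: the function name `sampling_fraction` and its parameters are hypothetical, and the rate is derived from the standard multiplicative Chernoff lower-tail bound for Bernoulli sampling without replacement, with failure probability `delta = 1e-4` to match the 0.9999 success rate mentioned above.

```python
import math


def sampling_fraction(sample_size, total, delta=1e-4):
    """Hypothetical helper: pick a Bernoulli sampling rate q > sample_size/total.

    Assumption: each of the `total` items is kept independently with
    probability q, so the sample size X is Binomial(total, q). The Chernoff
    lower-tail bound gives
        P(X < sample_size) <= exp(-total * (q - p)**2 / (2 * q)),
    with p = sample_size / total. Requiring the right-hand side to be at most
    `delta` and solving for the smallest q yields the closed form below.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    p = sample_size / total
    gamma = -math.log(delta) / total
    # A rate above 1.0 is meaningless, so cap it (keep every item).
    return min(1.0, p + gamma + math.sqrt(gamma * gamma + 2.0 * gamma * p))


if __name__ == "__main__":
    # Oversampling shrinks relative to p as the data set grows.
    for n in (1000, 100000, 10000000):
        print(n, sampling_fraction(n // 10, n))
```

The extra margin over sample_size/total shrinks as `total` grows, which is why oversampling by a small amount and then trimming to the requested size stays cheap on large data sets.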
Author: Doris Xin <doris.s.xin@gmail.com>
Author: dorx <doris.s.xin@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #916 from dorx/takeSample and squashes the following commits:
5b061ae [Doris Xin] merge master
444e750 [Doris Xin] edge cases
3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939
82dde31 [Xiangrui Meng] update pyspark's takeSample
48d954d [Doris Xin] remove unused imports from RDDSuite
fb1452f [Doris Xin] allowing num to be greater than count in all cases
1481b01 [Doris Xin] washing test tubes and making coffee
dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
64e445b [Doris Xin] logwarnning as soon as it enters the while loop
55518ed [Doris Xin] added TODO for logging in rdd.py
eff89e2 [Doris Xin] addressed reviewer comments.
ecab508 [Doris Xin] "fixed checkstyle violation
0a9b3e3 [Doris Xin] "reviewer comment addressed"
f80f270 [Doris Xin] Merge branch 'master' into takeSample
ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
065ebcd [Doris Xin] Merge branch 'master' into takeSample
9bdd36e [Doris Xin] Check sample size and move computeFraction
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
Diffstat (limited to 'project')
-rw-r--r-- | project/SparkBuild.scala | 1 |
1 files changed, 1 insertions, 0 deletions
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 8b4885d3bb..2d60a44f04 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -349,6 +349,7 @@ object SparkBuild extends Build {
     libraryDependencies ++= Seq(
         "com.google.guava" % "guava" % "14.0.1",
         "org.apache.commons" % "commons-lang3" % "3.3.2",
+        "org.apache.commons" % "commons-math3" % "3.3" % "test",
         "com.google.code.findbugs" % "jsr305" % "1.3.9",
         "log4j" % "log4j" % "1.2.17",
         "org.slf4j" % "slf4j-api" % slf4jVersion,