author: Doris Xin <doris.s.xin@gmail.com> 2014-06-12 19:44:27 -0700
committer: Xiangrui Meng <meng@databricks.com> 2014-06-12 19:44:27 -0700
commit: 1de1d703bf6b7ca14f7b40bbefe9bf6fd6c8ce47
tree: f99459c7412db3dd9479037c41e5a4055853ae09 /core/pom.xml
parent: 0154587ab71d1b864f97497dbb38bc52b87675be
SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.

Author: Doris Xin <doris.s.xin@gmail.com>
Author: dorx <doris.s.xin@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #916 from dorx/takeSample and squashes the following commits:

5b061ae [Doris Xin] merge master
444e750 [Doris Xin] edge cases
3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939
82dde31 [Xiangrui Meng] update pyspark's takeSample
48d954d [Doris Xin] remove unused imports from RDDSuite
fb1452f [Doris Xin] allowing num to be greater than count in all cases
1481b01 [Doris Xin] washing test tubes and making coffee
dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
64e445b [Doris Xin] logwarnning as soon as it enters the while loop
55518ed [Doris Xin] added TODO for logging in rdd.py
eff89e2 [Doris Xin] addressed reviewer comments.
ecab508 [Doris Xin] "fixed checkstyle violation
0a9b3e3 [Doris Xin] "reviewer comment addressed"
f80f270 [Doris Xin] Merge branch 'master' into takeSample
ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
065ebcd [Doris Xin] Merge branch 'master' into takeSample
9bdd36e [Doris Xin] Check sample size and move computeFraction
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
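
The commit message mentions a sampling rate slightly above sample_size/total, chosen so that a single Bernoulli pass returns at least sample_size items with probability >= 0.9999. The sketch below shows one way such a rate can be derived for sampling without replacement, using a Chernoff lower-tail bound. It is an illustration under that assumption, not a copy of the code in this commit; the function name compute_fraction and its parameters are hypothetical.

```python
import math


def compute_fraction(num: int, total: int, delta: float = 1e-4) -> float:
    """Return a Bernoulli sampling rate q >= num/total (hypothetical helper).

    Assumption: each of `total` items is kept independently with
    probability q (sampling without replacement). By a Chernoff
    lower-tail bound, the number of kept items is at least `num` with
    probability >= 1 - delta when
        q >= p + gamma + sqrt(gamma^2 + 2 * gamma * p),
    where p = num / total and gamma = -ln(delta) / total.
    """
    if total <= 0 or num < 0:
        raise ValueError("total must be positive and num non-negative")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    p = num / total
    gamma = -math.log(delta) / total
    # A probability cannot exceed 1, so cap the rate (covers num >= total).
    return min(1.0, p + gamma + math.sqrt(gamma * gamma + 2.0 * gamma * p))


if __name__ == "__main__":
    # Oversampling needed to draw 100 of 10,000 items with delta = 1e-4.
    print(compute_fraction(100, 10_000))
```

The default delta = 1e-4 corresponds to the 0.9999 success rate quoted above. The returned rate is deliberately larger than num/total, so a caller would still trim the sampled items down to exactly num afterwards.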
Diffstat (limited to 'core/pom.xml')
-rw-r--r-- core/pom.xml | 5
1 files changed, 5 insertions, 0 deletions
diff --git a/core/pom.xml b/core/pom.xml
index c3d6b00a44..be56911b9e 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -68,6 +68,11 @@
       <artifactId>commons-lang3</artifactId>
     </dependency>
     <dependency>
+      <groupId>org.apache.commons</groupId>
+      <artifactId>commons-math3</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
       <groupId>com.google.code.findbugs</groupId>
       <artifactId>jsr305</artifactId>
     </dependency>