author | jbencook <jbenjamincook@gmail.com> | 2014-12-23 17:46:24 -0800 |
---|---|---|
committer | Josh Rosen <joshrosen@databricks.com> | 2014-12-23 17:46:24 -0800 |
commit | fd41eb9574280b5cfee9b94b4f92e4c44363fb14 (patch) | |
tree | 1637c2f5ee28a4f0990093c934ac4aa1ced81167 /ec2/deploy.generic | |
parent | 7e2deb71c4239564631b19c748e95c3d1aa1c77d (diff) | |
[SPARK-4860][pyspark][sql] speeding up `sample()` and `takeSample()`
This PR modifies the Python `SchemaRDD` to use `sample()` and `takeSample()` from Scala instead of the slower Python implementations from `rdd.py`. This is worthwhile because the `Row` objects are already serialized as Java objects.
To use the faster `takeSample()`, this PR adds a `takeSampleToPython()` method to `SchemaRDD.scala`, following the pattern of `collectToPython()`.
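The delegation pattern described above can be illustrated with a minimal, Spark-free sketch. This is not the code from the PR: `FakeJvmSchemaRDD`, `PySchemaRDD`, `take_sample_to_python`, and the batch size of 100 are hypothetical stand-ins, and `pickle` is only assumed here as the transport format. The point is the shape of the change: the Python wrapper forwards `sample()` to the JVM-side object, and `takeSample()` asks a helper on that side for serialized batches of rows instead of sampling in Python.

```python
import pickle
import random


class FakeJvmSchemaRDD:
    """Stand-in for the JVM-side SchemaRDD (hypothetical; no Spark involved)."""

    def __init__(self, rows):
        self._rows = list(rows)

    def sample(self, with_replacement, fraction, seed):
        # Sampling runs on the "JVM" side and returns another JVM-side object.
        rng = random.Random(seed)
        if with_replacement:
            k = int(round(len(self._rows) * fraction))
            picked = [rng.choice(self._rows) for _ in range(k)]
        else:
            picked = [r for r in self._rows if rng.random() < fraction]
        return FakeJvmSchemaRDD(picked)

    def take_sample_to_python(self, with_replacement, num, seed, batch_size=100):
        # Same idea as a collect-to-Python helper: sample on the "JVM" side,
        # then hand back pickled batches of rows for Python to deserialize.
        rng = random.Random(seed)
        if with_replacement:
            picked = [rng.choice(self._rows) for _ in range(num)] if self._rows else []
        else:
            picked = rng.sample(self._rows, min(num, len(self._rows)))
        return [pickle.dumps(picked[i:i + batch_size])
                for i in range(0, len(picked), batch_size)]


class PySchemaRDD:
    """Hypothetical Python wrapper that delegates instead of sampling itself."""

    def __init__(self, jrdd):
        self._jrdd = jrdd

    def sample(self, with_replacement, fraction, seed=0):
        # Wrap the JVM-side result; no rows cross into Python here.
        return PySchemaRDD(self._jrdd.sample(with_replacement, fraction, seed))

    def take_sample(self, with_replacement, num, seed=0):
        # Only the sampled rows are deserialized on the Python side.
        batches = self._jrdd.take_sample_to_python(with_replacement, num, seed)
        return [row for batch in batches for row in pickle.loads(batch)]
```

In this sketch, `PySchemaRDD(FakeJvmSchemaRDD(rows)).take_sample(False, 120, seed=42)` deserializes only the 120 sampled rows, which is the cost the delegation is meant to avoid paying for the full dataset.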
Author: jbencook <jbenjamincook@gmail.com>
Author: J. Benjamin Cook <jbenjamincook@gmail.com>
Closes #3764 from jbencook/master and squashes the following commits:
6fbc769 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing sloppy indentation for takeSampleToPython() arguments
5170da2 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing typo: from RDD to SchemaRDD
de22f70 [jbencook] [SPARK-4860][pyspark][sql] using sample() method from JavaSchemaRDD
b916442 [jbencook] [SPARK-4860][pyspark][sql] adding sample() to JavaSchemaRDD
020cbdf [jbencook] [SPARK-4860][pyspark][sql] using Scala implementations of `sample()` and `takeSample()`