diff options
author | Andrew Ash <andrew@andrewash.com> | 2014-06-17 11:47:48 -0700 |
---|---|---|
committer | Reynold Xin <rxin@apache.org> | 2014-06-17 11:47:48 -0700 |
commit | b92d16b114fd49e881d09e7974ad57b2a0df2906 (patch) | |
tree | 4bede6fbb3f5c230bc545a7464d0c1805b199b08 /python/pyspark/rdd.py | |
parent | e243c5ffacd70ecadaf5c91668955dcc8141e060 (diff) | |
download | spark-b92d16b114fd49e881d09e7974ad57b2a0df2906.tar.gz spark-b92d16b114fd49e881d09e7974ad57b2a0df2906.tar.bz2 spark-b92d16b114fd49e881d09e7974ad57b2a0df2906.zip |
SPARK-1063 Add .sortBy(f) method on RDD
This never got merged from the apache/incubator-spark repo (which is now deleted) but there had been several rounds of code review on this PR there.
I think this is ready for merging.
Author: Andrew Ash <andrew@andrewash.com>
This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@apache.org>
Closes #369 from ash211/sortby and squashes the following commits:
d09147a [Andrew Ash] Fix Ordering import
43d0a53 [Andrew Ash] Fix missing .collect()
29a54ed [Andrew Ash] Re-enable test by converting to a closure
5a95348 [Andrew Ash] Add license for RDDSuiteUtils
64ed6e3 [Andrew Ash] Remove leaked diff
d4de69a [Andrew Ash] Remove scar tissue
63638b5 [Andrew Ash] Add Python version of .sortBy()
45e0fde [Andrew Ash] Add Java version of .sortBy()
adf84c5 [Andrew Ash] Re-indent to keep line lengths under 100 chars
9d9b9d8 [Andrew Ash] Use parentheses on .collect() calls
0457b69 [Andrew Ash] Ignore failing test
99f0baf [Andrew Ash] Merge branch 'master' into sortby
222ae97 [Andrew Ash] Try moving Ordering objects out to a different class
3fd0dd3 [Andrew Ash] Add (failing) test for sortByKey with explicit Ordering
b8b5bbc [Andrew Ash] Align remove extra spaces that were used to align ='s in test code
8c53298 [Andrew Ash] Actually use ascending and numPartitions parameters
381eef2 [Andrew Ash] Correct silly typo
7db3e84 [Andrew Ash] Support ascending and numPartitions params in sortBy()
0f685fd [Andrew Ash] Merge remote-tracking branch 'origin/master' into sortby
ca4490d [Andrew Ash] Add .sortBy(f) method on RDD
Diffstat (limited to 'python/pyspark/rdd.py')
-rw-r--r-- | python/pyspark/rdd.py | 12 |
1 files changed, 12 insertions, 0 deletions
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py index bb4d035edc..65f63153cd 100644 --- a/python/pyspark/rdd.py +++ b/python/pyspark/rdd.py @@ -549,6 +549,18 @@ class RDD(object): .mapPartitions(mapFunc,preservesPartitioning=True) .flatMap(lambda x: x, preservesPartitioning=True)) + def sortBy(self, keyfunc, ascending=True, numPartitions=None): + """ + Sorts this RDD by the given keyfunc + + >>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)] + >>> sc.parallelize(tmp).sortBy(lambda x: x[0]).collect() + [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)] + >>> sc.parallelize(tmp).sortBy(lambda x: x[1]).collect() + [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)] + """ + return self.keyBy(keyfunc).sortByKey(ascending, numPartitions).values() + def glom(self): """ Return an RDD created by coalescing all elements within each partition |