author Andrew Ash <andrew@andrewash.com> 2014-06-17 11:47:48 -0700
committer Reynold Xin <rxin@apache.org> 2014-06-17 11:47:48 -0700
commit b92d16b114fd49e881d09e7974ad57b2a0df2906
tree 4bede6fbb3f5c230bc545a7464d0c1805b199b08 /python/pyspark/rdd.py
parent e243c5ffacd70ecadaf5c91668955dcc8141e060
SPARK-1063 Add .sortBy(f) method on RDD

This never got merged from the apache/incubator-spark repo (which is now deleted) but there had been several rounds of code review on this PR there.

I think this is ready for merging.

Author: Andrew Ash <andrew@andrewash.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@apache.org>

Closes #369 from ash211/sortby and squashes the following commits:

d09147a [Andrew Ash] Fix Ordering import
43d0a53 [Andrew Ash] Fix missing .collect()
29a54ed [Andrew Ash] Re-enable test by converting to a closure
5a95348 [Andrew Ash] Add license for RDDSuiteUtils
64ed6e3 [Andrew Ash] Remove leaked diff
d4de69a [Andrew Ash] Remove scar tissue
63638b5 [Andrew Ash] Add Python version of .sortBy()
45e0fde [Andrew Ash] Add Java version of .sortBy()
adf84c5 [Andrew Ash] Re-indent to keep line lengths under 100 chars
9d9b9d8 [Andrew Ash] Use parentheses on .collect() calls
0457b69 [Andrew Ash] Ignore failing test
99f0baf [Andrew Ash] Merge branch 'master' into sortby
222ae97 [Andrew Ash] Try moving Ordering objects out to a different class
3fd0dd3 [Andrew Ash] Add (failing) test for sortByKey with explicit Ordering
b8b5bbc [Andrew Ash] Align remove extra spaces that were used to align ='s in test code
8c53298 [Andrew Ash] Actually use ascending and numPartitions parameters
381eef2 [Andrew Ash] Correct silly typo
7db3e84 [Andrew Ash] Support ascending and numPartitions params in sortBy()
0f685fd [Andrew Ash] Merge remote-tracking branch 'origin/master' into sortby
ca4490d [Andrew Ash] Add .sortBy(f) method on RDD

Diffstat (limited to 'python/pyspark/rdd.py')
-rw-r--r-- python/pyspark/rdd.py 12
1 files changed, 12 insertions, 0 deletions
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index bb4d035edc..65f63153cd 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -549,6 +549,18 @@ class RDD(object):
                     .mapPartitions(mapFunc,preservesPartitioning=True)
                     .flatMap(lambda x: x, preservesPartitioning=True))
 
+    def sortBy(self, keyfunc, ascending=True, numPartitions=None):
+        """
+        Sorts this RDD by the given keyfunc
+
+        >>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
+        >>> sc.parallelize(tmp).sortBy(lambda x: x[0]).collect()
+        [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
+        >>> sc.parallelize(tmp).sortBy(lambda x: x[1]).collect()
+        [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
+        """
+        return self.keyBy(keyfunc).sortByKey(ascending, numPartitions).values()
+
     def glom(self):
         """
         Return an RDD created by coalescing all elements within each partition
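
The added method is a one-line composition of three existing RDD operations: keyBy, sortByKey and values. The sketch below mirrors that composition on plain Python lists so it can be run without a SparkContext. The helpers key_by, sort_by_key, values and sort_by are hypothetical stand-ins written for illustration, not PySpark APIs, and the sketch ignores numPartitions and everything else about distributed execution.

```python
# Local, list-based stand-ins for the RDD operations used by sortBy().
# These helpers are hypothetical illustrations, not part of PySpark.

def key_by(items, keyfunc):
    """Pair each element with its sort key, in the spirit of RDD.keyBy()."""
    return [(keyfunc(x), x) for x in items]

def sort_by_key(pairs, ascending=True):
    """Order (key, value) pairs by key only, in the spirit of RDD.sortByKey()."""
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)

def values(pairs):
    """Drop the keys again, in the spirit of RDD.values()."""
    return [v for _, v in pairs]

def sort_by(items, keyfunc, ascending=True):
    """Compose the three steps in the same order as the patch."""
    return values(sort_by_key(key_by(items, keyfunc), ascending))

# Same sample data as the doctest in the patch.
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
by_first = sort_by(tmp, lambda x: x[0])
by_second_desc = sort_by(tmp, lambda x: x[1], ascending=False)
print(by_first)
print(by_second_desc)
```

On this data, by_first matches the first doctest result in the patch. The descending call shows how the ascending flag is simply passed through to the key-sorting step.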