path: root/python/pyspark/rdd.py
Commit message (Author, Date; Files changed, Lines +/-)
...
* Spark 1246 add min max to stat counter (Dan McClary, 2014-03-18; 1 file, +19/-0)
    Here's the addition of min and max to statscounter.py and min and max methods to rdd.py.
    Author: Dan McClary <dan.mcclary@gmail.com>
    Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits:
      fd3fd4b [Dan McClary] fixed error, updated test
      82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter
      5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark
      21dd366 [Dan McClary] added max and min to StatCounter output, updated doc
      1a97558 [Dan McClary] added max and min to StatCounter output, updated doc
      a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter
      ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py
      1e7056d [Dan McClary] added underscore to getBucket
      37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived
      29981f2 [Dan McClary] fixed indentation on doctest comment
      eaf89d9 [Dan McClary] added correct doctest for histogram
      4916016 [Dan McClary] added histogram method, added max and min to statscounter
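    A minimal sketch of the resulting API, assuming StatCounter exposes min()/max() accessors as the squash log suggests:

        >>> rdd = sc.parallelize([2.0, 5.0, 1.0, 4.0])
        >>> rdd.min(), rdd.max()
        (1.0, 5.0)
        >>> st = rdd.stats()   # StatCounter now carries min/max alongside count/mean/stdev
        >>> st.min(), st.max()
        (1.0, 5.0)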
* SPARK-1240: handle the case of empty RDD when takeSample (CodingCat, 2014-03-16; 1 file, +4/-0)
    https://spark-project.atlassian.net/browse/SPARK-1240
    The previous implementation did not handle the empty-RDD case when running takeSample. In this patch, before calling sample() inside the takeSample API, a check is added for this case so that an empty Array is returned for an empty RDD; in sample(), a check is added for invalid fraction values. Several lines are also added to the test case for this scenario.
    Author: CodingCat <zhunansjtu@gmail.com>
    Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:
      fef57d4 [CodingCat] fix the same problem in PySpark
      36db06b [CodingCat] create new test cases for takeSample from an empty RDD
      810948d [CodingCat] further fix
      a40e8fb [CodingCat] replace if with require
      ad483fd [CodingCat] handle the case with empty RDD when take sample
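    A quick sketch of the fixed PySpark behavior (the explicit seed argument reflects the takeSample signature of this era and is an assumption):

        >>> empty = sc.parallelize(range(10)).filter(lambda x: x > 100)   # an empty RDD
        >>> empty.takeSample(False, 5, 1)
        []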
* SPARK-1162 Added top in python. (Prashant Sharma, 2014-03-12; 1 file, +25/-0)
    Author: Prashant Sharma <prashant.s@imaginea.com>
    Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits:
      ece1fa4 [Prashant Sharma] Added top in python.
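    A minimal sketch of RDD.top(), which returns the n largest elements in descending order; the results are collected to the driver, so n should be small:

        >>> sc.parallelize([10, 4, 2, 12, 3]).top(3)
        [12, 10, 4]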
* Spark-1163, Added missing Python RDD functions (prabinb, 2014-03-11; 1 file, +42/-0)
    Author: prabinb <prabin.banka@imaginea.com>
    Closes #92 from prabinb/python-api-rdd and squashes the following commits:
      51129ca [prabinb] Added missing Python RDD functions. Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
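    A small sketch of the RDD.getStorageLevel() doctest mentioned above; the exact fields shown in the StorageLevel repr are an assumption:

        >>> rdd = sc.parallelize([1, 2])
        >>> rdd.getStorageLevel()   # StorageLevel now has a readable __repr__
        StorageLevel(False, False, False, 1)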
* SPARK-1168, Added foldByKey to pyspark. (Prashant Sharma, 2014-03-10; 1 file, +14/-0)
    Author: Prashant Sharma <prashant.s@imaginea.com>
    Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits:
      db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
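    A minimal sketch of foldByKey(), which merges the values for each key using an associative function and a neutral zero value:

        >>> from operator import add
        >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
        >>> sorted(rdd.foldByKey(0, add).collect())
        [('a', 2), ('b', 1)]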
* [SPARK-972] Added detailed callsite info for ValueError in context.py (resubmitted) (jyotiska, 2014-03-10; 1 file, +14/-7)
    Author: jyotiska <jyotiska123@gmail.com>
    Closes #34 from jyotiska/pyspark_code and squashes the following commits:
      c9439be [jyotiska] replaced dict with namedtuple
      a6bf4cd [jyotiska] added callsite info for context.py
* SPARK-977 Added Python RDD.zip function (Prabin Banka, 2014-03-10; 1 file, +19/-1)
    Was raised earlier as a part of apache/incubator-spark#486.
    Author: Prabin Banka <prabin.banka@imaginea.com>
    Closes #76 from prabinb/python-api-zip and squashes the following commits:
      b1a31a0 [Prabin Banka] Added Python RDD.zip function
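    A minimal sketch of RDD.zip(), which pairs up the elements of two RDDs; both sides must have the same number of partitions and the same number of elements per partition:

        >>> x = sc.parallelize(range(0, 3))
        >>> y = sc.parallelize(range(1000, 1003))
        >>> x.zip(y).collect()
        [(0, 1000), (1, 1001), (2, 1002)]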
* Spark 1165 rdd.intersection in python and java (Prashant Sharma, 2014-03-07; 1 file, +17/-0)
    Author: Prashant Sharma <prashant.s@imaginea.com>
    Author: Prashant Sharma <scrapcodes@gmail.com>
    Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits:
      9b015e9 [Prashant Sharma] Added a note, shuffle is required for intersection.
      1fea813 [Prashant Sharma] correct the lines wrapping
      d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java
      d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.
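    A minimal sketch of intersection(); as the squash log notes, a shuffle is required, and the output contains no duplicates:

        >>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
        >>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
        >>> sorted(rdd1.intersection(rdd2).collect())
        [1, 2, 3]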
* SPARK-1187, Added missing Python APIs (Prabin Banka, 2014-03-06; 1 file, +7/-0)
    The following Python APIs are added (see the sketch after this entry):
      RDD.id()
      SparkContext.setJobGroup()
      SparkContext.setLocalProperty()
      SparkContext.getLocalProperty()
      SparkContext.sparkUser()
    Was raised earlier as a part of apache/incubator-spark#486.
    Author: Prabin Banka <prabin.banka@imaginea.com>
    Closes #75 from prabinb/python-api-backup and squashes the following commits:
      cc3c6cd [Prabin Banka] Added missing Python APIs
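    A hedged sketch of how these calls fit together; the group id, description, and property key/value below are hypothetical examples:

        >>> sc.setJobGroup("nightly-etl", "nightly ETL run")   # hypothetical group id and description
        >>> sc.setLocalProperty("job.owner", "alice")          # hypothetical property key/value
        >>> sc.getLocalProperty("job.owner")
        'alice'
        >>> user = sc.sparkUser()               # name of the user running this SparkContext
        >>> rdd_id = sc.parallelize([1]).id()   # unique id of the RDD within its context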
* SPARK-1109 wrong API docs for pyspark map function (Prashant Sharma, 2014-03-04; 1 file, +1/-1)
    Author: Prashant Sharma <prashant.s@imaginea.com>
    Closes #73 from ScrapCodes/SPARK-1109/wrong-API-docs and squashes the following commits:
      1a55b58 [Prashant Sharma] SPARK-1109 wrong API docs for pyspark map function
* doctest updated for mapValues, flatMapValues in rdd.py (jyotiska, 2014-02-22; 1 file, +10/-0)
    Updated doctests for mapValues and flatMapValues in rdd.py.
    Author: jyotiska <jyotiska123@gmail.com>
    Closes #621 from jyotiska/python_spark and squashes the following commits:
      716f7cd [jyotiska] doctest updated for mapValues, flatMapValues in rdd.py
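    Doctests of roughly this shape (a close paraphrase, not the exact text that was added):

        >>> x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
        >>> x.mapValues(len).collect()               # transform each value, keep the key
        [('a', 3), ('b', 1)]
        >>> x.flatMapValues(lambda v: v).collect()   # one output pair per element of each value
        [('a', 'apple'), ('a', 'banana'), ('a', 'lemon'), ('b', 'grapes')]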
* Merge pull request #498 from ScrapCodes/python-api. Closes #498. (Prashant Sharma, 2014-02-06; 1 file, +60/-0)
    Python api additions
    Author: Prashant Sharma <prashant.s@imaginea.com>
    == Merge branch commits ==
      8b51591f1a7a79a62c13ee66ff8d83040f7eccd8 (Prashant Sharma, Fri Jan 24 11:50:29 2014 +0530): Josh's and Patrick's review comments.
      d37f9677838e43bef6c18ef61fbf08055ba6d1ca (Prashant Sharma, Thu Jan 23 17:27:17 2014 +0530): fixed doc tests
      27cb54bf5c99b1ea38a73858c291d0a1c43d8b7c (Prashant Sharma, Thu Jan 23 16:48:43 2014 +0530): Added keys and values methods for PairFunctions in python
      4ce76b396fbaefef2386d7a36d611572bdef9b5d (Prashant Sharma, Thu Jan 23 13:51:26 2014 +0530): Added foreachPartition
      05f05341a187cba829ac0e6c2bdf30be49948c89 (Prashant Sharma, Thu Jan 23 13:02:59 2014 +0530): Added coalesce function to python API
      6568d2c2fa14845dc56322c0f39ba2e13b3b26dd (Prashant Sharma, Thu Jan 23 12:52:44 2014 +0530): added repartition function to python API.
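    A minimal sketch of the methods this merge added (keys, values, foreachPartition, coalesce, repartition); glom() is used here only as a way to count partitions:

        >>> pairs = sc.parallelize([(1, 2), (3, 4)])
        >>> pairs.keys().collect()
        [1, 3]
        >>> pairs.values().collect()
        [2, 4]
        >>> def show(iterator):            # runs once per partition, on the workers
        ...     for x in iterator:
        ...         print(x)
        >>> pairs.foreachPartition(show)   # output goes to worker stdout
        >>> len(sc.parallelize(range(10), 4).coalesce(2).glom().collect())     # shrink, no shuffle
        2
        >>> len(sc.parallelize(range(10), 2).repartition(4).glom().collect())  # full reshuffle
        4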
* Deprecate mapPartitionsWithSplit in PySpark. (Josh Rosen, 2014-01-23; 1 file, +21/-4)
    Also, replace the last reference to it in the docs. This fixes SPARK-1026.
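    The replacement name is mapPartitionsWithIndex; a minimal sketch:

        >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
        >>> def f(splitIndex, iterator):
        ...     yield splitIndex
        >>> rdd.mapPartitionsWithIndex(f).collect()   # one index per partition
        [0, 1]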
* Make Python function/line appear in the UI. (Tor Myklebust, 2013-12-28; 1 file, +55/-11)
* Merge pull request #276 from shivaram/collectPartition (Reynold Xin, 2013-12-19; 1 file, +6/-1)
    Add collectPartition to JavaRDD interface. This interface is useful for implementing `take` from other language frontends where the data is serialized. Also remove `takePartition` from PythonRDD and use `collectPartition` in rdd.py. Thanks @concretevitamin for the original change and tests.
  * Make collectPartitions take an array of partitions (Shivaram Venkataraman, 2013-12-19; 1 file, +6/-1)
      Change the implementation to use runJob instead of PartitionPruningRDD. Also update the unit tests and the python take implementation to use the new interface.
  * Add collectPartition to JavaRDD interface. (Shivaram Venkataraman, 2013-12-18; 1 file, +1/-1)
      Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
* Add toString to Java RDD, and __repr__ to Python RDD (Nick Pentreath, 2013-12-19; 1 file, +3/-0)
* Merge branch 'master' into akka-bug-fix (Prashant Sharma, 2013-12-11; 1 file, +4/-1)
    Conflicts:
      core/pom.xml
      core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
      pom.xml
      project/SparkBuild.scala
      streaming/pom.xml
      yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
  * Fix UnicodeEncodeError in PySpark saveAsTextFile(). (Josh Rosen, 2013-11-28; 1 file, +4/-1)
      Fixes SPARK-970.
* Merge branch 'master' into wip-scala-2.10 (Prashant Sharma, 2013-11-27; 1 file, +54/-43)
    Conflicts:
      core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
      core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
      core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
      core/src/main/scala/org/apache/spark/rdd/RDD.scala
      python/pyspark/rdd.py
  * FramedSerializer: _dumps => dumps, _loads => loads. (Josh Rosen, 2013-11-10; 1 file, +2/-2)
  * Send PySpark commands as bytes instead of strings. (Josh Rosen, 2013-11-10; 1 file, +6/-6)
  * Add custom serializer support to PySpark. (Josh Rosen, 2013-11-10; 1 file, +47/-39)
      For now, this only adds MarshalSerializer, but it lays the groundwork for supporting other custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().
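      A minimal sketch of selecting the serializer at context creation; marshal is faster than pickle but supports fewer Python types:

          >>> from pyspark import SparkContext
          >>> from pyspark.serializers import MarshalSerializer
          >>> sc = SparkContext("local", "serializer-demo", serializer=MarshalSerializer())
          >>> sc.parallelize(range(5)).map(lambda x: x * 2).collect()
          [0, 2, 4, 6, 8]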
  * Remove Pickle-wrapping of Java objects in PySpark. (Josh Rosen, 2013-11-03; 1 file, +7/-4)
      If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.
* Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10 (Prashant Sharma, 2013-10-10; 1 file, +53/-7)
  * Fix PySpark docs and an overly long line of code after fdbae41e (Matei Zaharia, 2013-10-09; 1 file, +8/-8)
  * SPARK-705: implement sortByKey() in PySpark (Andre Schumacher, 2013-10-07; 1 file, +47/-1)
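      A minimal sketch of sortByKey():

          >>> pairs = sc.parallelize([('b', 2), ('a', 1), ('c', 3)])
          >>> pairs.sortByKey().collect()                  # ascending by default
          [('a', 1), ('b', 2), ('c', 3)]
          >>> pairs.sortByKey(ascending=False).collect()
          [('c', 3), ('b', 2), ('a', 1)]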
* Merge branch 'master' into wip-merge-master (Prashant Sharma, 2013-10-08; 1 file, +6/-4)
    Conflicts:
      bagel/pom.xml
      core/pom.xml
      core/src/test/scala/org/apache/spark/ui/UISuite.scala
      examples/pom.xml
      mllib/pom.xml
      pom.xml
      project/SparkBuild.scala
      repl/pom.xml
      streaming/pom.xml
      tools/pom.xml
    In Scala 2.10 a shorter representation is used for naming artifacts, so the artifact names were changed to the shorter Scala version and made a property in the pom.
  * Fixing SPARK-602: PythonPartitioner (Andre Schumacher, 2013-10-04; 1 file, +6/-4)
      Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark.
* Merge branch 'master' of git://github.com/mesos/spark into scala-2.10 (Prashant Sharma, 2013-09-15; 1 file, +19/-0)
    Conflicts:
      core/src/main/scala/org/apache/spark/SparkContext.scala
      project/SparkBuild.scala
  * Export StorageLevel and refactor (Aaron Davidson, 2013-09-07; 1 file, +2/-1)
  * SPARK-660: Add StorageLevel support in Python (Aaron Davidson, 2013-09-05; 1 file, +18/-0)
      It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).
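      A minimal sketch of persisting an RDD with an explicit storage level (using the import path from the follow-up "Export StorageLevel" commit):

          >>> from pyspark import StorageLevel
          >>> rdd = sc.parallelize(range(100))
          >>> rdd = rdd.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk when memory is tight
          >>> rdd.count()
          100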
* Merged with master (Prashant Sharma, 2013-09-06; 1 file, +188/-20)
  * Merge pull request #861 from AndreSchumacher/pyspark_sampling_function (Matei Zaharia, 2013-08-31; 1 file, +55/-7)
      Pyspark sampling function
    * RDD sample() and takeSample() prototypes for PySpark (Andre Schumacher, 2013-08-28; 1 file, +55/-7)
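        A minimal sketch contrasting the two: sample() is a transformation that returns a new RDD, takeSample() an action that returns a list; the seed arguments are illustrative:

            >>> rdd = sc.parallelize(range(100))
            >>> sampled = rdd.sample(False, 0.1, 81)   # ~10% of elements, without replacement
            >>> picked = rdd.takeSample(True, 5, 81)   # exactly 5 elements, with replacement
            >>> len(picked)
            5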
  * PySpark: implementing subtractByKey(), subtract() and keyBy() (Andre Schumacher, 2013-08-28; 1 file, +37/-0)
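      A minimal sketch of the three additions:

          >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
          >>> y = sc.parallelize([("a", 3), ("c", None)])
          >>> sorted(x.subtractByKey(y).collect())   # drop pairs whose key appears in y
          [('b', 4), ('b', 5)]
          >>> sorted(x.subtract(y).collect())        # drop only exact (key, value) matches
          [('a', 1), ('a', 2), ('b', 4), ('b', 5)]
          >>> sc.parallelize(range(0, 3)).keyBy(lambda v: v * v).collect()
          [(0, 0), (1, 1), (4, 2)]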
  * Implementing SPARK-838: Add DoubleRDDFunctions methods to PySpark (Andre Schumacher, 2013-08-21; 1 file, +59/-1)
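      A minimal sketch of the numeric helpers this brought to rdd.py; the outputs use doctest ellipses since the floats are long:

          >>> rdd = sc.parallelize([1.0, 2.0, 3.0])
          >>> rdd.sum()
          6.0
          >>> rdd.mean()
          2.0
          >>> rdd.variance()   # population variance
          0.666...
          >>> rdd.stdev()      # population standard deviation
          0.816...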
  * Implementing SPARK-878 for PySpark: adding zip and egg files to context and passing it down to workers, which add these to their sys.path (Andre Schumacher, 2013-08-16; 1 file, +3/-1)
  * Do not inherit master's PYTHONPATH on workers. (Josh Rosen, 2013-07-29; 1 file, +2/-3)
      This fixes SPARK-832, an issue where PySpark would not work when the master and workers used different SPARK_HOME paths. This change may potentially break code that relied on the master's PYTHONPATH being used on workers. To have custom PYTHONPATH additions used on the workers, users should set a custom PYTHONPATH in spark-env.sh rather than setting it in the shell.
  * Use None instead of empty string as it's slightly smaller/faster (Matei Zaharia, 2013-07-29; 1 file, +1/-1)
  * Optimize Python foreach() to not return as many objects (Matei Zaharia, 2013-07-29; 1 file, +5/-1)
  * Optimize Python take() to not compute entire first partition (Matei Zaharia, 2013-07-29; 1 file, +9/-6)
  * Add Apache license headers and LICENSE and NOTICE files (Matei Zaharia, 2013-07-16; 1 file, +17/-0)
* PySpark: replacing class manifest by class tag for Scala 2.10.2 inside rdd.py (Andre Schumacher, 2013-08-30; 1 file, +2/-2)
* Fix Python saveAsTextFile doctest to not expect order to be preserved (Jey Kottalam, 2013-04-02; 1 file, +1/-1)
* Change numSplits to numPartitions in PySpark. (Josh Rosen, 2013-02-24; 1 file, +28/-28)
* Add commutative requirement for 'reduce' to Python docstring. (Mark Hamstra, 2013-02-09; 1 file, +2/-2)
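    A one-line illustration; the operator must be commutative as well as associative because partition results are combined in no fixed order:

        >>> from operator import add
        >>> sc.parallelize([1, 2, 3, 4, 5]).reduce(add)
        15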
* Fetch fewer objects in PySpark's take() method. (Josh Rosen, 2013-02-03; 1 file, +4/-0)
* Fix reporting of PySpark doctest failures. (Josh Rosen, 2013-02-03; 1 file, +3/-1)