path: root/python/pyspark/rdd.py
Commit message  (Author, Date; Files changed, Lines -/+)
* Make Python function/line appear in the UI.  (Tor Myklebust, 2013-12-28; 1 file, -11/+55)
|
*   Merge pull request #276 from shivaram/collectPartition  (Reynold Xin, 2013-12-19; 1 file, -1/+6)
|\
| |   Add collectPartition to JavaRDD interface. This interface is useful for
| |   implementing `take` from other language frontends where the data is
| |   serialized. Also remove `takePartition` from PythonRDD and use
| |   `collectPartition` in rdd.py. Thanks @concretevitamin for the original
| |   change and tests.
| |
| * Make collectPartitions take an array of partitions  (Shivaram Venkataraman, 2013-12-19; 1 file, -1/+6)
| |   Change the implementation to use runJob instead of PartitionPruningRDD.
| |   Also update the unit tests and the Python take implementation to use the
| |   new interface.
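      A rough Python sketch of the pattern these commits enable: take() can pull
      one partition at a time instead of collecting the whole RDD. Illustrative
      only, using present-day method names (getNumPartitions,
      mapPartitionsWithIndex); the real code in rdd.py calls collectPartition
      through the Java gateway instead:

          def take_sketch(rdd, n):
              # Gather partitions one by one until n elements are collected,
              # rather than shipping the entire RDD to the driver.
              items = []
              for pid in range(rdd.getNumPartitions()):
                  if len(items) >= n:
                      break
                  # Filtering down to a single partition stands in for the
                  # JavaRDD.collectPartitions() call the real code uses.
                  part = rdd.mapPartitionsWithIndex(
                      lambda i, it, pid=pid: it if i == pid else iter([])).collect()
                  items.extend(part[:n - len(items)])
              return items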
| * Add collectPartition to JavaRDD interface.  (Shivaram Venkataraman, 2013-12-18; 1 file, -1/+1)
| |   Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
| |
* | Add toString to Java RDD, and __repr__ to Python RDD  (Nick Pentreath, 2013-12-19; 1 file, -0/+3)
|/
*   Merge branch 'master' into akka-bug-fix  (Prashant Sharma, 2013-12-11; 1 file, -1/+4)
|\
| |   Conflicts:
| |     core/pom.xml
| |     core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
| |     pom.xml
| |     project/SparkBuild.scala
| |     streaming/pom.xml
| |     yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
| |
| * Fix UnicodeEncodeError in PySpark saveAsTextFile().  (Josh Rosen, 2013-11-28; 1 file, -1/+4)
| |   Fixes SPARK-970.
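      A sketch of the failure mode from the user's side (data and output path
      are placeholders): under Python 2, writing unicode objects through the
      default ASCII codec raises UnicodeEncodeError, so values had to be
      UTF-8 encoded before reaching the text writer; per the commit title,
      saveAsTextFile() now handles this itself:

          rdd = sc.parallelize([u"caf\u00e9", u"na\u00efve"])
          # Pre-fix workaround: encode to UTF-8 explicitly before saving.
          rdd.map(lambda s: s.encode("utf-8")).saveAsTextFile("/tmp/unicode-out")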
* | Merge branch 'master' into wip-scala-2.10  (Prashant Sharma, 2013-11-27; 1 file, -43/+54)
|\|   Conflicts:
| |     core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
| |     core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
| |     core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
| |     core/src/main/scala/org/apache/spark/rdd/RDD.scala
| |     python/pyspark/rdd.py
| |
| * FramedSerializer: _dumps => dumps, _loads => loads.  (Josh Rosen, 2013-11-10; 1 file, -2/+2)
| |
| * Send PySpark commands as bytes instead of strings.  (Josh Rosen, 2013-11-10; 1 file, -6/+6)
| |
| * Add custom serializer support to PySpark.  (Josh Rosen, 2013-11-10; 1 file, -39/+47)
| |   For now this only adds MarshalSerializer, but it lays the groundwork for
| |   supporting other custom serializers. Many of these mechanisms can also be
| |   used to support deserialization of different data formats sent by Java,
| |   such as data encoded by MsgPack. This also fixes a bug in
| |   SparkContext.union().
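      A minimal usage sketch of the serializer hook this adds (assuming a local
      master; marshal is faster than pickle but supports fewer Python types):

          from pyspark import SparkContext
          from pyspark.serializers import MarshalSerializer

          # Pass the serializer when constructing the context; RDDs created
          # from it use marshal instead of pickle for worker traffic.
          sc = SparkContext("local", "serializer-demo",
                            serializer=MarshalSerializer())
          print(sc.parallelize(range(10)).map(lambda x: x * 2).sum())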
| * Remove Pickle-wrapping of Java objects in PySpark.  (Josh Rosen, 2013-11-03; 1 file, -4/+7)
| |   If we support custom serializers, the Python worker will know what type
| |   of input to expect, so we won't need to wrap Tuple2 and Strings into
| |   pickled tuples and strings.
| |
* | Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10  (Prashant Sharma, 2013-10-10; 1 file, -7/+53)
|\|
| * Fix PySpark docs and an overly long line of code after fdbae41e  (Matei Zaharia, 2013-10-09; 1 file, -8/+8)
| |
| * SPARK-705: implement sortByKey() in PySpark  (Andre Schumacher, 2013-10-07; 1 file, -1/+47)
| |
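      Usage sketch for the new method (assuming a live SparkContext `sc`):

          pairs = sc.parallelize([("b", 2), ("c", 3), ("a", 1)])
          print(pairs.sortByKey().collect())
          # [('a', 1), ('b', 2), ('c', 3)]
          print(pairs.sortByKey(ascending=False).collect())
          # [('c', 3), ('b', 2), ('a', 1)]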
* | Merge branch 'master' into wip-merge-master  (Prashant Sharma, 2013-10-08; 1 file, -4/+6)
|\|   Conflicts:
| |     bagel/pom.xml
| |     core/pom.xml
| |     core/src/test/scala/org/apache/spark/ui/UISuite.scala
| |     examples/pom.xml
| |     mllib/pom.xml
| |     pom.xml
| |     project/SparkBuild.scala
| |     repl/pom.xml
| |     streaming/pom.xml
| |     tools/pom.xml
| |   In Scala 2.10 a shorter representation is used for naming artifacts, so
| |   the artifacts were switched to the shorter Scala version, made a property
| |   in the pom.
| |
| * Fixing SPARK-602: PythonPartitioner  (Andre Schumacher, 2013-10-04; 1 file, -4/+6)
| |   Currently PythonPartitioner determines the partition ID by hashing a
| |   byte-array representation of PySpark's key. This PR lets PythonPartitioner
| |   use the actual partition ID, which is required e.g. for sorting via
| |   PySpark.
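      A usage sketch of the Python-side partitioning this relies on
      (illustrative data; `partitionFunc` maps a key to a partition ID):

          pairs = sc.parallelize([(i, str(i)) for i in range(100)])
          parted = pairs.partitionBy(4, partitionFunc=lambda key: key % 4)
          # glom() exposes partition boundaries so placement can be inspected.
          print(parted.glom().map(len).collect())   # sizes of the 4 partitions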
* | Merge branch 'master' of git://github.com/mesos/spark into scala-2.10  (Prashant Sharma, 2013-09-15; 1 file, -0/+19)
|\|   Conflicts:
| |     core/src/main/scala/org/apache/spark/SparkContext.scala
| |     project/SparkBuild.scala
| |
| * Export StorageLevel and refactor  (Aaron Davidson, 2013-09-07; 1 file, -1/+2)
| |
| * SPARK-660: Add StorageLevel support in Python  (Aaron Davidson, 2013-09-05; 1 file, -0/+18)
| |   It uses reflection... I am not proud of that fact, but it at least ensures
| |   compatibility (sans refactoring of the StorageLevel stuff).
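      Usage sketch of the exported storage levels (data is illustrative):

          from pyspark import StorageLevel

          squares = sc.parallelize(range(1000)).map(lambda x: x * x)
          squares.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if needed
          print(squares.count())   # first action materializes the cached RDD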
* | Merged with master  (Prashant Sharma, 2013-09-06; 1 file, -20/+188)
|\|
| *   Merge pull request #861 from AndreSchumacher/pyspark_sampling_function  (Matei Zaharia, 2013-08-31; 1 file, -7/+55)
| |\    PySpark sampling function
| | |
| | * RDD sample() and takeSample() prototypes for PySpark  (Andre Schumacher, 2013-08-28; 1 file, -7/+55)
| | |
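      Sketch of the two prototypes (sample() is a lazy transformation,
      takeSample() an action; the seed keeps output reproducible):

          rdd = sc.parallelize(range(100))
          print(rdd.sample(False, 0.1, 42).collect())  # ~10%, no replacement
          print(rdd.takeSample(False, 5, 42))          # exactly 5, as a list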
| * | PySpark: implementing subtractByKey(), subtract() and keyBy()  (Andre Schumacher, 2013-08-28; 1 file, -0/+37)
| |/
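      Usage sketch for the three new methods (assuming a live `sc`):

          a = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
          b = sc.parallelize([("a", 3)])
          print(a.subtractByKey(b).collect())  # pairs whose key is absent in b
          print(sc.parallelize([1, 2, 3]).subtract(sc.parallelize([2])).collect())
          print(sc.parallelize(["apple", "fig"]).keyBy(len).collect())
          # [(5, 'apple'), (3, 'fig')]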
| * Implementing SPARK-838: Add DoubleRDDFunctions methods to PySpark  (Andre Schumacher, 2013-08-21; 1 file, -1/+59)
| |
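      Sketch of the numeric helpers this ports over:

          nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])
          print(nums.sum())     # 10.0
          print(nums.mean())    # 2.5
          print(nums.stdev())   # population standard deviation
          print(nums.stats())   # count, mean, stdev, max, min in one pass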
| * Implementing SPARK-878 for PySpark: adding zip and egg files to the context and passing them down to workers, which add them to their sys.path  (Andre Schumacher, 2013-08-16; 1 file, -1/+3)
| |
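      Usage sketch (the .egg path is a placeholder):

          # Ship a dependency archive to every worker; each worker appends it
          # to its sys.path before running tasks.
          sc.addPyFile("deps.egg")
          # Equivalent at construction time:
          # sc = SparkContext("local", "app", pyFiles=["deps.egg"])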
| * Do not inherit master's PYTHONPATH on workers.  (Josh Rosen, 2013-07-29; 1 file, -3/+2)
| |   This fixes SPARK-832, an issue where PySpark would not work when the
| |   master and workers used different SPARK_HOME paths. This change may
| |   potentially break code that relied on the master's PYTHONPATH being used
| |   on workers. To have custom PYTHONPATH additions used on the workers,
| |   users should set a custom PYTHONPATH in spark-env.sh rather than setting
| |   it in the shell.
| |
| * Use None instead of empty string as it's slightly smaller/faster  (Matei Zaharia, 2013-07-29; 1 file, -1/+1)
| |
| * Optimize Python foreach() to not return as many objects  (Matei Zaharia, 2013-07-29; 1 file, -1/+5)
| |
| * Optimize Python take() to not compute entire first partition  (Matei Zaharia, 2013-07-29; 1 file, -6/+9)
| |
| * Add Apache license headers and LICENSE and NOTICE files  (Matei Zaharia, 2013-07-16; 1 file, -0/+17)
| |
* | PySpark: replacing class manifest by class tag for Scala 2.10.2 inside rdd.py  (Andre Schumacher, 2013-08-30; 1 file, -2/+2)
|/
* Fix Python saveAsTextFile doctest to not expect order to be preserved  (Jey Kottalam, 2013-04-02; 1 file, -1/+1)
|
* Change numSplits to numPartitions in PySpark.  (Josh Rosen, 2013-02-24; 1 file, -28/+28)
|
* Add commutative requirement for 'reduce' to Python docstring.  (Mark Hamstra, 2013-02-09; 1 file, -2/+2)
|
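      An example of why the docstring needs the warning: reduce() combines
      per-partition results in an unspecified order, so the operator must be
      commutative as well as associative:

          from operator import add

          rdd = sc.parallelize([1, 2, 3, 4, 5], 2)
          print(rdd.reduce(add))  # 15; addition is commutative and associative
          # A non-commutative op such as subtraction would give
          # partition-order-dependent results and is NOT safe here.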
* Fetch fewer objects in PySpark's take() method.  (Josh Rosen, 2013-02-03; 1 file, -0/+4)
|
* Fix reporting of PySpark doctest failures.  (Josh Rosen, 2013-02-03; 1 file, -1/+3)
|
* Use spark.local.dir for PySpark temp files (SPARK-580).  (Josh Rosen, 2013-02-01; 1 file, -6/+1)
|
* Do not launch JavaGateways on workers (SPARK-674).  (Josh Rosen, 2013-02-01; 1 file, -6/+6)
|   The problem was that the gateway was being initialized whenever the
|   pyspark.context module was loaded. The fix uses lazy initialization that
|   occurs only when SparkContext instances are actually constructed. I also
|   made the gateway and jvm variables private. This change results in a
|   ~3-4x performance improvement when running the PySpark unit tests.
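      A generic sketch of the lazy-initialization pattern described above
      (names are illustrative, not the actual PySpark internals;
      `launch_gateway` is a hypothetical stand-in for the expensive startup):

          _gateway = None

          def _ensure_gateway():
              # Pay the JVM startup cost on first use, not at import time, so
              # a worker that merely imports the module never launches one.
              global _gateway
              if _gateway is None:
                  _gateway = launch_gateway()  # hypothetical launcher
              return _gateway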
*   Merge pull request #389 from JoshRosen/python_rdd_checkpointing  (Matei Zaharia, 2013-01-20; 1 file, -1/+34)
|\    Add checkpointing to the Python API
| |
| * Clean up setup code in PySpark checkpointing tests  (Josh Rosen, 2013-01-20; 1 file, -2/+1)
| |
| * Update checkpointing API docs in Python/Java.  (Josh Rosen, 2013-01-20; 1 file, -12/+5)
| |
| * Add checkpointFile() and more tests to PySpark.  (Josh Rosen, 2013-01-20; 1 file, -1/+8)
| |
| * Add RDD checkpointing to Python API.  (Josh Rosen, 2013-01-20; 1 file, -0/+34)
| |
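      Usage sketch of the checkpointing API these commits add (the directory
      is a placeholder):

          sc.setCheckpointDir("/tmp/spark-checkpoints")
          rdd = sc.parallelize(range(10)).map(lambda x: x + 1)
          rdd.checkpoint()              # lineage is truncated at the next action
          rdd.count()
          print(rdd.isCheckpointed())   # True
          print(rdd.getCheckpointFile())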
* | Fix PythonPartitioner equality; see SPARK-654.  (Josh Rosen, 2013-01-20; 1 file, -6/+11)
|/    PythonPartitioner did not take the Python-side partitioning function into
|     account when checking for equality, which might cause problems in the
|     future.
|
* Added accumulators to PySpark  (Matei Zaharia, 2013-01-20; 1 file, -1/+1)
|
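      Usage sketch: workers may only add to an accumulator, while the driver
      reads .value:

          acc = sc.accumulator(0)
          sc.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))
          print(acc.value)   # 10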
* Add mapPartitionsWithSplit() to PySpark.  (Josh Rosen, 2013-01-08; 1 file, -11/+22)
|
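      Sketch of the new method: the function receives the partition index
      (then called a "split") plus an iterator over that partition. Later
      Spark versions rename this mapPartitionsWithIndex:

          rdd = sc.parallelize(range(8), 4)
          tagged = rdd.mapPartitionsWithSplit(
              lambda split, it: ((split, x) for x in it))
          print(tagged.collect())  # each element tagged with its partition index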
* Change PySpark RDD.take() to not call iterator().  (Josh Rosen, 2013-01-03; 1 file, -6/+5)
|
* Rename top-level 'pyspark' directory to 'python'  (Josh Rosen, 2013-01-01; 1 file, -0/+713)