path: root/python/pyspark
Commit message (Author, Age, Files, Lines)
* Merge pull request #542 from markhamstra/versionBump. Closes #542. (Mark Hamstra, 2014-02-08, 1 file, -1/+1)
|     Version number to 1.0.0-SNAPSHOT
|
|     Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore.
|
|     @pwendell
|
|     Author: Mark Hamstra <markhamstra@gmail.com>
|
|     == Merge branch commits ==
|
|     commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71
|     Author: Mark Hamstra <markhamstra@gmail.com>
|     Date: Wed Feb 5 09:30:32 2014 -0800
|
|         Version number to 1.0.0-SNAPSHOT
* Merge pull request #498 from ScrapCodes/python-api. Closes #498. (Prashant Sharma, 2014-02-06, 1 file, -0/+60)
|     Python API additions
|
|     Author: Prashant Sharma <prashant.s@imaginea.com>
|
|     == Merge branch commits ==
|
|     commit 8b51591f1a7a79a62c13ee66ff8d83040f7eccd8
|     Author: Prashant Sharma <prashant.s@imaginea.com>
|     Date: Fri Jan 24 11:50:29 2014 +0530
|
|         Josh's and Patrick's review comments.
|
|     commit d37f9677838e43bef6c18ef61fbf08055ba6d1ca
|     Author: Prashant Sharma <prashant.s@imaginea.com>
|     Date: Thu Jan 23 17:27:17 2014 +0530
|
|         Fixed doc tests.
|
|     commit 27cb54bf5c99b1ea38a73858c291d0a1c43d8b7c
|     Author: Prashant Sharma <prashant.s@imaginea.com>
|     Date: Thu Jan 23 16:48:43 2014 +0530
|
|         Added keys and values methods for PairFunctions in Python.
|
|     commit 4ce76b396fbaefef2386d7a36d611572bdef9b5d
|     Author: Prashant Sharma <prashant.s@imaginea.com>
|     Date: Thu Jan 23 13:51:26 2014 +0530
|
|         Added foreachPartition.
|
|     commit 05f05341a187cba829ac0e6c2bdf30be49948c89
|     Author: Prashant Sharma <prashant.s@imaginea.com>
|     Date: Thu Jan 23 13:02:59 2014 +0530
|
|         Added coalesce function to Python API.
|
|     commit 6568d2c2fa14845dc56322c0f39ba2e13b3b26dd
|     Author: Prashant Sharma <prashant.s@imaginea.com>
|     Date: Thu Jan 23 12:52:44 2014 +0530
|
|         Added repartition function to Python API.
* Switch from MUTF8 to UTF8 in PySpark serializers. (Josh Rosen, 2014-01-28, 3 files, -9/+9)
|     This fixes SPARK-1043, a bug introduced in 0.9.0 where PySpark couldn't serialize strings > 64kB.
|
|     This fix was written by @tyro89 and @bouk in #512. This commit squashes and rebases their pull request in order to fix some merge conflicts.
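The 64 kB limit in the commit above is consistent with how Java's `DataOutputStream.writeUTF` frames strings: the modified-UTF-8 payload is preceded by a 2-byte unsigned length, so at most 65535 encoded bytes fit. The sketch below illustrates that limit and one way around it. It assumes a 4-byte length prefix followed by standard UTF-8 bytes; the helper names are hypothetical and are not taken from PySpark's serializers.

```python
import struct

def frame_short_prefix(s):
    """Frame a string behind a 2-byte unsigned length, like writeUTF's header.

    Plain UTF-8 is used here only to illustrate the size limit; real
    modified UTF-8 also encodes NUL and supplementary characters differently.
    """
    data = s.encode("utf-8")
    if len(data) > 0xFFFF:
        raise ValueError("payload of %d bytes exceeds 65535" % len(data))
    return struct.pack(">H", len(data)) + data

def frame_int_prefix(s):
    """Frame a string behind a 4-byte signed big-endian length (assumed layout)."""
    data = s.encode("utf-8")
    return struct.pack(">i", len(data)) + data

def read_int_prefix(buf):
    """Inverse of frame_int_prefix: read the length, then decode that many bytes."""
    (n,) = struct.unpack(">i", buf[:4])
    return buf[4:4 + n].decode("utf-8")
```

With these helpers, a 70000-character ASCII string is rejected by `frame_short_prefix` but round-trips through `frame_int_prefix` and `read_int_prefix`.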
* Merge pull request #504 from JoshRosen/SPARK-1025 (Reynold Xin, 2014-01-25, 1 file, -0/+11)
|\
| |     Fix PySpark hang when input files are deleted (SPARK-1025)
| |
| |     This pull request addresses [SPARK-1025](https://spark-project.atlassian.net/browse/SPARK-1025), an issue where PySpark could hang if its input files were deleted.
| * Fix for SPARK-1025: PySpark hang on missing files. (Josh Rosen, 2014-01-23, 1 file, -0/+11)
| |
* | Deprecate mapPartitionsWithSplit in PySpark. (Josh Rosen, 2014-01-23, 1 file, -4/+21)
|/
|     Also, replace the last reference to it in the docs.
|
|     This fixes SPARK-1026.
* Fix SPARK-978: ClassCastException in PySpark cartesian. (Josh Rosen, 2014-01-23, 1 file, -0/+9)
|
* Fix SPARK-1034: Py4JException on PySpark Cartesian Result (Josh Rosen, 2014-01-23, 1 file, -0/+7)
|
* Merge pull request #426 from mateiz/py-ml-tests (Patrick Wendell, 2014-01-18, 1 file, -0/+10)
|\
| |     Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+)
| |
| |     We disabled these earlier because Jenkins didn't have these versions.
| * Complain if Python and NumPy versions are too old for MLlib (Matei Zaharia, 2014-01-14, 1 file, -0/+10)
| |
* | Merge pull request #462 from mateiz/conf-file-fix (Patrick Wendell, 2014-01-18, 1 file, -6/+4)
|/
|     Remove Typesafe Config usage and conf files to fix nested property names
|
|     With Typesafe Config we had the subtle problem of no longer allowing nested property names, which are used for a few of our properties: http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html
|
|     This PR is for branch 0.9 but should be added into master too.
|     (cherry picked from commit 34e911ce9a9f91f3259189861779032069257852)
|
|     Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Log Python exceptions to stderr as well (Matei Zaharia, 2014-01-12, 1 file, -0/+4)
|     This helps in case the exception happened while serializing a record to be sent to Java, leaving the stream to Java in an inconsistent state where PythonRDD won't be able to read the error.
* Update some Python MLlib parameters to use camelCase, and tweak docs (Matei Zaharia, 2014-01-11, 2 files, -21/+21)
|     We've used camel case in other Spark methods so it felt reasonable to keep using it here and make the code match Scala/Java as much as possible. Note that parameter names matter in Python because it allows passing optional parameters by name.
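The point about parameter names can be shown with a small sketch. The `train` function and its parameters below are hypothetical stand-ins, not the actual MLlib signatures; the example only shows that a keyword name is part of a Python function's public interface, so renaming it changes what callers must write.

```python
def train(data, numIterations=100, stepSize=1.0):
    """Hypothetical trainer that only echoes the settings it received."""
    return {"n": len(data), "numIterations": numIterations, "stepSize": stepSize}

# Callers can skip numIterations and set stepSize by name, so the
# spelling of the parameter is visible in user code.
settings = train([1.0, 2.0], stepSize=0.5)
```

Calling `train([1.0], step_size=0.5)` would raise a `TypeError`, which is why a switch between snake_case and camelCase is an API change rather than a cosmetic one.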
* Add Naive Bayes to Python MLlib, and some API fixes (Matei Zaharia, 2014-01-11, 5 files, -23/+82)
|     - Added a Python wrapper for Naive Bayes
|     - Updated the Scala Naive Bayes to match the style of our other algorithms better and in particular make it easier to call from Java (added builder pattern, removed default value in train method)
|     - Updated Python MLlib functions to not require a SparkContext; we can get that from the RDD the user gives
|     - Added a toString method in LabeledPoint
|     - Made the Python MLlib tests run as part of run-tests as well (before they could only be run individually through each file)
* Merge branch 'master' into MatrixFactorizationModel-fix (Hossein Falaki, 2014-01-07, 3 files, -3/+3)
|\
| * Merge remote-tracking branch 'apache-github/master' into remove-binaries (Patrick Wendell, 2014-01-03, 2 files, -2/+2)
| |\
| | |     Conflicts:
| | |         core/src/test/scala/org/apache/spark/DriverSuite.scala
| | |         docs/python-programming-guide.md
| | * Merge pull request #317 from ScrapCodes/spark-915-segregate-scripts (Patrick Wendell, 2014-01-03, 2 files, -2/+2)
| | |\
| | | |     Spark-915 segregate scripts
| | | * sbin/spark-class* -> bin/spark-class* (Prashant Sharma, 2014-01-03, 1 file, -1/+1)
| | | |
| | | * pyspark -> bin/pyspark (Prashant Sharma, 2014-01-02, 1 file, -1/+1)
| | | |
| | | * Merge branch 'scripts-reorg' of github.com:shane-huang/incubator-spark into spark-915-segregate-scripts (Prashant Sharma, 2014-01-02, 1 file, -1/+1)
| | | |\
| | | | |     Conflicts:
| | | | |         bin/spark-shell
| | | | |         core/pom.xml
| | | | |         core/src/main/scala/org/apache/spark/SparkContext.scala
| | | | |         core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala
| | | | |         core/src/main/scala/org/apache/spark/ui/UIWorkloadGenerator.scala
| | | | |         core/src/test/scala/org/apache/spark/DriverSuite.scala
| | | | |         python/run-tests
| | | | |         sbin/compute-classpath.sh
| | | | |         sbin/spark-class
| | | | |         sbin/stop-slaves.sh
| | | | * Merge branch 'reorgscripts' into scripts-reorg (shane-huang, 2013-09-27, 1 file, -1/+1)
| | | | |\
| | | | | * fix paths and change spark to use APP_MEM as application driver memory instead of SPARK_MEM, user should add application jars to SPARK_CLASSPATH (shane-huang, 2013-09-26, 1 file, -1/+1)
| | | | | |     Signed-off-by: shane-huang <shengsheng.huang@intel.com>
| | | | | * added spark-class and spark-executor to sbin (shane-huang, 2013-09-23, 1 file, -1/+1)
| | | | | |     Signed-off-by: shane-huang <shengsheng.huang@intel.com>
| * | | | | Changes on top of Prashant's patch. (Patrick Wendell, 2014-01-03, 1 file, -1/+1)
| |/ / / /
| | | | |     Closes #316
* | | | | Added predictAll python function to MatrixFactorizationModel (Hossein Falaki, 2014-01-06, 1 file, -4/+6)
| | | | |
* | | | | Added Rating deserializer (Hossein Falaki, 2014-01-06, 1 file, -3/+18)
| | | | |
* | | | | Added python binding for bulk recommendation (Hossein Falaki, 2014-01-04, 2 files, -1/+19)
|/ / / /
* | | | | Merge pull request #311 from tmyklebu/master (Matei Zaharia, 2014-01-02, 1 file, -11/+55)
|\ \ \ \
| |/ / /
|/| | |
| | | |     SPARK-991: Report information gleaned from a Python stacktrace in the UI
| | | |
| | | |     Scala:
| | | |     - Added setCallSite/clearCallSite to SparkContext and JavaSparkContext. These functions mutate a LocalProperty called "externalCallSite."
| | | |     - Add a wrapper, getCallSite, that checks for an externalCallSite and, if none is found, calls the usual Utils.formatSparkCallSite.
| | | |     - Change everything that calls Utils.formatSparkCallSite to call getCallSite instead. Except getCallSite.
| | | |     - Add setCallSite/clearCallSite wrappers to JavaSparkContext.
| | | |
| | | |     Python:
| | | |     - Add a gruesome hack to rdd.py that inspects the traceback and guesses what you want to see in the UI.
| | | |     - Add a RAII wrapper around said gruesome hack that calls setCallSite/clearCallSite as appropriate.
| | | |     - Wire said RAII wrapper up around three calls into the Scala code.
| | | |
| | | |     I'm not sure that I hit all the spots with the RAII wrapper. I'm also not sure that my gruesome hack does exactly what we want.
| | | |
| | | |     One could also approach this change by refactoring runJob/submitJob/runApproximateJob to take a call site, then threading that parameter through everything that needs to know it.
| | | |
| | | |     One might object to the pointless-looking wrappers in JavaSparkContext. Unfortunately, I can't directly access the SparkContext from Python---or, if I can, I don't know how---so I need to wrap everything that matters in JavaSparkContext.
| | | |
| | | |     Conflicts:
| | | |         core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
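The "RAII wrapper" described above maps naturally onto a Python context manager. The following is a minimal sketch of that idea, not the code in rdd.py: `FakeContext` stands in for the JVM-side context, `CallSiteScope` is a hypothetical name, and the call-site string format is an assumption.

```python
import traceback

class FakeContext(object):
    """Stand-in for the JVM context; it only records the call site it is given."""
    def __init__(self):
        self.call_site = None

    def setCallSite(self, site):
        self.call_site = site

    def clearCallSite(self):
        self.call_site = None

class CallSiteScope(object):
    """Set a call site on entry and always clear it on exit."""
    def __init__(self, ctx):
        self.ctx = ctx

    def __enter__(self):
        # The last stack entry is this __enter__; the one before it is the
        # function containing the `with` statement.
        frame = traceback.extract_stack()[-2]
        filename, lineno, name = frame[0], frame[1], frame[2]
        self.ctx.setCallSite("%s at %s:%d" % (name, filename, lineno))
        return self

    def __exit__(self, exc_type, exc_value, tb):
        self.ctx.clearCallSite()
        return False  # do not swallow exceptions

def run_job(ctx):
    """Hypothetical job entry point wrapped in the scope."""
    with CallSiteScope(ctx):
        return ctx.call_site
```

Inside `run_job` the recorded site begins with `run_job at`, and it is cleared again once the `with` block exits, even if the body raises.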
| * | | Make Python function/line appear in the UI. (Tor Myklebust, 2013-12-28, 1 file, -11/+55)
| | | |
* | | | Fix Python code after change of getOrElse (Matei Zaharia, 2014-01-01, 2 files, -7/+14)
| | | |
* | | | Miscellaneous fixes from code review. (Matei Zaharia, 2014-01-01, 1 file, -8/+4)
| | | |     Also replaced SparkConf.getOrElse with just a "get" that takes a default value, and added getInt, getLong, etc to make code that uses this simpler later on.
* | | | Merge remote-tracking branch 'apache/master' into conf2 (Matei Zaharia, 2013-12-31, 2 files, -9/+4)
|\ \ \ \
| | | | |     Conflicts:
| | | | |         core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
| | | | |         streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
| | | | |         streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
| * \ \ \ Merge pull request #289 from tdas/filestream-fix (Patrick Wendell, 2013-12-31, 2 files, -9/+4)
| |\ \ \ \
| | |/ / /
| |/| | |
| | | | |     Bug fixes for file input stream and checkpointing
| | | | |
| | | | |     - Fixed bugs in the file input stream that led the stream to fail due to transient HDFS errors (listing files while a background thread is deleting them caused errors, etc.)
| | | | |     - Updated Spark's CheckpointRDD and Streaming's CheckpointWriter to use SparkContext.hadoopConfiguration, to allow checkpoints to be written to any HDFS compatible store requiring special configuration.
| | | | |     - Changed the API of SparkContext.setCheckpointDir() - eliminated the unnecessary 'useExisting' parameter. Now SparkContext will always create a unique subdirectory within the user specified checkpoint directory. This is to ensure that previous checkpoint files are not accidentally overwritten.
| | | | |     - Fixed bug where setting checkpoint directory as a relative local path caused the checkpointing to fail.
| | * | | Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289. (Tathagata Das, 2013-12-24, 2 files, -9/+4)
| | | | |
* | | | | Updated docs for SparkConf and handled review comments (Matei Zaharia, 2013-12-30, 2 files, -17/+31)
| | | | |
* | | | | Properly show Spark properties on web UI, and change app name property (Matei Zaharia, 2013-12-29, 2 files, -3/+3)
| | | | |
* | | | | Fix some Python docs and make sure to unset SPARK_TESTING in Python tests so we don't get the test spark.conf on the classpath. (Matei Zaharia, 2013-12-29, 4 files, -20/+35)
| | | | |
* | | | | Merge remote-tracking branch 'origin/master' into conf2 (Matei Zaharia, 2013-12-29, 9 files, -2/+599)
|\| | | |
| | | | |     Conflicts:
| | | | |         core/src/main/scala/org/apache/spark/SparkContext.scala
| | | | |         core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
| | | | |         core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
| | | | |         core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
| | | | |         core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala
| | | | |         core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala
| | | | |         core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala
| | | | |         core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
| | | | |         new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
| | | | |         streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
| | | | |         streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
| | | | |         streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
| | | | |         streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala
| | | | |         streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
| | | | |         streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
| | | | |         streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
| | | | |         streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala
| * | | | Merge pull request #283 from tmyklebu/master (Matei Zaharia, 2013-12-26, 8 files, -1/+598)
| |\ \ \ \
| | | | | |     Python bindings for mllib
| | | | | |
| | | | | |     This pull request contains Python bindings for the regression, clustering, classification, and recommendation tools in mllib.
| | | | | |
| | | | | |     For each 'train' frontend exposed, there is a Scala stub in PythonMLLibAPI.scala and a Python stub in mllib.py. The Python stub serialises the input RDD and any vector/matrix arguments into a mutually-understood format and calls the Scala stub. The Scala stub deserialises the RDD and the vector/matrix arguments, calls the appropriate 'train' function, serialises the resulting model, and returns the serialised model.
| | | | | |
| | | | | |     ALSModel is slightly different since a MatrixFactorizationModel has RDDs inside. The Scala stub returns a handle to a Scala MatrixFactorizationModel; prediction is done by calling the Scala predict method.
| | | | | |
| | | | | |     I have tested these bindings on an x86_64 machine running Linux. There is a risk that these bindings may fail on some choose-your-own-endian platform if Python's endian differs from java.nio.ByteBuffer's idea of the native byte order.
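The endianness risk mentioned above arises when both sides rely on their own notion of native byte order. One way to sidestep it is to pin the byte order in the wire format. The sketch below assumes a layout of a 4-byte big-endian length followed by big-endian doubles; that layout and the function names are illustrative assumptions, not the format these bindings use.

```python
import struct

def serialize_vector(values):
    """Pack floats as a 4-byte big-endian count followed by big-endian doubles."""
    header = struct.pack(">i", len(values))
    body = struct.pack(">%dd" % len(values), *values)
    return header + body

def deserialize_vector(buf):
    """Inverse of serialize_vector; the explicit '>' ignores the host byte order."""
    (n,) = struct.unpack(">i", buf[:4])
    return list(struct.unpack(">%dd" % n, buf[4:4 + 8 * n]))
```

Because the `>` prefix fixes the byte order, a reader on the JVM side could use a `ByteBuffer` in its default big-endian order and see the same values on any platform.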
| | * | | | Remove commented code in __init__.py. (Tor Myklebust, 2013-12-25, 1 file, -8/+0)
| | | | | |
| | * | | | Fix copypasta in __init__.py. Don't import anything directly into pyspark.mllib. (Tor Myklebust, 2013-12-25, 1 file, -26/+8)
| | | | | |
| | * | | | Initial weights in Scala are ones; do that too. Also fix some errors. (Tor Myklebust, 2013-12-25, 1 file, -6/+6)
| | | | | |
| | * | | | Split the mllib bindings into a whole bunch of modules and rename some things. (Tor Myklebust, 2013-12-25, 7 files, -183/+409)
| | | | | |
| | * | | | Remove useless line from test stub. (Tor Myklebust, 2013-12-24, 1 file, -1/+0)
| | | | | |
| | * | | | Python change for move of PythonMLLibAPI. (Tor Myklebust, 2013-12-24, 1 file, -1/+1)
| | | | | |
| | * | | | Release JVM reference to the ALSModel when done. (Tor Myklebust, 2013-12-22, 1 file, -2/+2)
| | | | | |
| | * | | | Python stubs for ALSModel. (Tor Myklebust, 2013-12-21, 2 files, -8/+56)
| | | | | |
| | * | | | Un-semicolon mllib.py. (Tor Myklebust, 2013-12-20, 1 file, -11/+11)
| | | | | |
| | * | | | Change some docstrings and add some others. (Tor Myklebust, 2013-12-20, 1 file, -1/+3)
| | | | | |
| | * | | | Licence notice. (Tor Myklebust, 2013-12-20, 1 file, -0/+17)
| | | | | |