path: root/python
Commit message | Author | Age | Files | Lines
* Set spark.executor.uri from environment variable (needed by Mesos) | Ivan Wick | 2014-04-10 | 1 | -0/+3
| | | | | | | | | | | | | The Mesos backend uses this property when setting up a slave process. It is similarly set in the Scala repl (org.apache.spark.repl.SparkILoop), but I couldn't find any analogue for pyspark. Author: Ivan Wick <ivanwick+github@gmail.com> This patch had conflicts when merged, resolved by Committer: Matei Zaharia <matei@databricks.com> Closes #311 from ivanwick/master and squashes the following commits: da0c3e4 [Ivan Wick] Set spark.executor.uri from environment variable (needed by Mesos)
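A minimal sketch of the kind of wiring described above, assuming the URI arrives in an environment variable named SPARK_EXECUTOR_URI (the variable name and master fallback here are illustrative, not taken from the patch):

    import os

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("mesos-example").setMaster(
        os.environ.get("MASTER", "local[2]"))

    # Forward the executor distribution URI to Spark if the deployment
    # (e.g. Mesos) provides one via the environment.
    if "SPARK_EXECUTOR_URI" in os.environ:
        conf.set("spark.executor.uri", os.environ["SPARK_EXECUTOR_URI"])

    sc = SparkContext(conf=conf)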
* SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 instead of complaining | Sandeep | 2014-04-10 | 1 | -4/+14
| | | | | | | | | | Author: Sandeep <sandeep@techaddict.me> Closes #356 from techaddict/1428 and squashes the following commits: 3bdf5f6 [Sandeep] SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 instead of complaining
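The behaviour the fix describes amounts to coercing any numeric NumPy array to float64 before it is handed off; a small stand-in (the helper name is made up, not the function the patch touches):

    import numpy as np

    def _to_float64(arr):
        # Accept int32, float32, etc. and silently upcast instead of raising.
        arr = np.asarray(arr)
        if arr.dtype != np.float64:
            arr = arr.astype(np.float64)
        return arr

    print(_to_float64(np.array([1, 2, 3], dtype=np.int32)).dtype)  # float64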
* Spark 1271: Co-Group and Group-By should pass Iterable[X] | Holden Karau | 2014-04-08 | 3 | -7/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Author: Holden Karau <holden@pigscanfly.ca> Closes #242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits: f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator 77048f8 [Holden Karau] Fix merge up to master d3fe909 [Holden Karau] use toSeq instead 7a092a3 [Holden Karau] switch resultitr to resultiterable eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables c5075aa [Holden Karau] If guava 14 had iterables 2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API 11e730c [Holden Karau] Fix streaming tests 66b583d [Holden Karau] Fix the core test suite to compile 4ed579b [Holden Karau] Refactor from iterator to iterable d052c07 [Holden Karau] Python tests now pass with iterator pandas 3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work" cd1e81c [Holden Karau] Try and make pickling list iterators work c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well 88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming a5ee714 [Holden Karau] oops, was checking wrong iterator e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming ec8cc3e [Holden Karau] Fix test issues\! 4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD" ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas" b692868 [Holden Karau] Revert 7e533f7 [Holden Karau] Fix the bug 8a5153a [Holden Karau] Revert me, but we have some stuff to debug b4e86a9 [Holden Karau] Add a join based on the problem in SVD c4510e2 [Holden Karau] Revert this but for now put things in list pandas b4e0b1d [Holden Karau] Fix style issues 71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness. b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work 37888ec [Holden Karau] core/tests now pass 249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes 6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy" fe992fe [Holden Karau] hmmm try and fix up basic operation suite 172705c [Holden Karau] Fix Java API suite caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy 88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator 4991af6 [Holden Karau] Fix some tests be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after 687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
* SPARK-1099: Introduce local[*] mode to infer number of cores | Aaron Davidson | 2014-04-07 | 1 | -1/+1
| | | | | | | | | | | This is the default mode for running spark-shell and pyspark, intended to allow users running spark for the first time to see the performance benefits of using multiple cores, while not breaking backwards compatibility for users who use "local" mode and expect exactly 1 core. Author: Aaron Davidson <aaron@databricks.com> Closes #182 from aarondav/110 and squashes the following commits: a88294c [Aaron Davidson] Rebased changes for new spark-shell a9f393e [Aaron Davidson] SPARK-1099: Introduce local[*] mode to infer number of cores
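In PySpark terms the difference is just the master string; a quick illustration (the cpu_count comparison is only there to show what "infer the number of cores" means):

    import multiprocessing

    from pyspark import SparkContext

    # "local[*]" uses every core on the machine; plain "local" keeps the
    # old behaviour of exactly one core.
    sc = SparkContext("local[*]", "infer-cores-example")
    print("machine cores:     ", multiprocessing.cpu_count())
    print("defaultParallelism:", sc.defaultParallelism)
    sc.stop()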
* SPARK-1421. Make MLlib work on Python 2.6 | Matei Zaharia | 2014-04-05 | 2 | -6/+11
| | | | | | | | | | | The reason it wasn't working was passing a bytearray to stream.write(), which is not supported in Python 2.6 but is in 2.7. (This array came from NumPy when we converted data to send it over to Java). Now we just convert those bytearrays to strings of bytes, which preserves nonprintable characters as well. Author: Matei Zaharia <matei@databricks.com> Closes #335 from mateiz/mllib-python-2.6 and squashes the following commits: f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7 a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6
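The workaround is simply to turn the NumPy-produced bytearray into a plain string of bytes before calling stream.write(); a schematic version (write_doubles is a hypothetical helper, not the patched code):

    import io

    import numpy as np

    def write_doubles(values, stream):
        # Python 2.6's file.write() rejects bytearray, so convert the NumPy
        # buffer to a plain byte string first; non-printable bytes survive.
        buf = bytearray(np.asarray(values, dtype=np.float64).tobytes())
        stream.write(bytes(buf))

    out = io.BytesIO()
    write_doubles([1.0, 2.0, 3.0], out)
    print(len(out.getvalue()))  # 24 bytes: three float64 values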
* SPARK-1305: Support persisting RDD's directly to Tachyon | Haoyuan Li | 2014-04-04 | 3 | -16/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the PR#468 of apache-incubator-spark to the apache-spark "Adding an option to persist Spark RDD blocks into Tachyon." Author: Haoyuan Li <haoyuan@cs.berkeley.edu> Author: RongGu <gurongwalker@gmail.com> Closes #158 from RongGu/master and squashes the following commits: 72b7768 [Haoyuan Li] merge master 9f7fa1b [Haoyuan Li] fix code style ae7834b [Haoyuan Li] minor cleanup a8b3ec6 [Haoyuan Li] merge master branch e0f4891 [Haoyuan Li] better check offheap. 55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel 7cd4600 [RongGu] remove some logic code for tachyonstore's replication 51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore 8adfcfa [RongGu] address arron's comment on inTachyonSize 120e48a [RongGu] changed the root-level dir name in Tachyon 5cc041c [Haoyuan Li] address aaron's comments 9b97935 [Haoyuan Li] address aaron's comments d9a6438 [Haoyuan Li] fix for pspark 77d2703 [Haoyuan Li] change python api.git status 3dcace4 [Haoyuan Li] address matei's comments 91fa09d [Haoyuan Li] address patrick's comments 589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE 64348b2 [Haoyuan Li] update conf docs. ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1 619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler 49cc724 [Haoyuan Li] update docs with off_headp option 4572f9f [RongGu] reserving the old apply function API of StorageLevel 04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP 76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix 939e467 [Haoyuan Li] 0.4.1-thrift from maven central 86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1 16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1 bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem 6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1 d827250 [RongGu] fix JsonProtocolSuie test failure 716e93b [Haoyuan Li] revert the version ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift 2825a13 [RongGu] up-merging to the current master branch of the apache spark 6a22c1a [Haoyuan Li] fix scalastyle 8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client. 77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice. 1dcadf9 [Haoyuan Li] typo bf278fa [Haoyuan Li] fix python tests e82909c [Haoyuan Li] minor cleanup 776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR 8859371 [Haoyuan Li] various minor fixes and clean up e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode. 
fcaeab2 [Haoyuan Li] address Aaron's comment e554b1e [Haoyuan Li] add python code 47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels. dc8ef24 [Haoyuan Li] add old storelevel constructor e01a271 [Haoyuan Li] update tachyon 0.4.1 8011a96 [RongGu] fix a brought-in mistake in StorageLevel 70ca182 [RongGu] a bit change in comment 556978b [RongGu] fix the scalastyle errors 791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
* SPARK-1414. Python API for SparkContext.wholeTextFiles | Matei Zaharia | 2014-04-04 | 2 | -3/+43
| | | | | | | | | | Also clarified comment on each file having to fit in memory Author: Matei Zaharia <matei@databricks.com> Closes #327 from mateiz/py-whole-files and squashes the following commits: 9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
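Typical use of the new API: each record is a (filename, contents) pair rather than a line, hence the note that each file has to fit in memory (the directory path below is just a placeholder):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "whole-text-files-example")

    # Unlike textFile(), each element is (path, full file contents).
    pairs = sc.wholeTextFiles("/tmp/some-directory")
    for path, contents in pairs.take(2):
        print(path, len(contents))
    sc.stop()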
* Spark 1162 Implemented takeOrdered in pyspark. | Prashant Sharma | 2014-04-03 | 1 | -5/+102
| | | | | | | | | | | | | | | Since python does not have a library for max heap and usual tricks like inverting values etc.. does not work for all cases. We have our own implementation of max heap. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #97 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered2 and squashes the following commits: 35f86ba [Prashant Sharma] code review 2b1124d [Prashant Sharma] fixed tests e8a08e2 [Prashant Sharma] Code review comments. 49e6ba7 [Prashant Sharma] SPARK-1162 added takeOrdered to pyspark
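The heap trick the commit alludes to can be approximated with heapq, which only provides a min-heap, hence the need for a bounded "keep the N best" pass; this sketch is a simplified stand-in for the pyspark implementation, not a copy of it:

    import heapq

    def take_ordered(iterable, num):
        # Smallest `num` elements in ascending order, without sorting everything.
        return heapq.nsmallest(num, iterable)

    print(take_ordered([10, 1, 2, 9, 3, 4, 5, 6, 7], 6))  # [1, 2, 3, 4, 5, 6]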
* [SPARK-1212, Part II] Support sparse data in MLlib | Xiangrui Meng | 2014-04-02 | 1 | -5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In PR https://github.com/apache/spark/pull/117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes: 1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`. 2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure. 3. Mark 'createModel' and 'predictPoint' protected because they are not for end users. 4. Add libSVMFile to MLContext. 5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`). 6. Gradient computation no longer creates temp vectors. 7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training. TODO: 1. ~~Use axpy when possible.~~ 2. ~~Optimize Naive Bayes.~~ Author: Xiangrui Meng <meng@databricks.com> Closes #245 from mengxr/vector and squashes the following commits: eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector] 11999c7 [Xiangrui Meng] Merge branch 'master' into vector f7da54b [Xiangrui Meng] add minSplits to libSVMFile da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning 493f26f [Xiangrui Meng] Merge branch 'master' into vector 7c1bc01 [Xiangrui Meng] add a TODO to NB b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM 4addc50 [Xiangrui Meng] merge master 4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests d088552 [Xiangrui Meng] use static constructor for MLContext 6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically 3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data 0f8759b [Xiangrui Meng] minor updates to NB b11659c [Xiangrui Meng] style update 78c4671 [Xiangrui Meng] add libSVMFile to MLContext f0fe616 [Xiangrui Meng] add a test for sparse linear regression 44733e1 [Xiangrui Meng] use in-place gradient computation e981396 [Xiangrui Meng] use axpy in Updater db808a1 [Xiangrui Meng] update JavaLR example befa592 [Xiangrui Meng] passed scala/java tests 75c83a4 [Xiangrui Meng] passed test compile 1859701 [Xiangrui Meng] passed compile 834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.) 135ab72 [Xiangrui Meng] merge glm 0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used 3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
* SPARK-1336 Reducing the output of run-tests script. | Prashant Sharma | 2014-03-29 | 1 | -7/+12
| | | | | | | | | | Author: Prashant Sharma <prashant.s@imaginea.com> Author: Prashant Sharma <scrapcodes@gmail.com> Closes #262 from ScrapCodes/SPARK-1336/ReduceVerbosity and squashes the following commits: 87dfa54 [Prashant Sharma] Further reduction in noise and made pyspark tests to fail fast. 811170f [Prashant Sharma] Reducing the ouput of run-tests script.
* SPARK-1322, top in pyspark should sort result in descending order. | Prashant Sharma | 2014-03-26 | 1 | -3/+3
| | | | | | | | | Author: Prashant Sharma <prashant.s@imaginea.com> Closes #235 from ScrapCodes/SPARK-1322/top-rev-sort and squashes the following commits: f316266 [Prashant Sharma] Minor change in comment. 58e58c6 [Prashant Sharma] SPARK-1322, top in pyspark should sort result in descending order.
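A quick usage sketch of what the corrected ordering looks like (expected output in the comment):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "top-example")
    # top() returns the largest elements in descending order
    print(sc.parallelize([2, 3, 4, 5, 6]).top(3))  # [6, 5, 4]
    sc.stop()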
* Added doctest for map function in rdd.py | Jyotiska NK | 2014-03-19 | 1 | -0/+4
| | | | | | | | | | Doctest added for map in rdd.py Author: Jyotiska NK <jyotiska123@gmail.com> Closes #177 from jyotiska/pyspark_rdd_map_doctest and squashes the following commits: a38527f [Jyotiska NK] Added doctest for map function in rdd.py
* Spark 1246 add min max to stat counter | Dan McClary | 2014-03-18 | 2 | -3/+41
| | | | | | | | | | | | | | | | | | | | | Here's the addition of min and max to statscounter.py and min and max methods to rdd.py. Author: Dan McClary <dan.mcclary@gmail.com> Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits: fd3fd4b [Dan McClary] fixed error, updated test 82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter 5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark 21dd366 [Dan McClary] added max and min to StatCounter output, updated doc 1a97558 [Dan McClary] added max and min to StatCounter output, updated doc a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py 1e7056d [Dan McClary] added underscore to getBucket 37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived 29981f2 [Dan McClary] fixed indentation on doctest comment eaf89d9 [Dan McClary] added correct doctest for histogram 4916016 [Dan McClary] added histogram method, added max and min to statscounter
* SPARK-1240: handle the case of empty RDD when takeSample | CodingCat | 2014-03-16 | 1 | -0/+4
| | | | | | | | | | | | | | | | | | | | https://spark-project.atlassian.net/browse/SPARK-1240 It seems that the current implementation does not handle the empty RDD case when running takeSample. In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty Array when it's an empty RDD; also in sample(), I add a check for an invalid fraction value. In the test case, I also add several lines for this case. Author: CodingCat <zhunansjtu@gmail.com> Closes #135 from CodingCat/SPARK-1240 and squashes the following commits: fef57d4 [CodingCat] fix the same problem in PySpark 36db06b [CodingCat] create new test cases for takeSample from an empty RDD 810948d [CodingCat] further fix a40e8fb [CodingCat] replace if with require ad483fd [CodingCat] handle the case with empty RDD when take sample
* SPARK-1162 Added top in python. | Prashant Sharma | 2014-03-12 | 1 | -0/+25
| | | | | | | | Author: Prashant Sharma <prashant.s@imaginea.com> Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits: ece1fa4 [Prashant Sharma] Added top in python.
* Spark-1163, Added missing Python RDD functions | prabinb | 2014-03-11 | 2 | -0/+46
| | | | | | | | Author: prabinb <prabin.banka@imaginea.com> Closes #92 from prabinb/python-api-rdd and squashes the following commits: 51129ca [prabinb] Added missing Python RDD functions Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
* SPARK-1168, Added foldByKey to pyspark. | Prashant Sharma | 2014-03-10 | 1 | -0/+14
| | | | | | | | Author: Prashant Sharma <prashant.s@imaginea.com> Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits: db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
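Usage of the new foldByKey, in the same doctest style rdd.py uses elsewhere (expected output in the comment):

    from operator import add

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "fold-by-key-example")
    rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

    # foldByKey takes a zero value plus an associative combine function
    print(sorted(rdd.foldByKey(0, add).collect()))  # [('a', 2), ('b', 1)]
    sc.stop()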
* [SPARK-972] Added detailed callsite info for ValueError in context.py (resubmitted) | jyotiska | 2014-03-10 | 2 | -8/+29
| | | | | | | | | | | Author: jyotiska <jyotiska123@gmail.com> Closes #34 from jyotiska/pyspark_code and squashes the following commits: c9439be [jyotiska] replaced dict with namedtuple a6bf4cd [jyotiska] added callsite info for context.py
* SPARK-977 Added Python RDD.zip function | Prabin Banka | 2014-03-10 | 2 | -2/+47
| | | | | | | | | | was raised earlier as a part of apache/incubator-spark#486 Author: Prabin Banka <prabin.banka@imaginea.com> Closes #76 from prabinb/python-api-zip and squashes the following commits: b1a31a0 [Prabin Banka] Added Python RDD.zip function
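What RDD.zip gives you: the i-th element of one RDD paired with the i-th of the other. A short usage sketch:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "zip-example")
    x = sc.parallelize(range(0, 5))
    y = sc.parallelize(range(1000, 1005))

    # Pairs up elements positionally across the two RDDs
    print(x.zip(y).collect())
    # [(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
    sc.stop()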
* SPARK-929: Fully deprecate usage of SPARK_MEM | Aaron Davidson | 2014-03-09 | 1 | -1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | (Continued from old repo, prior discussion at https://github.com/apache/incubator-spark/pull/615) This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables: SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public. SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the "spark.executor.memory" property. SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory run by jobs launched by spark-class, without possibly affecting executor memory. Other memory considerations: - The repl's memory can be set through the "--drivermem" command-line option, which really just sets SPARK_DRIVER_MEMORY. - run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overriden in all cases by spark-class). This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however. Author: Aaron Davidson <aaron@databricks.com> Closes #99 from aarondav/sparkmem and squashes the following commits: 9df4c68 [Aaron Davidson] SPARK-929: Fully deprecate usage of SPARK_MEM
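On the PySpark side, the "proper way of configuring executor memory" mentioned above is the spark.executor.memory property; a minimal sketch, with the app name and value chosen purely for illustration:

    from pyspark import SparkConf, SparkContext

    # Prefer the spark.executor.memory property over the deprecated SPARK_MEM
    # environment variable; driver memory is configured separately (e.g. via
    # SPARK_DRIVER_MEMORY when launching through spark-class).
    conf = (SparkConf()
            .setMaster("local[2]")
            .setAppName("executor-memory-example")
            .set("spark.executor.memory", "2g"))
    print(conf.get("spark.executor.memory"))  # 2g
    sc = SparkContext(conf=conf)
    sc.stop()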
* Spark 1165 rdd.intersection in python and java | Prashant Sharma | 2014-03-07 | 1 | -0/+17
| | | | | | | | | | | | Author: Prashant Sharma <prashant.s@imaginea.com> Author: Prashant Sharma <scrapcodes@gmail.com> Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits: 9b015e9 [Prashant Sharma] Added a note, shuffle is required for intersection. 1fea813 [Prashant Sharma] correct the lines wrapping d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.
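Usage of the new intersection, keeping in mind the commit's own note that it requires a shuffle:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "intersection-example")
    rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
    rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])

    # A shuffle happens under the hood, as noted in the commit
    print(sorted(rdd1.intersection(rdd2).collect()))  # [1, 2, 3]
    sc.stop()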
* SPARK-1187, Added missing Python APIs | Prabin Banka | 2014-03-06 | 2 | -0/+38
| | | | | | | | | | | | | | | | | The following Python APIs are added, RDD.id() SparkContext.setJobGroup() SparkContext.setLocalProperty() SparkContext.getLocalProperty() SparkContext.sparkUser() was raised earlier as a part of apache/incubator-spark#486 Author: Prabin Banka <prabin.banka@imaginea.com> Closes #75 from prabinb/python-api-backup and squashes the following commits: cc3c6cd [Prabin Banka] Added missing Python APIs
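A short tour of the APIs listed above; the job-group labels and the local property key are arbitrary example values, not anything from the patch:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "missing-apis-example")
    rdd = sc.parallelize([1, 2, 3])

    print(rdd.id())                               # unique id of this RDD
    sc.setJobGroup("nightly", "nightly load jobs")
    sc.setLocalProperty("my.example.key", "42")
    print(sc.getLocalProperty("my.example.key"))  # '42'
    print(sc.sparkUser())                         # user running the context
    sc.stop()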
* SPARK-1109 wrong API docs for pyspark map function | Prashant Sharma | 2014-03-04 | 1 | -1/+1
| | | | | | | | Author: Prashant Sharma <prashant.s@imaginea.com> Closes #73 from ScrapCodes/SPARK-1109/wrong-API-docs and squashes the following commits: 1a55b58 [Prashant Sharma] SPARK-1109 wrong API docs for pyspark map function
* SPARK-1115: Catch depickling errors | Bouke van der Bijl | 2014-02-26 | 1 | -24/+24
| | | | | | | | | | | | | This surrounds the complete worker code in a try/except block so we catch any error that arrives. An example would be the depickling failing for some reason. @JoshRosen Author: Bouke van der Bijl <boukevanderbijl@gmail.com> Closes #644 from bouk/catch-depickling-errors and squashes the following commits: f0f67cc [Bouke van der Bijl] Lol indentation 0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
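The shape of the fix, reduced to a toy: one try/except around the complete per-task loop, so a failure while depickling (or anything else) is caught and reported instead of leaving the worker in an undefined state. This is a schematic stand-in, not the worker.py code.

    import sys
    import traceback

    def run_worker(process_task, tasks):
        # A single try/except surrounds the whole body so *any* error,
        # including a depickling failure, is caught and surfaced.
        try:
            for task in tasks:
                process_task(task)
        except Exception:
            traceback.print_exc(file=sys.stderr)
            sys.exit(-1)

    run_worker(print, [1, 2, 3])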
* doctest updated for mapValues, flatMapValues in rdd.py | jyotiska | 2014-02-22 | 1 | -0/+10
| | | | | | | | | | Updated doctests for mapValues and flatMapValues in rdd.py Author: jyotiska <jyotiska123@gmail.com> Closes #621 from jyotiska/python_spark and squashes the following commits: 716f7cd [jyotiska] doctest updated for mapValues, flatMapValues in rdd.py
* Fixed minor typo in worker.py | jyotiska | 2014-02-22 | 1 | -1/+1
| | | | | | | | | | Fixed minor typo in worker.py Author: jyotiska <jyotiska123@gmail.com> Closes #630 from jyotiska/pyspark_code and squashes the following commits: ee44201 [jyotiska] typo fixed in worker.py
* SPARK-1114: Allow PySpark to use existing JVM and Gateway | Ahir Reddy | 2014-02-20 | 2 | -10/+22
| | | | | | | | | | Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization. Author: Ahir Reddy <ahirreddy@gmail.com> Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits: a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
* Merge pull request #562 from jyotiska/master. Closes #562. | jyotiska | 2014-02-08 | 1 | -0/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Added example Python code for sort I added an example Python code for sort. Right now, PySpark has limited examples for new people willing to use the project. This example code sorts integers stored in a file. I was able to sort 5 million, 10 million and 25 million integers with this code. Author: jyotiska <jyotiska123@gmail.com> == Merge branch commits == commit 8ad8faf6c8e02ae1cd68565d98524edf165f54df Author: jyotiska <jyotiska123@gmail.com> Date: Sun Feb 9 11:00:41 2014 +0530 Added comments in code on collect() method commit 6f98f1e313f4472a7c2207d36c4f0fbcebc95a8c Author: jyotiska <jyotiska123@gmail.com> Date: Sat Feb 8 13:12:37 2014 +0530 Updated python example code sort.py commit 945e39a5d68daa7e5bab0d96cbd35d7c4b04eafb Author: jyotiska <jyotiska123@gmail.com> Date: Sat Feb 8 12:59:09 2014 +0530 Added example python code for sort
* Merge pull request #542 from markhamstra/versionBump. Closes #542. | Mark Hamstra | 2014-02-08 | 1 | -1/+1
| | | | | | | | | | | | | | | | | | Version number to 1.0.0-SNAPSHOT Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore. @pwendell Author: Mark Hamstra <markhamstra@gmail.com> == Merge branch commits == commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71 Author: Mark Hamstra <markhamstra@gmail.com> Date: Wed Feb 5 09:30:32 2014 -0800 Version number to 1.0.0-SNAPSHOT
* Merge pull request #498 from ScrapCodes/python-api. Closes #498. | Prashant Sharma | 2014-02-06 | 1 | -0/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Python api additions Author: Prashant Sharma <prashant.s@imaginea.com> == Merge branch commits == commit 8b51591f1a7a79a62c13ee66ff8d83040f7eccd8 Author: Prashant Sharma <prashant.s@imaginea.com> Date: Fri Jan 24 11:50:29 2014 +0530 Josh's and Patricks review comments. commit d37f9677838e43bef6c18ef61fbf08055ba6d1ca Author: Prashant Sharma <prashant.s@imaginea.com> Date: Thu Jan 23 17:27:17 2014 +0530 fixed doc tests commit 27cb54bf5c99b1ea38a73858c291d0a1c43d8b7c Author: Prashant Sharma <prashant.s@imaginea.com> Date: Thu Jan 23 16:48:43 2014 +0530 Added keys and values methods for PairFunctions in python commit 4ce76b396fbaefef2386d7a36d611572bdef9b5d Author: Prashant Sharma <prashant.s@imaginea.com> Date: Thu Jan 23 13:51:26 2014 +0530 Added foreachPartition commit 05f05341a187cba829ac0e6c2bdf30be49948c89 Author: Prashant Sharma <prashant.s@imaginea.com> Date: Thu Jan 23 13:02:59 2014 +0530 Added coalesce fucntion to python API commit 6568d2c2fa14845dc56322c0f39ba2e13b3b26dd Author: Prashant Sharma <prashant.s@imaginea.com> Date: Thu Jan 23 12:52:44 2014 +0530 added repartition function to python API.
* Switch from MUTF8 to UTF8 in PySpark serializers. | Josh Rosen | 2014-01-28 | 3 | -9/+9
| | | | | | | | | This fixes SPARK-1043, a bug introduced in 0.9.0 where PySpark couldn't serialize strings > 64kB. This fix was written by @tyro89 and @bouk in #512. This commit squashes and rebases their pull request in order to fix some merge conflicts.
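Plain UTF-8 framing (a length prefix followed by the encoded bytes) has no 64 kB ceiling, unlike Java's modified-UTF-8 writeUTF format; a hedged sketch of such a reader/writer, not the serializers.py code itself:

    import io
    import struct

    def write_utf8(s, stream):
        data = s.encode("utf-8")
        stream.write(struct.pack(">i", len(data)))  # 4-byte big-endian length
        stream.write(data)

    def read_utf8(stream):
        (length,) = struct.unpack(">i", stream.read(4))
        return stream.read(length).decode("utf-8")

    buf = io.BytesIO()
    write_utf8("x" * 100000, buf)  # well past 64 kB, no problem
    buf.seek(0)
    print(len(read_utf8(buf)))     # 100000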
* Merge pull request #504 from JoshRosen/SPARK-1025 | Reynold Xin | 2014-01-25 | 1 | -0/+11
|\ | | | | | | | | | | Fix PySpark hang when input files are deleted (SPARK-1025) This pull request addresses [SPARK-1025](https://spark-project.atlassian.net/browse/SPARK-1025), an issue where PySpark could hang if its input files were deleted.
| * Fix for SPARK-1025: PySpark hang on missing files. | Josh Rosen | 2014-01-23 | 1 | -0/+11
| |
* | Deprecate mapPartitionsWithSplit in PySpark. | Josh Rosen | 2014-01-23 | 1 | -4/+21
|/ | | | | | Also, replace the last reference to it in the docs. This fixes SPARK-1026.
* Fix SPARK-978: ClassCastException in PySpark cartesian. | Josh Rosen | 2014-01-23 | 1 | -0/+9
|
* Fix SPARK-1034: Py4JException on PySpark Cartesian Result | Josh Rosen | 2014-01-23 | 1 | -0/+7
|
* Merge pull request #426 from mateiz/py-ml-tests | Patrick Wendell | 2014-01-18 | 2 | -5/+15
|\ | | | | | | | | | | Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+) We disabled these earlier because Jenkins didn't have these versions.
| * Complain if Python and NumPy versions are too old for MLlib | Matei Zaharia | 2014-01-14 | 1 | -0/+10
| |
| * Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+) | Matei Zaharia | 2014-01-14 | 1 | -5/+5
| |
* | Merge pull request #462 from mateiz/conf-file-fix | Patrick Wendell | 2014-01-18 | 1 | -6/+4
|/ | | | | | | | | | | | | Remove Typesafe Config usage and conf files to fix nested property names With Typesafe Config we had the subtle problem of no longer allowing nested property names, which are used for a few of our properties: http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html This PR is for branch 0.9 but should be added into master too. (cherry picked from commit 34e911ce9a9f91f3259189861779032069257852) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Disable MLlib tests for now while Jenkins is still on Python 2.6 | Matei Zaharia | 2014-01-13 | 1 | -5/+5
|
* Log Python exceptions to stderr as well | Matei Zaharia | 2014-01-12 | 1 | -0/+4
| | | | | | This helps in case the exception happened while serializing a record to be sent to Java, leaving the stream to Java in an inconsistent state where PythonRDD won't be able to read the error.
* Update some Python MLlib parameters to use camelCase, and tweak docs | Matei Zaharia | 2014-01-11 | 2 | -21/+21
| | | | | | | We've used camel case in other Spark methods so it felt reasonable to keep using it here and make the code match Scala/Java as much as possible. Note that parameter names matter in Python because it allows passing optional parameters by name.
* Add Naive Bayes to Python MLlib, and some API fixes | Matei Zaharia | 2014-01-11 | 6 | -23/+87
| | | | | | | | | | | | - Added a Python wrapper for Naive Bayes - Updated the Scala Naive Bayes to match the style of our other algorithms better and in particular make it easier to call from Java (added builder pattern, removed default value in train method) - Updated Python MLlib functions to not require a SparkContext; we can get that from the RDD the user gives - Added a toString method in LabeledPoint - Made the Python MLlib tests run as part of run-tests as well (before they could only be run individually through each file)
* Merge branch 'master' into MatrixFactorizationModel-fix | Hossein Falaki | 2014-01-07 | 6 | -4/+4
|\
| * Merge remote-tracking branch 'apache-github/master' into remove-binaries | Patrick Wendell | 2014-01-03 | 3 | -3/+3
| |\ | | | | | | | | | | | | | | | Conflicts: core/src/test/scala/org/apache/spark/DriverSuite.scala docs/python-programming-guide.md
| | * Merge pull request #317 from ScrapCodes/spark-915-segregate-scripts | Patrick Wendell | 2014-01-03 | 3 | -3/+3
| | |\ | | | | | | | | | | | | Spark-915 segregate scripts
| | | * sbin/spark-class* -> bin/spark-class* | Prashant Sharma | 2014-01-03 | 1 | -1/+1
| | | |
| | | * pyspark -> bin/pyspark | Prashant Sharma | 2014-01-02 | 1 | -1/+1
| | | |
| | | * Merge branch 'scripts-reorg' of github.com:shane-huang/incubator-spark into spark-915-segregate-scripts | Prashant Sharma | 2014-01-02 | 2 | -2/+2
| | | |\ Conflicts: bin/spark-shell, core/pom.xml, core/src/main/scala/org/apache/spark/SparkContext.scala, core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala, core/src/main/scala/org/apache/spark/ui/UIWorkloadGenerator.scala, core/src/test/scala/org/apache/spark/DriverSuite.scala, python/run-tests, sbin/compute-classpath.sh, sbin/spark-class, sbin/stop-slaves.sh