aboutsummaryrefslogtreecommitdiff
path: root/python
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-3681] [SQL] [PySpark] fix serialization of List and Map in SchemaRDDDavies Liu2014-09-272-27/+34
| | | | | | | | | | | | Currently, the schema of object in ArrayType or MapType is attached lazily, it will have better performance but introduce issues while serialization or accessing nested objects. This patch will apply schema to the objects of ArrayType or MapType immediately when accessing them, will be a little bit slower, but much robust. Author: Davies Liu <davies.liu@gmail.com> Closes #2526 from davies/nested and squashes the following commits: 2399ae5 [Davies Liu] fix serialization of List and Map in SchemaRDD
* Revert "[SPARK-3478] [PySpark] Profile the Python tasks"Josh Rosen2014-09-266-108/+7
| | | | This reverts commit 1aa549ba9839565274a12c52fa1075b424f138a6.
* [SPARK-3478] [PySpark] Profile the Python tasksDavies Liu2014-09-266-7/+108
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch add profiling support for PySpark, it will show the profiling results before the driver exits, here is one example: ``` ============================================================ Profile of RDD<id=3> ============================================================ 5146507 function calls (5146487 primitive calls) in 71.094 seconds Ordered by: internal time, cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 5144576 68.331 0.000 68.331 0.000 statcounter.py:44(merge) 20 2.735 0.137 71.071 3.554 statcounter.py:33(__init__) 20 0.017 0.001 0.017 0.001 {cPickle.dumps} 1024 0.003 0.000 0.003 0.000 t.py:16(<lambda>) 20 0.001 0.000 0.001 0.000 {reduce} 21 0.001 0.000 0.001 0.000 {cPickle.loads} 20 0.001 0.000 0.001 0.000 copy_reg.py:95(_slotnames) 41 0.001 0.000 0.001 0.000 serializers.py:461(read_int) 40 0.001 0.000 0.002 0.000 serializers.py:179(_batched) 62 0.000 0.000 0.000 0.000 {method 'read' of 'file' objects} 20 0.000 0.000 71.072 3.554 rdd.py:863(<lambda>) 20 0.000 0.000 0.001 0.000 serializers.py:198(load_stream) 40/20 0.000 0.000 71.072 3.554 rdd.py:2093(pipeline_func) 41 0.000 0.000 0.002 0.000 serializers.py:130(load_stream) 40 0.000 0.000 71.072 1.777 rdd.py:304(func) 20 0.000 0.000 71.094 3.555 worker.py:82(process) ``` Also, use can show profile result manually by `sc.show_profiles()` or dump it into disk by `sc.dump_profiles(path)`, such as ```python >>> sc._conf.set("spark.python.profile", "true") >>> rdd = sc.parallelize(range(100)).map(str) >>> rdd.count() 100 >>> sc.show_profiles() ============================================================ Profile of RDD<id=1> ============================================================ 284 function calls (276 primitive calls) in 0.001 seconds Ordered by: internal time, cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 4 0.000 0.000 0.000 0.000 serializers.py:198(load_stream) 4 0.000 0.000 0.000 0.000 {reduce} 12/4 0.000 0.000 0.001 0.000 rdd.py:2092(pipeline_func) 4 0.000 0.000 0.000 0.000 {cPickle.loads} 4 0.000 0.000 0.000 0.000 {cPickle.dumps} 104 0.000 0.000 0.000 0.000 rdd.py:852(<genexpr>) 8 0.000 0.000 0.000 0.000 serializers.py:461(read_int) 12 0.000 0.000 0.000 0.000 rdd.py:303(func) ``` The profiling is disabled by default, can be enabled by "spark.python.profile=true". Also, users can dump the results into disks automatically for future analysis, by "spark.python.profile.dump=path_to_dump" Author: Davies Liu <davies.liu@gmail.com> Closes #2351 from davies/profiler and squashes the following commits: 7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles() 2b0daf2 [Davies Liu] fix docs 7a56c24 [Davies Liu] bugfix cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler 116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler 09d02c3 [Davies Liu] Merge branch 'master' into profiler c23865c [Davies Liu] Merge branch 'master' into profiler 15d6f18 [Davies Liu] add docs for two configs dadee1a [Davies Liu] add docs string and clear profiles after show or dump 4f8309d [Davies Liu] address comment, add tests 0a5b6eb [Davies Liu] fix Python UDF 4b20494 [Davies Liu] add profile for python
* [SPARK-546] Add full outer join to RDD and DStream.Aaron Staple2014-09-242-2/+39
| | | | | | | | | | | | | leftOuterJoin and rightOuterJoin are already implemented. This patch adds fullOuterJoin. Author: Aaron Staple <aaron.staple@gmail.com> Closes #1395 from staple/SPARK-546 and squashes the following commits: 1f5595c [Aaron Staple] Fix python style 7ac0aa9 [Aaron Staple] [SPARK-546] Add full outer join to RDD and DStream. 3b5d137 [Aaron Staple] In JavaPairDStream, make class tag specification in rightOuterJoin consistent with other functions. 31f2956 [Aaron Staple] Fix left outer join documentation comments.
* [SPARK-3679] [PySpark] pickle the exact globals of functionsDavies Liu2014-09-242-6/+54
| | | | | | | | | | | | | | function.func_code.co_names has all the names used in the function, including name of attributes. It will pickle some unnecessary globals if there is a global having the same name with attribute (in co_names). There is a regression introduced by #2144, revert part of changes in that PR. cc JoshRosen Author: Davies Liu <davies.liu@gmail.com> Closes #2522 from davies/globals and squashes the following commits: dfbccf5 [Davies Liu] fix bug while pickle globals of function
* [SPARK-3634] [PySpark] User's module should take precedence over system modulesDavies Liu2014-09-244-8/+48
| | | | | | | | | | | | | | | Python modules added through addPyFile should take precedence over system modules. This patch put the path for user added module in the front of sys.path (just after ''). Author: Davies Liu <davies.liu@gmail.com> Closes #2492 from davies/path and squashes the following commits: 4a2af78 [Davies Liu] fix tests f7ff4da [Davies Liu] ad license header 6b0002f [Davies Liu] add tests c16c392 [Davies Liu] put addPyFile in front of sys.path
* [PySpark] remove unnecessary use of numSlices from pyspark testsMatthew Farrellee2014-09-201-2/+2
| | | | | | | | Author: Matthew Farrellee <matt@redhat.com> Closes #2467 from mattf/master-pyspark-remove-numslices-from-tests and squashes the following commits: c49a87b [Matthew Farrellee] [PySpark] remove unnecessary use of numSlices from pyspark tests
* [SPARK-3592] [SQL] [PySpark] support applySchema to RDD of RowDavies Liu2014-09-192-4/+20
| | | | | | | | | | | | | | | Fix the issue when applySchema() to an RDD of Row. Also add type mapping for BinaryType. Author: Davies Liu <davies.liu@gmail.com> Closes #2448 from davies/row and squashes the following commits: dd220cf [Davies Liu] fix test 3f3f188 [Davies Liu] add more test f559746 [Davies Liu] add tests, fix serialization 9688fd2 [Davies Liu] support applySchema to RDD of Row
* [SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlibDavies Liu2014-09-1916-917/+650
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we serialize the data between JVM and Python case by case manually, this cannot scale to support so many APIs in MLlib. This patch will try to address this problem by serialize the data using pickle protocol, using Pyrolite library to serialize/deserialize in JVM. Pickle protocol can be easily extended to support customized class. All the modules are refactored to use this protocol. Known issues: There will be some performance regression (both CPU and memory, the serialized data increased) Author: Davies Liu <davies.liu@gmail.com> Closes #2378 from davies/pickle_mllib and squashes the following commits: dffbba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into pickle_mllib 810f97f [Davies Liu] fix equal of matrix 032cd62 [Davies Liu] add more type check and conversion for user_product bd738ab [Davies Liu] address comments e431377 [Davies Liu] fix cache of rdd, refactor 19d0967 [Davies Liu] refactor Picklers 2511e76 [Davies Liu] cleanup 1fccf1a [Davies Liu] address comments a2cc855 [Davies Liu] fix tests 9ceff73 [Davies Liu] test size of serialized Rating 44e0551 [Davies Liu] fix cache a379a81 [Davies Liu] fix pickle array in python2.7 df625c7 [Davies Liu] Merge commit '154d141' into pickle_mllib 154d141 [Davies Liu] fix autobatchedpickler 44736d7 [Davies Liu] speed up pickling array in Python 2.7 e1d1bfc [Davies Liu] refactor 708dc02 [Davies Liu] fix tests 9dcfb63 [Davies Liu] fix style 88034f0 [Davies Liu] rafactor, address comments 46a501e [Davies Liu] choose batch size automatically df19464 [Davies Liu] memorize the module and class name during pickleing f3506c5 [Davies Liu] Merge branch 'master' into pickle_mllib 722dd96 [Davies Liu] cleanup _common.py 0ee1525 [Davies Liu] remove outdated tests b02e34f [Davies Liu] remove _common.py 84c721d [Davies Liu] Merge branch 'master' into pickle_mllib 4d7963e [Davies Liu] remove muanlly serialization 6d26b03 [Davies Liu] fix tests c383544 [Davies Liu] classification f2a0856 [Davies Liu] mllib/regression d9f691f [Davies Liu] mllib/util cccb8b1 [Davies Liu] mllib/tree 8fe166a [Davies Liu] Merge branch 'pickle' into pickle_mllib aa2287e [Davies Liu] random f1544c4 [Davies Liu] refactor clustering 52d1350 [Davies Liu] use new protocol in mllib/stat b30ef35 [Davies Liu] use pickle to serialize data for mllib/recommendation f44f771 [Davies Liu] enable tests about array 3908f5c [Davies Liu] Merge branch 'master' into pickle c77c87b [Davies Liu] cleanup debugging code 60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
* [SPARK-3554] [PySpark] use broadcast automatically for large closureDavies Liu2014-09-184-3/+19
| | | | | | | | | | | Py4j can not handle large string efficiently, so we should use broadcast for large closure automatically. (Broadcast use local filesystem to pass through data). Author: Davies Liu <davies.liu@gmail.com> Closes #2417 from davies/command and squashes the following commits: fbf4e97 [Davies Liu] bugfix aefd508 [Davies Liu] use broadcast automatically for large closure
* [SPARK-3430] [PySpark] [Doc] generate PySpark API docs using SphinxDavies Liu2014-09-1613-5/+944
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using Sphinx to generate API docs for PySpark. requirement: Sphinx ``` $ cd python/docs/ $ make html ``` The generated API docs will be located at python/docs/_build/html/index.html It can co-exists with those generated by Epydoc. This is the first working version, after merging in, then we can continue to improve it and replace the epydoc finally. Author: Davies Liu <davies.liu@gmail.com> Closes #2292 from davies/sphinx and squashes the following commits: 425a3b1 [Davies Liu] cleanup 1573298 [Davies Liu] move docs to python/docs/ 5fe3903 [Davies Liu] Merge branch 'master' into sphinx 9468ab0 [Davies Liu] fix makefile b408f38 [Davies Liu] address all comments e2ccb1b [Davies Liu] update name and version 9081ead [Davies Liu] generate PySpark API docs using Sphinx
* [SPARK-2314][SQL] Override collect and take in python library, and count in ↵Aaron Staple2014-09-161-5/+42
| | | | | | | | | | | | | | | | | | java library, with optimized versions. SchemaRDD overrides RDD functions, including collect, count, and take, with optimized versions making use of the query optimizer. The java and python interface classes wrapping SchemaRDD need to ensure the optimized versions are called as well. This patch overrides relevant calls in the python and java interfaces with optimized versions. Adds a new Row serialization pathway between python and java, based on JList[Array[Byte]] versus the existing RDD[Array[Byte]]. I wasn’t overjoyed about doing this, but I noticed that some QueryPlans implement optimizations in executeCollect(), which outputs an Array[Row] rather than the typical RDD[Row] that can be shipped to python using the existing serialization code. To me it made sense to ship the Array[Row] over to python directly instead of converting it back to an RDD[Row] just for the purpose of sending the Rows to python using the existing serialization code. Author: Aaron Staple <aaron.staple@gmail.com> Closes #1592 from staple/SPARK-2314 and squashes the following commits: 89ff550 [Aaron Staple] Merge with master. 6bb7b6c [Aaron Staple] Fix typo. b56d0ac [Aaron Staple] [SPARK-2314][SQL] Override count in JavaSchemaRDD, forwarding to SchemaRDD's count. 0fc9d40 [Aaron Staple] Fix comment typos. f03cdfa [Aaron Staple] [SPARK-2314][SQL] Override collect and take in sql.py, forwarding to SchemaRDD's collect.
* [SPARK-3519] add distinct(n) to PySparkMatthew Farrellee2014-09-163-4/+24
| | | | | | | | | | | | Added missing rdd.distinct(numPartitions) and associated tests Author: Matthew Farrellee <matt@redhat.com> Closes #2383 from mattf/SPARK-3519 and squashes the following commits: 30b837a [Matthew Farrellee] Combine test cases to save on JVM startups 6bc4a2c [Matthew Farrellee] [SPARK-3519] add distinct(n) to SchemaRDD in PySpark 7a17f2b [Matthew Farrellee] [SPARK-3519] add distinct(n) to PySpark
* [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.Aaron Staple2014-09-153-61/+83
| | | | | | | | | | | Also made some cosmetic cleanups. Author: Aaron Staple <aaron.staple@gmail.com> Closes #2385 from staple/SPARK-1087 and squashes the following commits: 7b3bb13 [Aaron Staple] Address review comments, cosmetic cleanups. 10ba6e1 [Aaron Staple] [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.
* [SPARK-2951] [PySpark] support unpickle array.array for Python 2.6Davies Liu2014-09-152-2/+1
| | | | | | | | | | | | | | | | | Pyrolite can not unpickle array.array which pickled by Python 2.6, this patch fix it by extend Pyrolite. There is a bug in Pyrolite when unpickle array of float/double, this patch workaround it by reverse the endianness for float/double. This workaround should be removed after Pyrolite have a new release to fix this issue. I had send an PR to Pyrolite to fix it: https://github.com/irmen/Pyrolite/pull/11 Author: Davies Liu <davies.liu@gmail.com> Closes #2365 from davies/pickle and squashes the following commits: f44f771 [Davies Liu] enable tests about array 3908f5c [Davies Liu] Merge branch 'master' into pickle c77c87b [Davies Liu] cleanup debugging code 60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
* [SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGain ↵qiping.lqp2014-09-151-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | params to example and Python API Added minInstancesPerNode, minInfoGain params to: * DecisionTreeRunner.scala example * Python API (tree.py) Also: * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes CC: mengxr Author: qiping.lqp <qiping.lqp@alibaba-inc.com> Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Author: chouqin <liqiping1991@gmail.com> Closes #2349 from jkbradley/chouqin-dt-preprune and squashes the following commits: 61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy. a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune 95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune 19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune f1d11d1 [chouqin] fix typo c7ebaf1 [chouqin] fix typo 39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py 0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1 efcc736 [qiping.lqp] fix bug 10b8012 [qiping.lqp] fix style 6728fad [qiping.lqp] minor fix: remove empty lines bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune cadd569 [qiping.lqp] add api docs 46b891f [qiping.lqp] fix bug e72c7e4 [qiping.lqp] add comments 845c6fa [qiping.lqp] fix style f195e83 [qiping.lqp] fix style 987cbf4 [qiping.lqp] fix bug ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
* [SPARK-3463] [PySpark] aggregate and show spilled bytes in PythonDavies Liu2014-09-133-14/+34
| | | | | | | | | | | | | | | | | | Aggregate the number of bytes spilled into disks during aggregation or sorting, show them in Web UI. ![spilled](https://cloud.githubusercontent.com/assets/40902/4209758/4b995562-386d-11e4-97c1-8e838ee1d4e3.png) This patch is blocked by SPARK-3465. (It includes a fix for that). Author: Davies Liu <davies.liu@gmail.com> Closes #2336 from davies/metrics and squashes the following commits: e37df38 [Davies Liu] remove outdated comments 1245eb7 [Davies Liu] remove the temporary fix ebd2f43 [Davies Liu] Merge branch 'master' into metrics 7e4ad04 [Davies Liu] Merge branch 'master' into metrics fbe9029 [Davies Liu] show spilled bytes in Python in web ui
* [SPARK-3030] [PySpark] Reuse Python workerDavies Liu2014-09-135-28/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reuse Python worker to avoid the overhead of fork() Python process for each tasks. It also tracks the broadcasts for each worker, avoid sending repeated broadcasts. This can reduce the time for dummy task from 22ms to 13ms (-40%). It can help to reduce the latency for Spark Streaming. For a job with broadcast (43M after compress): ``` b = sc.broadcast(set(range(30000000))) print sc.parallelize(range(24000), 100).filter(lambda x: x in b.value).count() ``` It will finish in 281s without reused worker, and it will finish in 65s with reused worker(4 CPUs). After reusing the worker, it can save about 9 seconds for transfer and deserialize the broadcast for each tasks. It's enabled by default, could be disabled by `spark.python.worker.reuse = false`. Author: Davies Liu <davies.liu@gmail.com> Closes #2259 from davies/reuse-worker and squashes the following commits: f11f617 [Davies Liu] Merge branch 'master' into reuse-worker 3939f20 [Davies Liu] fix bug in serializer in mllib cf1c55e [Davies Liu] address comments 3133a60 [Davies Liu] fix accumulator with reused worker 760ab1f [Davies Liu] do not reuse worker if there are any exceptions 7abb224 [Davies Liu] refactor: sychronized with itself ac3206e [Davies Liu] renaming 8911f44 [Davies Liu] synchronized getWorkerBroadcasts() 6325fc1 [Davies Liu] bugfix: bid >= 0 e0131a2 [Davies Liu] fix name of config 583716e [Davies Liu] only reuse completed and not interrupted worker ace2917 [Davies Liu] kill python worker after timeout 6123d0f [Davies Liu] track broadcasts for each worker 8d2f08c [Davies Liu] reuse python worker
* [SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rddDavies Liu2014-09-122-20/+46
| | | | | | | | | | | | | | Currently, SchemaRDD._jschema_rdd is SchemaRDD, the Scala API (coalesce(), repartition()) can not been called in Python easily, there is no way to specify the implicit parameter `ord`. The _jrdd is an JavaRDD, so _jschema_rdd should also be JavaSchemaRDD. In this patch, change _schema_rdd to JavaSchemaRDD, also added an assert for it. If some methods are missing from JavaSchemaRDD, then it's called by _schema_rdd.baseSchemaRDD().xxx(). BTW, Do we need JavaSQLContext? Author: Davies Liu <davies.liu@gmail.com> Closes #2369 from davies/fix_schemardd and squashes the following commits: abee159 [Davies Liu] use JavaSchemaRDD as SchemaRDD._jschema_rdd
* [SPARK-3094] [PySpark] compatitable with PyPyDavies Liu2014-09-125-118/+172
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After this patch, we can run PySpark in PyPy (testing with PyPy 2.3.1 in Mac 10.9), for example: ``` PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py ``` The performance speed up will depend on work load (from 20% to 3000%). Here are some benchmarks: Job | CPython 2.7 | PyPy 2.3.1 | Speed up ------- | ------------ | ------------- | ------- Word Count | 41s | 15s | 2.7x Sort | 46s | 44s | 1.05x Stats | 174s | 3.6s | 48x Here is the code used for benchmark: ```python rdd = sc.textFile("text") def wordcount(): rdd.flatMap(lambda x:x.split('/'))\ .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap() def sort(): rdd.sortBy(lambda x:x, 1).count() def stats(): sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats() ``` Author: Davies Liu <davies.liu@gmail.com> Closes #2144 from davies/pypy and squashes the following commits: 9aed6c5 [Davies Liu] use protocol 2 in CloudPickle 4bc1f04 [Davies Liu] refactor b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way 3ca2351 [Davies Liu] Merge branch 'master' into pypy fae8b19 [Davies Liu] improve attrgetter, add tests 591f830 [Davies Liu] try to run tests with PyPy in run-tests c8d62ba [Davies Liu] cleanup f651fd0 [Davies Liu] fix tests using array with PyPy 1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways 3c1dbfe [Davies Liu] Merge branch 'master' into pypy 42fb5fa [Davies Liu] Merge branch 'master' into pypy cb2d724 [Davies Liu] fix tests 9986692 [Davies Liu] Merge branch 'master' into pypy 25b4ca7 [Davies Liu] support PyPy
* [PySpark] Add blank line so that Python RDD.top() docstring renders correctlyRJ Nowling2014-09-121-0/+1
| | | | | | | | Author: RJ Nowling <rnowling@gmail.com> Closes #2370 from rnowling/python_rdd_docstrings and squashes the following commits: 5230574 [RJ Nowling] Add blank line so that Python RDD.top() docstring renders correctly
* [SPARK-3047] [PySpark] add an option to use str in textFileRDDDavies Liu2014-09-112-11/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | str is much efficient than unicode (both CPU and memory), it'e better to use str in textFileRDD. In order to keep compatibility, use unicode by default. (Maybe change it in the future). use_unicode=True: daviesliudm:~/work/spark$ time python wc.py (u'./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776) real 2m8.298s user 0m0.185s sys 0m0.064s use_unicode=False daviesliudm:~/work/spark$ time python wc.py ('./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776) real 1m26.402s user 0m0.182s sys 0m0.062s We can see that it got 32% improvement! Author: Davies Liu <davies.liu@gmail.com> Closes #1951 from davies/unicode and squashes the following commits: 8352d57 [Davies Liu] update version number a286f2f [Davies Liu] rollback loads() 85246e5 [Davies Liu] add docs for use_unicode a0295e1 [Davies Liu] add an option to use str in textFile()
* [SPARK-3458] enable python "with" statements for SparkContextMatthew Farrellee2014-09-092-0/+43
| | | | | | | | | | | | | | | | | | | | | | | | | allow for best practice code, ``` try: sc = SparkContext() app(sc) finally: sc.stop() ``` to be written using a "with" statement, ``` with SparkContext() as sc: app(sc) ``` Author: Matthew Farrellee <matt@redhat.com> Closes #2335 from mattf/SPARK-3458 and squashes the following commits: 5b4e37c [Matthew Farrellee] [SPARK-3458] enable python "with" statements for SparkContext
* [SPARK-3443][MLLIB] update default values of tree:Xiangrui Meng2014-09-081-2/+2
| | | | | | | | | | | | | | | | | | Adjust the default values of decision tree, based on the memory requirement discussed in https://github.com/apache/spark/pull/2125 : 1. maxMemoryInMB: 128 -> 256 2. maxBins: 100 -> 32 3. maxDepth: 4 -> 5 (in some example code) jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #2322 from mengxr/tree-defaults and squashes the following commits: cda453a [Xiangrui Meng] fix tests 5900445 [Xiangrui Meng] update comments 8c81831 [Xiangrui Meng] update default values of tree:
* [SPARK-3417] Use new-style classes in PySparkMatthew Rocklin2014-09-084-4/+4
| | | | | | | | | | | | | | | | | | Tiny PR making SQLContext a new-style class. This allows various type logic to work more effectively ```Python In [1]: import pyspark In [2]: pyspark.sql.SQLContext.mro() Out[2]: [pyspark.sql.SQLContext, object] ``` Author: Matthew Rocklin <mrocklin@gmail.com> Closes #2288 from mrocklin/sqlcontext-new-style-class and squashes the following commits: 4aadab6 [Matthew Rocklin] update other old-style classes a2dc02f [Matthew Rocklin] pyspark.sql.SQLContext is new-style class
* Provide a default PYSPARK_PYTHON for python/run_testsMatthew Farrellee2014-09-081-0/+2
| | | | | | | | | | | | | | Without this the version of python used in the test is not recorded. The error is, Testing with Python version: ./run-tests: line 57: --version: command not found Author: Matthew Farrellee <matt@redhat.com> Closes #2300 from mattf/master-fix-python-run-tests and squashes the following commits: 65a09f5 [Matthew Farrellee] Provide a default PYSPARK_PYTHON for python/run_tests
* SPARK-2978. Transformation with MR shuffle semanticsSandy Ryza2014-09-082-0/+32
| | | | | | | | | | | | | | | | | | I didn't add this to the transformations list in the docs because it's kind of obscure, but would be happy to do so if others think it would be helpful. Author: Sandy Ryza <sandy@cloudera.com> Closes #2274 from sryza/sandy-spark-2978 and squashes the following commits: 4a5332a [Sandy Ryza] Fix Java test c04b447 [Sandy Ryza] Fix Python doc and add back deleted code 433ad5b [Sandy Ryza] Add Java test 4c25a54 [Sandy Ryza] Add s at the end and a couple other fixes 9b0ba99 [Sandy Ryza] Fix compilation 36e0571 [Sandy Ryza] Fix import ordering 48c12c2 [Sandy Ryza] Add Java version and additional doc e5381cd [Sandy Ryza] Fix python style warnings f147634 [Sandy Ryza] SPARK-2978. Transformation with MR shuffle semantics
* SPARK-3337 Paranoid quoting in shell to allow install dirs with spaces within.Prashant Sharma2014-09-081-2/+4
| | | | | | | | | | | | ... Tested ! TBH, it isn't a great idea to have directory with spaces within. Because emacs doesn't like it then hadoop doesn't like it. and so on... Author: Prashant Sharma <prashant.s@imaginea.com> Closes #2229 from ScrapCodes/SPARK-3337/quoting-shell-scripts and squashes the following commits: d4ad660 [Prashant Sharma] SPARK-3337 Paranoid quoting in shell to allow install dirs with spaces within.
* [SPARK-3415] [PySpark] removes SerializingAdapter codeWard Viaene2014-09-072-5/+12
| | | | | | | | | | | | | | | | This code removes the SerializingAdapter code that was copied from PiCloud Author: Ward Viaene <ward.viaene@bigdatapartnership.com> Closes #2287 from wardviaene/feature/pythonsys and squashes the following commits: 5f0d426 [Ward Viaene] SPARK-3415: modified test class to do dump and load 5f5d559 [Ward Viaene] SPARK-3415: modified test class name and call cloudpickle.dumps instead using StringIO afc4a9a [Ward Viaene] SPARK-3415: added newlines to pass lint aaf10b7 [Ward Viaene] SPARK-3415: removed references to SerializingAdapter and rewrote test 65ffeff [Ward Viaene] removed duplicate test a958866 [Ward Viaene] SPARK-3415: test script e263bf5 [Ward Viaene] SPARK-3415: removes legacy SerializingAdapter code
* [SPARK-2334] fix AttributeError when call PipelineRDD.id()Davies Liu2014-09-063-4/+20
| | | | | | | | | | | | | | | The underline JavaRDD for PipelineRDD is created lazily, it's delayed until call _jrdd. The id of JavaRDD is cached as `_id`, it saves a RPC call in py4j for later calls. closes #1276 Author: Davies Liu <davies.liu@gmail.com> Closes #2296 from davies/id and squashes the following commits: e197958 [Davies Liu] fix style 9721716 [Davies Liu] fix id of PipelineRDD
* [SPARK-3273][SPARK-3301]We should read the version information from the same ↵GuoQiang Li2014-09-061-2/+2
| | | | | | | | | | | place Author: GuoQiang Li <witgo@qq.com> Closes #2175 from witgo/SPARK-3273 and squashes the following commits: cf9c65a [GuoQiang Li] We should read the version information from the same place 2a44e2f [GuoQiang Li] The spark version in the welcome message of pyspark is not correct
* Spark-3406 add a default storage level to python RDD persist APIHolden Karau2014-09-062-2/+8
| | | | | | | | | | Author: Holden Karau <holden@pigscanfly.ca> Closes #2280 from holdenk/SPARK-3406-Python-RDD-persist-api-does-not-have-default-storage-level and squashes the following commits: 33eaade [Holden Karau] As Josh pointed out, sql also override persist. Make persist behave the same as in the underlying RDD as well e658227 [Holden Karau] Fix the test I added e95a6c5 [Holden Karau] The Python persist function did not have a default storageLevel unlike the Scala API. Noticed this issue because we got a bug report back from the book where we had documented it as if it was the same as the Scala API
* SPARK-3211 .take() is OOM-prone with empty partitionsAndrew Ash2014-09-051-4/+4
| | | | | | | | | | | | | | | | Instead of jumping straight from 1 partition to all partitions, do exponential growth and double the number of partitions to attempt each time instead. Fix proposed by Paul Nepywoda Author: Andrew Ash <andrew@andrewash.com> Closes #2117 from ash211/SPARK-3211 and squashes the following commits: 8b2299a [Andrew Ash] Quadruple instead of double for a minor speedup e5f7e4d [Andrew Ash] Update comment to better reflect what we're doing 09a27f7 [Andrew Ash] Update PySpark to be less OOM-prone as well 3a156b8 [Andrew Ash] SPARK-3211 .take() is OOM-prone with empty partitions
* [SPARK-3378] [DOCS] Replace the word "SparkSQL" with right word "Spark SQL"Kousuke Saruta2014-09-042-4/+4
| | | | | | | | | | Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2251 from sarutak/SPARK-3378 and squashes the following commits: 0bfe234 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3378 bb5938f [Kousuke Saruta] Replaced rest of "SparkSQL" with "Spark SQL" 6df66de [Kousuke Saruta] Replaced "SparkSQL" with "Spark SQL"
* [SPARK-3401][PySpark] Wrong usage of tee command in python/run-testsKousuke Saruta2014-09-041-1/+1
| | | | | | | | Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2272 from sarutak/SPARK-3401 and squashes the following commits: 2b35a59 [Kousuke Saruta] Modified wrong usage of tee command in python/run-tests
* [SPARK-2435] Add shutdown hook to pysparkMatthew Farrellee2014-09-031-0/+2
| | | | | | | | Author: Matthew Farrellee <matt@redhat.com> Closes #2183 from mattf/SPARK-2435 and squashes the following commits: ee0ee99 [Matthew Farrellee] [SPARK-2435] Add shutdown hook to pyspark
* [SPARK-3335] [SQL] [PySpark] support broadcast in Python UDFDavies Liu2014-09-032-8/+31
| | | | | | | | | | After this patch, broadcast can be used in Python UDF. Author: Davies Liu <davies.liu@gmail.com> Closes #2243 from davies/udf_broadcast and squashes the following commits: 7b88861 [Davies Liu] support broadcast in UDF
* [SPARK-3309] [PySpark] Put all public API in __all__Davies Liu2014-09-0317-26/+81
| | | | | | | | | | | | | | Put all public API in __all__, also put them all in pyspark.__init__.py, then we can got all the documents for public API by `pydoc pyspark`. It also can be used by other programs (such as Sphinx or Epydoc) to generate only documents for public APIs. Author: Davies Liu <davies.liu@gmail.com> Closes #2205 from davies/public and squashes the following commits: c6c5567 [Davies Liu] fix message f7b35be [Davies Liu] put SchemeRDD, Row in pyspark.sql module 7e3016a [Davies Liu] add __all__ in mllib 6281b48 [Davies Liu] fix doc for SchemaRDD 6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py
* [SPARK-2871] [PySpark] add countApproxDistinct() APIDavies Liu2014-09-022-5/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RDD.countApproxDistinct(relativeSD=0.05): :: Experimental :: Return approximate number of distinct elements in the RDD. The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>. This support all the types of objects, which is supported by Pyrolite, nearly all builtin types. param relativeSD Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017. >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct() >>> 950 < n < 1050 True >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct() >>> 18 < n < 22 True Author: Davies Liu <davies.liu@gmail.com> Closes #2142 from davies/countApproxDistinct and squashes the following commits: e20da47 [Davies Liu] remove the correction in Python c38c4e4 [Davies Liu] fix doc tests 2ab157c [Davies Liu] fix doc tests 9d2565f [Davies Liu] add commments and link for hash collision correction d306492 [Davies Liu] change range of hash of tuple to [0, maxint] ded624f [Davies Liu] calculate hash in Python 4cba98f [Davies Liu] add more tests a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct e97e342 [Davies Liu] add countApproxDistinct()
* SPARK-3318: Documentation update in addFile on how to use SparkFiles.getHolden Karau2014-08-301-2/+2
| | | | | | | | | | | Rather than specifying the path to SparkFiles we need to use the filename. Author: Holden Karau <holden@pigscanfly.ca> Closes #2210 from holdenk/SPARK-3318-documentation-for-addfiles-should-say-to-use-file-not-path and squashes the following commits: a25d27a [Holden Karau] Update the JavaSparkContext addFile method to be clear about using fileName with SparkFiles as well 0ebcb05 [Holden Karau] Documentation update in addFile on how to use SparkFiles.get to specify filename rather than path
* [SPARK-3307] [PySpark] Fix doc string of SparkContext.broadcast()Davies Liu2014-08-291-2/+0
| | | | | | | | | | remove invalid docs Author: Davies Liu <davies.liu@gmail.com> Closes #2202 from davies/keep and squashes the following commits: aa3b44f [Davies Liu] remove invalid docs
* [SPARK-2871] [PySpark] add RDD.lookup(key)Davies Liu2014-08-271-132/+79
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | RDD.lookup(key) Return the list of values in the RDD for key `key`. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to. >>> l = range(1000) >>> rdd = sc.parallelize(zip(l, l), 10) >>> rdd.lookup(42) # slow [42] >>> sorted = rdd.sortByKey() >>> sorted.lookup(42) # fast [42] It also clean up the code in RDD.py, and fix several bugs (related to preservesPartitioning). Author: Davies Liu <davies.liu@gmail.com> Closes #2093 from davies/lookup and squashes the following commits: 1789cd4 [Davies Liu] `f` in foreach could be generator or not. 2871b80 [Davies Liu] Merge branch 'master' into lookup c6390ea [Davies Liu] address all comments 0f1bce8 [Davies Liu] add test case for lookup() be0e8ba [Davies Liu] fix preservesPartitioning eb1305d [Davies Liu] add RDD.lookup(key)
* [SPARK-3167] Handle special driver configs in WindowsAndrew Or2014-08-261-0/+17
| | | | | | | | | | | | | | | | | | | This is an effort to bring the Windows scripts up to speed after recent splashing changes in #1845. Author: Andrew Or <andrewor14@gmail.com> Closes #2129 from andrewor14/windows-config and squashes the following commits: 881a8f0 [Andrew Or] Add reference to Windows taskkill 92e6047 [Andrew Or] Update a few comments (minor) 22b1acd [Andrew Or] Fix style again (minor) afcffea [Andrew Or] Fix style (minor) 72004c2 [Andrew Or] Actually respect --driver-java-options 803218b [Andrew Or] Actually respect SPARK_*_CLASSPATH eeb34a0 [Andrew Or] Update outdated comment (minor) 35caecc [Andrew Or] In Windows, actually kill Java processes on exit f97daa2 [Andrew Or] Fix Windows spark shell stdin issue 83ebe60 [Andrew Or] Parse special driver configs in Windows (broken)
* [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey()Davies Liu2014-08-264-11/+1021
| | | | | | | | | | | | | | | | | Using external sort to support sort large datasets in reduce stage. Author: Davies Liu <davies.liu@gmail.com> Closes #1978 from davies/sort and squashes the following commits: bbcd9ba [Davies Liu] check spilled bytes in tests b125d2f [Davies Liu] add test for external sort in rdd eae0176 [Davies Liu] choose different disks from different processes and instances 1f075ed [Davies Liu] Merge branch 'master' into sort eb53ca6 [Davies Liu] Merge branch 'master' into sort 644abaf [Davies Liu] add license in LICENSE 19f7873 [Davies Liu] improve tests 55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
* [SPARK-2969][SQL] Make ScalaReflection be able to handle ↵Takuya UESHIN2014-08-261-3/+3
| | | | | | | | | | | | | | | | | | | | | | | ArrayType.containsNull and MapType.valueContainsNull. Make `ScalaReflection` be able to handle like: - `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)` - `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)` - `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)` - `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)` Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits: 24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API. 79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API. 7cd1a7a [Takuya UESHIN] Fix json test failures. 2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true. 2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull. 9fa02f5 [Takuya UESHIN] Fix a test failure. 1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
* [SPARK-2871] [PySpark] add histgram() APIDavies Liu2014-08-262-1/+232
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RDD.histogram(buckets) Compute a histogram using the provided buckets. The buckets are all open to the right except for the last which is closed. e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50], which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1 and 50 we would have a histogram of 1,0,1. If your histogram is evenly spaced (e.g. [0, 10, 20, 30]), this can be switched from an O(log n) inseration to O(1) per element(where n = # buckets). Buckets must be sorted and not contain any duplicates, must be at least two elements. If `buckets` is a number, it will generates buckets which is evenly spaced between the minimum and maximum of the RDD. For example, if the min value is 0 and the max is 100, given buckets as 2, the resulting buckets will be [0,50) [50,100]. buckets must be at least 1 If the RDD contains infinity, NaN throws an exception If the elements in RDD do not vary (max == min) always returns a single bucket. It will return an tuple of buckets and histogram. >>> rdd = sc.parallelize(range(51)) >>> rdd.histogram(2) ([0, 25, 50], [25, 26]) >>> rdd.histogram([0, 5, 25, 50]) ([0, 5, 25, 50], [5, 20, 26]) >>> rdd.histogram([0, 15, 30, 45, 60], True) ([0, 15, 30, 45, 60], [15, 15, 15, 6]) >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"]) >>> rdd.histogram(("a", "b", "c")) (('a', 'b', 'c'), [2, 2]) closes #122, it's duplicated. Author: Davies Liu <davies.liu@gmail.com> Closes #2091 from davies/histgram and squashes the following commits: a322f8a [Davies Liu] fix deprecation of e.message 84e85fa [Davies Liu] remove evenBuckets, add more tests (including str) d9a0722 [Davies Liu] address comments 0e18a2d [Davies Liu] add histgram() API
* [SPARK-2871] [PySpark] add zipWithIndex() and zipWithUniqueId()Davies Liu2014-08-241-0/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RDD.zipWithIndex() Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This method needs to trigger a spark job when this RDD contains more than one partitions. >>> sc.parallelize(range(4), 2).zipWithIndex().collect() [(0, 0), (1, 1), (2, 2), (3, 3)] RDD.zipWithUniqueId() Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k, 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method won't trigger a spark job, which is different from L{zipWithIndex} >>> sc.parallelize(range(4), 2).zipWithUniqueId().collect() [(0, 0), (2, 1), (1, 2), (3, 3)] Author: Davies Liu <davies.liu@gmail.com> Closes #2092 from davies/zipWith and squashes the following commits: cebe5bf [Davies Liu] improve test cases, reverse the order of index 0d2a128 [Davies Liu] add zipWithIndex() and zipWithUniqueId()
* [SPARK-2871] [PySpark] add approx API for RDDDavies Liu2014-08-231-0/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RDD.countApprox(self, timeout, confidence=0.95) :: Experimental :: Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished. >>> rdd = sc.parallelize(range(1000), 10) >>> rdd.countApprox(1000, 1.0) 1000 RDD.sumApprox(self, timeout, confidence=0.95) Approximate operation to return the sum within a timeout or meet the confidence. >>> rdd = sc.parallelize(range(1000), 10) >>> r = sum(xrange(1000)) >>> (rdd.sumApprox(1000) - r) / r < 0.05 RDD.meanApprox(self, timeout, confidence=0.95) :: Experimental :: Approximate operation to return the mean within a timeout or meet the confidence. >>> rdd = sc.parallelize(range(1000), 10) >>> r = sum(xrange(1000)) / 1000.0 >>> (rdd.meanApprox(1000) - r) / r < 0.05 True Author: Davies Liu <davies.liu@gmail.com> Closes #2095 from davies/approx and squashes the following commits: e8c252b [Davies Liu] add approx API for RDD
* [SPARK-2871] [PySpark] add `key` argument for max(), min() and top(n)Davies Liu2014-08-231-17/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RDD.max(key=None) param key: A function used to generate key for comparing >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0]) >>> rdd.max() 43.0 >>> rdd.max(key=str) 5.0 RDD.min(key=None) Find the minimum item in this RDD. param key: A function used to generate key for comparing >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0]) >>> rdd.min() 2.0 >>> rdd.min(key=str) 10.0 RDD.top(num, key=None) Get the top N elements from a RDD. Note: It returns the list sorted in descending order. >>> sc.parallelize([10, 4, 2, 12, 3]).top(1) [12] >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2) [6, 5] >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str) [4, 3, 2] Author: Davies Liu <davies.liu@gmail.com> Closes #2094 from davies/cmp and squashes the following commits: ccbaf25 [Davies Liu] add `key` to top() ad7e374 [Davies Liu] fix tests 2f63512 [Davies Liu] change `comp` to `key` in min/max dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()
* [SPARK-3140] Clarify confusing PySpark exception messageAndrew Or2014-08-201-3/+10
| | | | | | | | | | | | | We read the py4j port from the stdout of the `bin/spark-submit` subprocess. If there is interference in stdout (e.g. a random echo in `spark-submit`), we throw an exception with a warning message. We do not, however, distinguish between this case from the case where no stdout is produced at all. I wasted a non-trivial amount of time being baffled by this exception in search of places where I print random whitespace (in vain, of course). A clearer exception message that distinguishes between these cases will prevent similar headaches that I have gone through. Author: Andrew Or <andrewor14@gmail.com> Closes #2067 from andrewor14/python-exception and squashes the following commits: 742f823 [Andrew Or] Further clarify warning messages e96a7a0 [Andrew Or] Distinguish between unexpected output and no output at all