path: root/python
* [SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey. (Milan Straka, 2015-04-10; 2 files, -1/+12)
    The samples should always be sorted in ascending order, because bisect.bisect_left is used on them. The reverse order of the result is already achieved in rangePartitioner by reversing the found index. The current implementation also works, but always uses only two partitions -- the first one and the last one (because bisect_left returns either "beginning" or "end" for a descending sequence).
    Author: Milan Straka <fox@ucw.cz>
    This patch had conflicts when merged, resolved by Committer: Josh Rosen <joshrosen@databricks.com>
    Closes #4761 from foxik/fix-descending-sort and squashes the following commits:
      95896b5 [Milan Straka] Add regression test for SPARK-5969.
      5757490 [Milan Straka] Fix descending pyspark.rdd.sortByKey.
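    A minimal sketch of the idea (illustrative names, not the actual pyspark.rdd internals): keep the sampled bounds in ascending order so bisect_left stays valid, and reverse the found index when the sort is descending.

    ```python
    from bisect import bisect_left

    def range_partitioner(bounds, num_partitions, ascending=True):
        # bounds must stay ascending for bisect_left to be valid
        def partition_of(key):
            p = bisect_left(bounds, key)
            # for a descending sort, reverse the index instead of the bounds
            return p if ascending else num_partitions - 1 - p
        return partition_of

    part = range_partitioner([10, 20, 30], 4, ascending=False)
    print([part(k) for k in (5, 15, 25, 35)])  # [3, 2, 1, 0]
    ```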
* [SPARK-6781] [SQL] use sqlContext in python shell (Davies Liu, 2015-04-08; 7 files, -53/+52)
    Use `sqlContext` in the PySpark shell, making it consistent with the SQL programming guide. `sqlCtx` is also kept for compatibility.
    Author: Davies Liu <davies@databricks.com>
    Closes #5425 from davies/sqlCtx and squashes the following commits:
      af67340 [Davies Liu] sqlCtx -> sqlContext
      15a278f [Davies Liu] use sqlContext in python shell
    (cherry picked from commit 6ada4f6f52cf1d992c7ab0c32318790cf08b0a0d)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6506] [pyspark] Do not try to retrieve SPARK_HOME when not needed. (Marcelo Vanzin, 2015-04-08; 1 file, -2/+1)
    In particular, the old behavior made pyspark in yarn-cluster mode fail unless SPARK_HOME was set, even though it is not really needed there.
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    Closes #5405 from vanzin/SPARK-6506 and squashes the following commits:
      e184507 [Marcelo Vanzin] [SPARK-6506] [pyspark] Do not try to retrieve SPARK_HOME when not needed.
    (cherry picked from commit f7e21dd1ec4541be54eb01d8b15cfcc6714feed0)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [SPARK-6667] [PySpark] remove setReuseAddress (Davies Liu, 2015-04-02; 1 file, -0/+1)
    Reusing the address on the server side prevented the server from acknowledging incoming connections, so remove it. This PR also retries once after a timeout, and adds a timeout on the client side.
    Author: Davies Liu <davies@databricks.com>
    Closes #5324 from davies/collect_hang and squashes the following commits:
      e5a51a2 [Davies Liu] remove setReuseAddress
      7977c2f [Davies Liu] do retry on client side
      b838f35 [Davies Liu] retry after timeout
    (cherry picked from commit 0cce5451adfc6bf4661bcf67aca3db26376455fe)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
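    A minimal sketch of the client-side behavior described above (illustrative, not the actual PySpark code): set a socket timeout and retry the connection once before giving up.

    ```python
    import socket

    def connect_with_retry(host, port, timeout=3.0, retries=1):
        # time out instead of blocking forever, and retry once on timeout
        for attempt in range(retries + 1):
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.settimeout(timeout)
            try:
                sock.connect((host, port))
                return sock
            except socket.timeout:
                sock.close()
                if attempt == retries:
                    raise
    ```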
* [SPARK-6660][MLLIB] pythonToJava doesn't recognize object arrays (Xiangrui Meng, 2015-04-01; 1 file, -0/+8)
    cc davies
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #5318 from mengxr/SPARK-6660 and squashes the following commits:
      0f66ec2 [Xiangrui Meng] recognize object arrays
      ad8c42f [Xiangrui Meng] add a test for SPARK-6660
    (cherry picked from commit 4815bc2128c7f6d4d21da730b8c72da087233b34)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    Conflicts: python/pyspark/mllib/tests.py
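    For context, a small self-contained illustration of what an "object array" means in numpy: an ndarray whose elements are arbitrary Python objects rather than a primitive dtype, so a converter has to treat each element as a generic Python object.

    ```python
    import numpy as np

    # heterogeneous Python objects force dtype=object, not float64/int64
    arr = np.array([(1, 2), {"a": 1}, "text"], dtype=object)
    print(arr.dtype)   # object
    print(list(arr))   # elements come back as plain Python objects
    ```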
* [SPARK-6553] [pyspark] Support functools.partial as UDF (ksonj, 2015-04-01; 2 files, -1/+33)
    Use `f.__repr__()` instead of `f.__name__` when instantiating `UserDefinedFunction`s, so `functools.partial`s may be used.
    Author: ksonj <kson@siberie.de>
    Closes #5206 from ksonj/partials and squashes the following commits:
      ea66f3d [ksonj] Inserted blank lines for PEP8 compliance
      d81b02b [ksonj] added tests for udf with partial function and callable object
      2c76100 [ksonj] Makes UDFs work with all types of callables
      b814a12 [ksonj] support functools.partial as udf
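    The motivation in a short, self-contained example: functools.partial objects have no __name__, so code that relied on f.__name__ broke, while repr(f) works for any callable.

    ```python
    import functools

    def add(a, b):
        return a + b

    inc = functools.partial(add, 1)
    print(repr(inc))       # fine: works for any callable
    try:
        print(inc.__name__)
    except AttributeError:
        print("partial objects have no __name__")
    ```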
* [SPARK-6642][MLLIB] use 1.2 lambda scaling and remove addImplicit from NormalEquation (Xiangrui Meng, 2015-04-01; 1 file, -3/+3)
    This PR changes lambda scaling from the number of users/items to the number of explicit ratings. The latter is the behavior in 1.2. Slight refactor of NormalEquation to make it independent of ALS models. cc srowen codexiang
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #5314 from mengxr/SPARK-6642 and squashes the following commits:
      dc655a1 [Xiangrui Meng] relax python tests
      f410df2 [Xiangrui Meng] use 1.2 scaling and remove addImplicit from NormalEquation
    (cherry picked from commit ccafd757eda478913f783f3127be715bf6413740)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    Conflicts: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
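    As I read the description (notation mine, a hedged sketch rather than the patch's own formula), the restored 1.2-style objective scales each factor's regularization by its number of explicit ratings:

    ```latex
    \min_{U,V} \sum_{(i,j)\in\Omega} \bigl(r_{ij} - u_i^\top v_j\bigr)^2
      + \lambda \Bigl( \sum_i n_i \lVert u_i \rVert^2 + \sum_j m_j \lVert v_j \rVert^2 \Bigr)
    ```

    where Omega is the set of observed ratings, n_i is the number of ratings by user i, and m_j the number of ratings of item j; the removed behavior scaled by user/item counts instead.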
* [SPARK-6657] [Python] [Docs] fixed python doc build warnings (Joseph K. Bradley, 2015-04-01; 2 files, -17/+11)
    Fixed python doc build warnings. CC whoever wants to review: rxin mengxr davies
    Author: Joseph K. Bradley <joseph@databricks.com>
    Closes #5317 from jkbradley/python-doc-warnings and squashes the following commits:
      4cd43c2 [Joseph K. Bradley] fixed python doc build warnings
    (cherry picked from commit fb25e8c7f45b4f96561e3f7434a0f4dfce8ddefe)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6651][MLLIB] delegate dense vector arithmetic to the underlying numpy array (Xiangrui Meng, 2015-04-01; 1 file, -1/+37)
    Users should be able to use numpy operators directly on dense vectors. cc davies atalwalkar
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #5312 from mengxr/SPARK-6651 and squashes the following commits:
      e665c5c [Xiangrui Meng] wrap the result in a dense vector
      23dfca3 [Xiangrui Meng] delegate dense vector arithmetics to the underlying numpy array
    (cherry picked from commit 2275acce7ba5fac83c58554d7ee9f4c7f3e866cf)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
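    A minimal sketch of the delegation pattern (assumed shape; the real pyspark.mllib.linalg code may differ): forward the arithmetic dunder methods to the wrapped numpy array and wrap the result back into a dense vector.

    ```python
    import numpy as np

    class DenseVector(object):
        def __init__(self, values):
            self.array = np.asarray(values, dtype=np.float64)

        def _delegate(op):
            def func(self, other):
                if isinstance(other, DenseVector):
                    other = other.array
                # run the numpy operator, then wrap the result back up
                return DenseVector(getattr(self.array, op)(other))
            return func

        __add__ = _delegate("__add__")
        __sub__ = _delegate("__sub__")
        __mul__ = _delegate("__mul__")

        def __repr__(self):
            return "DenseVector(%s)" % self.array

    print(DenseVector([1.0, 2.0]) + DenseVector([3.0, 4.0]))  # DenseVector([4. 6.])
    ```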
* [Doc] Improve Python DataFrame documentation (Reynold Xin, 2015-03-31; 5 files, -390/+250)
    Author: Reynold Xin <rxin@databricks.com>
    Closes #5287 from rxin/pyspark-df-doc-cleanup-context and squashes the following commits:
      1841b60 [Reynold Xin] Lint.
      f2007f1 [Reynold Xin] functions and types.
      bc3b72b [Reynold Xin] More improvements to DataFrame Python doc.
      ac1d4c0 [Reynold Xin] Bug fix.
      b163365 [Reynold Xin] Python fix. Added Experimental flag to DataFrameNaFunctions.
      608422d [Reynold Xin] [Doc] Cleanup context.py Python docs.
    (cherry picked from commit 305abe1e57450f49e3ec4dffb073c5adf17cadef)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6623][SQL] Alias DataFrame.na.drop and DataFrame.na.fill in Python. (Reynold Xin, 2015-03-31; 2 files, -6/+45)
    To maintain consistency with the Scala API.
    Author: Reynold Xin <rxin@databricks.com>
    Closes #5284 from rxin/df-na-alias and squashes the following commits:
      19f46b7 [Reynold Xin] Show DataFrameNaFunctions in docs.
      6618118 [Reynold Xin] [SPARK-6623][SQL] Alias DataFrame.na.drop and DataFrame.na.fill in Python.
    (cherry picked from commit b80a030e90d790e27e89b26f536565c582dbf3d5)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6119][SQL] DataFrame support for missing data handling (Reynold Xin, 2015-03-30; 2 files, -0/+182)
    This pull request adds variants of DataFrame.na.drop and DataFrame.na.fill to the Scala/Java API, and DataFrame.fillna and DataFrame.dropna to the Python API.
    Author: Reynold Xin <rxin@databricks.com>
    Closes #5274 from rxin/df-missing-value and squashes the following commits:
      4ee1b98 [Reynold Xin] Improve error reporting in Python.
      33a330c [Reynold Xin] Remove replace for now.
      bc4fdbb [Reynold Xin] Added documentation for replace.
      d56f5a5 [Reynold Xin] Added replace for Scala/Java.
      2385d00 [Reynold Xin] Feedback from Xiangrui on "how".
      914a374 [Reynold Xin] fill with map.
      185c67e [Reynold Xin] Allow specifying column subsets in fill.
      749eb47 [Reynold Xin] fillna
      249b94e [Reynold Xin] Removing undefined functions.
      6a73c68 [Reynold Xin] Missing file.
      67d7003 [Reynold Xin] [SPARK-6119][SQL] DataFrame.na.drop (Scala/Java) and DataFrame.dropna (Python)
    (cherry picked from commit b8ff2bc61c9835867f56afa1860ab5eb727c4a58)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
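    A hedged usage sketch of the Python side (assuming a DataFrame df with nullable columns age and name):

    ```python
    # drop rows containing any nulls, or only nulls in the given columns
    df.dropna()
    df.dropna(subset=["age"])

    # fill nulls with one value for a subset of columns, or with a per-column map
    df.fillna(0, subset=["age"])
    df.fillna({"age": 0, "name": "unknown"})
    ```

    The df.na.drop and df.na.fill aliases from the SPARK-6623 entry above behave the same.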
* [SPARK-6603] [PySpark] [SQL] add SQLContext.udf and deprecate inferSchema() and applySchema (Davies Liu, 2015-03-30; 1 file, -27/+60)
    This PR creates an alias for `registerFunction` as `udf.register`, to be consistent with the Scala API. It also deprecates inferSchema() and applySchema(), showing a warning for them. cc rxin
    Author: Davies Liu <davies@databricks.com>
    Closes #5273 from davies/udf and squashes the following commits:
      476e947 [Davies Liu] address comments
      c096fdb [Davies Liu] add SQLContext.udf and deprecate inferSchema() and applySchema
    (cherry picked from commit f76d2e55b1a67bf5576e1aa001a0b872b9b3895a)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
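    Usage sketch of the new alias next to the old spelling (return type shown explicitly):

    ```python
    from pyspark.sql.types import IntegerType

    # new, Scala-consistent spelling
    sqlContext.udf.register("strlen", lambda s: len(s), IntegerType())
    # old spelling, kept but now one of two equivalent entry points
    sqlContext.registerFunction("strlen2", lambda s: len(s), IntegerType())

    sqlContext.sql("SELECT strlen('pyspark')").collect()
    ```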
* [SPARK-6571][MLLIB] use wrapper in MatrixFactorizationModel.load (Xiangrui Meng, 2015-03-30; 1 file, -0/+8)
    This fixes `predictAll` after load. cc jkbradley
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #5243 from mengxr/SPARK-6571 and squashes the following commits:
      82dcaa7 [Xiangrui Meng] use wrapper in MatrixFactorizationModel.load
    (cherry picked from commit f75f633b21faaf911f04aeff847f25749b1ecd89)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    Conflicts: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
* [DOC] Improvements to Python docs. (Reynold Xin, 2015-03-28; 3 files, -14/+17)
    Author: Reynold Xin <rxin@databricks.com>
    Closes #5238 from rxin/pyspark-docs and squashes the following commits:
      c285951 [Reynold Xin] Reset deprecation warning.
      8c1031e [Reynold Xin] inferSchema
      dd91b1a [Reynold Xin] [DOC] Improvements to Python docs.
    (cherry picked from commit 5eef00d0c6c7cc5448aca7b1c2a2e289a4c43eb0)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6117] [SQL] Improvements to DataFrame.describe() (Reynold Xin, 2015-03-26; 1 file, -0/+19)
    1. Slight modifications to the code to make it more readable.
    2. Added a Python implementation.
    3. Updated the documentation to state that we don't guarantee the output schema for this function and that it should only be used for exploratory data analysis.
    Author: Reynold Xin <rxin@databricks.com>
    Closes #5201 from rxin/df-describe and squashes the following commits:
      25a7834 [Reynold Xin] Reset run-tests.
      6abdfee [Reynold Xin] [SPARK-6117] [SQL] Improvements to DataFrame.describe()
    (cherry picked from commit 784fcd532784fcfd9bf0a1db71c9f71c469ee716)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
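    Usage, for reference (the note above applies: the output schema is not guaranteed):

    ```python
    # count, mean, stddev, min, max for numeric columns; exploratory use only
    df.describe().show()
    df.describe("age", "height").show()
    ```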
* [SPARK-6536] [PySpark] Column.inSet() in Python (Davies Liu, 2015-03-26; 1 file, -0/+17)
    ```
    >>> df[df.name.inSet("Bob", "Mike")].collect()
    [Row(age=5, name=u'Bob')]
    >>> df[df.age.inSet([1, 2, 3])].collect()
    [Row(age=2, name=u'Alice')]
    ```
    Author: Davies Liu <davies@databricks.com>
    Closes #5190 from davies/in and squashes the following commits:
      6b73a47 [Davies Liu] Column.inSet() in Python
    (cherry picked from commit f535802977c5a3ce45894d89fdf59f8723f023c8)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6421][MLLIB] _regression_train_wrapper does not test initialWeights correctly (lewuathe, 2015-03-20; 2 files, -1/+9)
    Weight parameters must be initialized correctly even when a numpy array is passed as the initial weights.
    Author: lewuathe <lewuathe@me.com>
    Closes #5101 from Lewuathe/SPARK-6421 and squashes the following commits:
      7795201 [lewuathe] Fix lint-python errors
      21d4fe3 [lewuathe] Fix init logic of weights
    (cherry picked from commit 257cde7c363efb3317bfb5c13975cca9154894e2)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-6366][SQL] In Python API, the default save mode for save and saveAsTable should be "error" instead of "append". (Yin Huai, 2015-03-18; 1 file, -2/+2)
    https://issues.apache.org/jira/browse/SPARK-6366
    Author: Yin Huai <yhuai@databricks.com>
    Closes #5053 from yhuai/SPARK-6366 and squashes the following commits:
      fc81897 [Yin Huai] Use error as the default save mode for save/saveAsTable.
    (cherry picked from commit dc9c9196d63aa465e86ac52f0e86e10c12472100)
    Signed-off-by: Cheng Lian <lian@databricks.com>
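    A hedged usage sketch of the 1.3-era API (df.save was later superseded by df.write):

    ```python
    # default mode is now "error": fail if the output path already exists
    df.save("/tmp/people.parquet", source="parquet")
    # the old default must now be requested explicitly
    df.save("/tmp/people.parquet", source="parquet", mode="append")
    ```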
* [SPARK-6210] [SQL] use prettyString as column name in agg() (Davies Liu, 2015-03-14; 1 file, -16/+16)
    Use prettyString instead of toString() (which includes the id of the expression) as the column name in agg().
    Author: Davies Liu <davies@databricks.com>
    Closes #5006 from davies/prettystring and squashes the following commits:
      cb1fdcf [Davies Liu] use prettyString as column name in agg()
    (cherry picked from commit b38e073fee794188d5267f1812b095e51874839e)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
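    Illustrative effect (the exact strings are my reading of the change, not verified output):

    ```python
    df.agg({"age": "max"}).columns
    # with prettyString: ['MAX(age)']
    # previously the expression id leaked into the name, e.g. ['MAX(age#12L)']
    ```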
* [SPARK-6194] [SPARK-677] [PySpark] fix memory leak in collect() (Davies Liu, 2015-03-13; 3 files, -34/+23)
    Because of a circular reference between JavaObject and JavaMember, a Java object cannot be released until the Python GC kicks in, causing a memory leak in collect() that may consume lots of memory in the JVM. This PR changes the way we send collected data back into Python from a local file to a socket, which avoids any disk IO during collect and also avoids keeping referrers to the Java object in Python. cc JoshRosen
    Author: Davies Liu <davies@databricks.com>
    Closes #4923 from davies/fix_collect and squashes the following commits:
      d730286 [Davies Liu] address comments
      24c92a4 [Davies Liu] fix style
      ba54614 [Davies Liu] use socket to transfer data from JVM
      9517c8f [Davies Liu] fix memory leak in collect()
    (cherry picked from commit 8767565cef01d847f57b7293d8b63b2422009b90)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
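    A minimal sketch of the Python-side read under the new scheme (assumed shape; the real helper may differ): connect to a local port served by the JVM, drain the stream, and hand the bytes to the deserializer -- no temp file, and no Java object referenced from Python.

    ```python
    import socket

    def read_collect_stream(port, timeout=3.0):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        sock.connect(("127.0.0.1", port))
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:   # JVM closed the socket: all partitions sent
                break
            chunks.append(data)
        sock.close()
        return b"".join(chunks)
    ```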
* [mllib] [python] Add LassoModel to __all__ in regression.py (Joseph K. Bradley, 2015-03-12; 1 file, -2/+4)
    Add LassoModel to __all__ in regression.py; LassoModel does not show up in the Python docs. This should be merged into branch-1.3 and master.
    Author: Joseph K. Bradley <joseph@databricks.com>
    Closes #4970 from jkbradley/SPARK-6253 and squashes the following commits:
      c2cb533 [Joseph K. Bradley] Add LassoModel to __all__ in regression.py
    (cherry picked from commit 17c309c87e78da145dc358514150ec5700eed8f0)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-6294] fix hang when call take() in JVM on PythonRDD (Davies Liu, 2015-03-12; 2 files, -1/+9)
    Thread.interrupt() cannot terminate the thread in some cases, so we should not wait for the writerThread of PythonRDD. This PR also ignores some exceptions during cleanup. cc JoshRosen mengxr
    Author: Davies Liu <davies@databricks.com>
    Closes #4987 from davies/fix_take and squashes the following commits:
      4488f1a [Davies Liu] fix hang when call take() in JVM on PythonRDD
    (cherry picked from commit 712679a7b447346a365b38574d7a86d56a93f767)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [Docs] Replace references to SchemaRDD with DataFrame (Reynold Xin, 2015-03-09; 2 files, -3/+3)
    Author: Reynold Xin <rxin@databricks.com>
    Closes #4952 from rxin/schemardd-df-reference and squashes the following commits:
      b2b1dbe [Reynold Xin] [Docs] Replace references to SchemaRDD with DataFrame
    (cherry picked from commit 70f88148bb04161a1a4968230d8e3fc7e3f8321a)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6097][MLLIB] Support tree model save/load in PySpark/MLlib (Xiangrui Meng, 2015-03-02; 4 files, -15/+79)
    Similar to `MatrixFactorizationModel`, we only need wrappers to support save/load for tree models in Python. cc jkbradley
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #4854 from mengxr/SPARK-6097 and squashes the following commits:
      4586a4d [Xiangrui Meng] fix more typos
      8ebcac2 [Xiangrui Meng] fix python style
      91172d8 [Xiangrui Meng] fix typos
      201b3b9 [Xiangrui Meng] update user guide
      b5158e2 [Xiangrui Meng] support tree model save/load in PySpark/MLlib
    (cherry picked from commit 7e53a79c30511dbd0e5d9878a4b8b0f5bc94e68b)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
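    Usage sketch of the Python API this enables (assuming an RDD `data` of LabeledPoints):

    ```python
    from pyspark.mllib.tree import DecisionTree, DecisionTreeModel

    model = DecisionTree.trainClassifier(data, numClasses=2,
                                         categoricalFeaturesInfo={})
    model.save(sc, "/tmp/tree-model")
    sameModel = DecisionTreeModel.load(sc, "/tmp/tree-model")
    ```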
* [SPARK-6127][Streaming][Docs] Add Kafka to Python api docs (Tathagata Das, 2015-03-02; 1 file, -0/+7)
    cc davies
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Closes #4860 from tdas/SPARK-6127 and squashes the following commits:
      82de92a [Tathagata Das] Add Kafka to Python api docs
    (cherry picked from commit 9eb22ece115c69899d100cecb8a5e20b3a268649)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-6121][SQL][MLLIB] simpleString for UDT (Xiangrui Meng, 2015-03-02; 2 files, -1/+4)
    `df.dtypes` shows `null` for UDTs. This PR uses `udt` by default, and `VectorUDT` overrides it with `vector`. cc jkbradley davies
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #4858 from mengxr/SPARK-6121 and squashes the following commits:
      34f0a77 [Xiangrui Meng] simpleString for UDT
    (cherry picked from commit 2db6a853a53b4c25e35983bc489510abb8a73e1d)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-6080] [PySpark] correct LogisticRegressionWithLBFGS regType parameter for pyspark (Yanbo Liang, 2015-03-02; 1 file, -1/+1)
    Currently, LogisticRegressionWithLBFGS in python/pyspark/mllib/classification.py invokes callMLlibFunc with a wrong "regType" parameter. It was assigned "str(regType)", which translates Python's None into the string "None" on the Java/Scala side. The right way is to translate Python's None into Java/Scala's null, just as we did for LogisticRegressionWithSGD.
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #4831 from yanboliang/pyspark_classification and squashes the following commits:
      12db65a [Yanbo Liang] correct LogisticRegressionWithLBFGS regType parameter for pyspark
    (cherry picked from commit af2effdd7b54316af0c02e781911acfb148b962b)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
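    The bug in two lines, as a self-contained illustration:

    ```python
    regType = None
    str(regType)  # 'None' -- the JVM receives the literal string "None"
    # passing regType through unchanged lets Py4J map Python's None to null
    ```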
* [Streaming][Minor] Fix some error docs in streaming examples (Saisai Shao, 2015-03-02; 1 file, -1/+1)
    Small changes; please help to review, thanks a lot.
    Author: Saisai Shao <saisai.shao@intel.com>
    Closes #4837 from jerryshao/doc-fix and squashes the following commits:
      545291a [Saisai Shao] Fix some error docs in streaming examples
    (cherry picked from commit d8fb40edea7c8c811814f1ff288d59178928964b)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-6053][MLLIB] support save/load in PySpark's ALS (Xiangrui Meng, 2015-03-01; 2 files, -2/+76)
    A simple wrapper to save/load `MatrixFactorizationModel` in Python. cc jkbradley
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #4811 from mengxr/SPARK-5991 and squashes the following commits:
      f135dac [Xiangrui Meng] update save doc
      57e5200 [Xiangrui Meng] address comments
      06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
      282ec8d [Xiangrui Meng] support save/load in PySpark's ALS
    (cherry picked from commit aedbbaa3dda9cbc154cd52c07f6d296b972b0eb2)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
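    Usage sketch (a toy ratings RDD for illustration):

    ```python
    from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

    ratings = sc.parallelize([Rating(1, 1, 5.0), Rating(1, 2, 1.0),
                              Rating(2, 1, 4.0)])
    model = ALS.train(ratings, rank=10, iterations=5)
    model.save(sc, "/tmp/als-model")
    sameModel = MatrixFactorizationModel.load(sc, "/tmp/als-model")
    sameModel.predict(2, 2)  # prediction works after load (see SPARK-6571 above)
    ```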
* [SPARK-6055] [PySpark] fix incorrect __eq__ of DataType (Davies Liu, 2015-02-27; 4 files, -137/+86)
    The __eq__ of DataType is not correct and the class cache is not used correctly (a created class cannot be found by its dataType), so lots of classes are created (saved in _cached_cls) and never released. Also, all instances of the same DataType share the same hash code, so many objects land in a dict bucket with the same hash code; as in a hash-collision attack, access to that dict becomes very slow (depending on the CPython implementation). This PR also improves the performance of inferSchema (avoiding an unnecessary converter for objects). cc pwendell JoshRosen
    Author: Davies Liu <davies@databricks.com>
    Closes #4808 from davies/leak and squashes the following commits:
      6a322a4 [Davies Liu] tests refactor
      3da44fc [Davies Liu] fix __eq__ of Singleton
      534ac90 [Davies Liu] add more checks
      46999dc [Davies Liu] fix tests
      d9ae973 [Davies Liu] fix memory leak in sql
    (cherry picked from commit e0e64ba4b1b8eb72e856286f756c65fa22ab0a36)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
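    A minimal sketch of the failure mode and the fix (illustrative, not the actual pyspark.sql.types code): __eq__ and __hash__ must both incorporate the type's parameters, otherwise every parameterized instance lands in the same dict bucket.

    ```python
    class ArrayType(object):
        def __init__(self, element_type):
            self.element_type = element_type

        def __eq__(self, other):
            return (isinstance(other, ArrayType)
                    and self.element_type == other.element_type)

        def __ne__(self, other):
            return not self.__eq__(other)

        def __hash__(self):
            # hash the same fields __eq__ compares; a constant hash here would
            # put every ArrayType in one bucket and make dict lookups O(n)
            return hash((ArrayType, self.element_type))
    ```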
* [SPARK-6027][SPARK-5546] Fixed --jar and --packages not working for KafkaUtils and improved error message (Tathagata Das, 2015-02-26; 1 file, -15/+27)
    The problem with SPARK-6027, in short, is that JARs like the kafka-assembly.jar do not work in Python because the added JAR is not visible in the classloader used by Py4J. Py4J uses Class.forName(), which does not use the system classloader, but the JARs are only visible in the thread's context classloader. So this change uses the context classloader to create the KafkaUtils dstream object. This works in both cases, whether the Kafka libraries are added with --jars spark-streaming-kafka-assembly.jar or with --packages spark-streaming-kafka. Also improves the error message. cc davies
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Closes #4779 from tdas/kafka-python-fix and squashes the following commits:
      fb16b04 [Tathagata Das] Removed import
      c1fdf35 [Tathagata Das] Fixed long line and improved documentation
      7b88be8 [Tathagata Das] Fixed --jar not working for KafkaUtils and improved error message
    (cherry picked from commit aa63f633d39efa8c29095295f161eaad5495071d)
    Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-6007][SQL] Add numRows param in DataFrame.show() (Jacky Li, 2015-02-26; 1 file, -3/+3)
    It is useful to let the user decide the number of rows to show in DataFrame.show().
    Author: Jacky Li <jacky.likun@huawei.com>
    Closes #4767 from jackylk/show and squashes the following commits:
      a0e0f4b [Jacky Li] fix testcase
      7cdbe91 [Jacky Li] modify according to comment
      bb54537 [Jacky Li] for Java compatibility
      d7acc18 [Jacky Li] modify according to comments
      981be52 [Jacky Li] add numRows param in DataFrame.show()
    (cherry picked from commit 2358657547016d647cdd2e2d363426fcd8d3e9ff)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
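    Usage, for reference:

    ```python
    df.show()    # default: the first 20 rows
    df.show(5)   # only the first 5 rows
    ```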
* [SPARK-5974] [SPARK-5980] [mllib] [python] [docs] Update ML guide with save/load, Python GBT (Joseph K. Bradley, 2015-02-25; 3 files, -89/+141)
    - Add GradientBoostedTrees Python examples to the ML guide (I ran these in the pyspark shell, and they worked.)
    - Add save/load to examples in the ML guide
    - Added note to python docs about predict/transform not working within RDD actions/transformations in some cases (see SPARK-5981)
    CC: mengxr
    Author: Joseph K. Bradley <joseph@databricks.com>
    Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits:
      c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
      bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide. Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
      6d81c3e [Joseph K. Bradley] completed python GBT examples
      9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
      c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide. Added GBT examples to ML guide
    (cherry picked from commit d20559b157743981b9c09e286f2aaff8cbefab59)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-5944] [PySpark] fix version in Python API docs (Davies Liu, 2015-02-25; 3 files, -4/+8)
    Use RELEASE_VERSION when building the Python API docs.
    Author: Davies Liu <davies@databricks.com>
    Closes #4731 from davies/api_version and squashes the following commits:
      c9744c9 [Davies Liu] Update create-release.sh
      08cbc3f [Davies Liu] fix python docs
    (cherry picked from commit f3f4c87b3d944c10d1200dfe49091ebb2a149be6)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5994] [SQL] Python DataFrame documentation fixes (Davies Liu, 2015-02-24; 6 files, -183/+129)
    - select with no arguments should NOT be the same as select; make sure selectExpr behaves the same
    - join param documentation
    - link to source doesn't work in the jekyll-generated file
    - cross-reference of columns (i.e. enabling linking)
    - show(): move the df example before df.show()
    - move tests in SQLContext out of the docstring, otherwise the doc is too long
    - Column.desc and .asc don't have any documentation
    - in documentation, sort functions.*
    Author: Davies Liu <davies@databricks.com>
    Closes #4756 from davies/df_docs and squashes the following commits:
      f30502c [Davies Liu] fix doc
      32f0d46 [Davies Liu] fix DataFrame docs
    (cherry picked from commit d641fbb39c90b1d734cc55396ca43d7e98788975)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python. (Reynold Xin, 2015-02-24; 2 files, -3/+11)
    Also added desc/asc functions for constructing sorting expressions more conveniently, and added a small fix to lift alias out of cast expression.
    Author: Reynold Xin <rxin@databricks.com>
    Closes #4752 from rxin/SPARK-5985 and squashes the following commits:
      aeda5ae [Reynold Xin] Added Experimental flag to ColumnName.
      047ad03 [Reynold Xin] Lift alias out of cast.
      c9cf17c [Reynold Xin] [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python.
    (cherry picked from commit fba11c2f55dd81e4f6230e7edca3c7b2e01ccd9d)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
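    Usage sketch of the new spellings (assuming the asc/desc helpers landed in pyspark.sql.functions):

    ```python
    from pyspark.sql.functions import asc, desc

    df.orderBy(df.age.desc()).show()  # Column-level desc()
    df.orderBy(desc("age")).show()    # equivalent, via the helper function
    df.sort(asc("name")).show()       # sort remains available as well
    ```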
* [SPARK-5973] [PySpark] fix zip with two RDDs with AutoBatchedSerializer (Davies Liu, 2015-02-24; 2 files, -1/+7)
    Author: Davies Liu <davies@databricks.com>
    Closes #4745 from davies/fix_zip and squashes the following commits:
      2124b2c [Davies Liu] Update tests.py
      b5c828f [Davies Liu] increase the number of records
      c1e40fd [Davies Liu] fix zip with two RDDs with AutoBatchedSerializer
    (cherry picked from commit da505e59274d1c838653c1109db65ad374e65304)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-5873][SQL] Allow viewing of partially analyzed plans in queryExecution (Michael Armbrust, 2015-02-23; 1 file, -15/+15)
    Author: Michael Armbrust <michael@databricks.com>
    Closes #4684 from marmbrus/explainAnalysis and squashes the following commits:
      afbaa19 [Michael Armbrust] fix python
      d93278c [Michael Armbrust] fix hive
      e5fa0a4 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explainAnalysis
      52119f2 [Michael Armbrust] more tests
      82a5431 [Michael Armbrust] fix tests
      25753d2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explainAnalysis
      aee1e6a [Michael Armbrust] fix hive
      b23a844 [Michael Armbrust] newline
      de8dc51 [Michael Armbrust] more comments
      acf620a [Michael Armbrust] [SPARK-5873][SQL] Show partially analyzed plans in query execution
    (cherry picked from commit 1ed57086d402c38d95cda6c3d9d7aea806609bf9)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5943][Streaming] Update the test to use new API to reduce the warning (Saisai Shao, 2015-02-23; 1 file, -1/+1)
    Author: Saisai Shao <saisai.shao@intel.com>
    Closes #4722 from jerryshao/SPARK-5943 and squashes the following commits:
      1b01233 [Saisai Shao] Update the test to use new API to reduce the warning
    (cherry picked from commit 757b14b862a1d39c1bad7b321dae1a3ea8338fbb)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-5909][SQL] Add a clearCache command to Spark SQL's cache manager (Yin Huai, 2015-02-20; 1 file, -0/+4)
    JIRA: https://issues.apache.org/jira/browse/SPARK-5909
    Author: Yin Huai <yhuai@databricks.com>
    Closes #4694 from yhuai/clearCache and squashes the following commits:
      397ecc4 [Yin Huai] Address comments.
      a2702fc [Yin Huai] Update parser.
      3a54506 [Yin Huai] add isEmpty to CacheManager.
      6d14460 [Yin Huai] Python clearCache.
      f7b8dbd [Yin Huai] Add clear cache command.
* [SPARK-5898] [SPARK-5896] [SQL] [PySpark] create DataFrame from pandas and tuple/list (Davies Liu, 2015-02-20; 3 files, -20/+20)
    Fix createDataFrame() from a pandas DataFrame (not tested by jenkins; depends on SPARK-5693). It also supports creating a DataFrame from plain tuples/lists without column names; `_1`, `_2` will be used as column names.
    Author: Davies Liu <davies@databricks.com>
    Closes #4679 from davies/pandas and squashes the following commits:
      c0cbe0b [Davies Liu] fix tests
      8466d1d [Davies Liu] fix create DataFrame from pandas
    (cherry picked from commit 5b0a42cb17b840c82d3f8a5ad061d99e261ceadf)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
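    Usage sketch (assuming pandas is installed and sqlContext is the shell's context):

    ```python
    import pandas as pd

    # from plain tuples: columns default to _1, _2
    df1 = sqlContext.createDataFrame([(1, "a"), (2, "b")])
    print(df1.columns)   # ['_1', '_2']

    # from a pandas DataFrame: column names carry over
    pdf = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})
    df2 = sqlContext.createDataFrame(pdf)
    print(df2.columns)   # ['x', 'y']
    ```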
* [SPARK-5867] [SPARK-5892] [doc] [ml] [mllib] Doc cleanups for 1.3 release (Joseph K. Bradley, 2015-02-20; 13 files, -34/+44)
    For SPARK-5867:
    - The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
    - It should also include Python examples now.
    For SPARK-5892:
    - Fix Python docs
    - Various other cleanups
    BTW, I accidentally merged this with master. If you want to compile it on your own, use this branch, which is based on spark/branch-1.3 and cherry-picks the commits from this PR: https://github.com/jkbradley/spark/tree/doc-review-1.3-check
    CC: mengxr (ML), davies (Python docs)
    Author: Joseph K. Bradley <joseph@databricks.com>
    Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:
      f191bb0 [Joseph K. Bradley] small cleanups
      e786efa [Joseph K. Bradley] small doc corrections
      6b1ab4a [Joseph K. Bradley] fixed python lint test
      946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example. Changed spark.ml Java examples to use DataFrames API instead of sql()
      da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
      629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
      b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
      34b067f [Joseph K. Bradley] small doc correction
      da16aef [Joseph K. Bradley] Fixed python mllib docs
      8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
      695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
      a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
      b05a80d [Joseph K. Bradley] organize imports. doc cleanups
      e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
    (cherry picked from commit 4a17eedb16343413e5b6f8bb58c6da8952ee7ab6)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-5904][SQL] DataFrame API fixes. (Reynold Xin, 2015-02-19; 1 file, -36/+20)
    1. Column is no longer a DataFrame, to simplify the class hierarchy.
    2. Don't use varargs on abstract methods (see Scala compiler bug SI-9013).
    Author: Reynold Xin <rxin@databricks.com>
    Closes #4686 from rxin/SPARK-5904 and squashes the following commits:
      fd9b199 [Reynold Xin] Fixed Python tests.
      df25cef [Reynold Xin] Non final.
      5221530 [Reynold Xin] [SPARK-5904][SQL] DataFrame API fixes.
    Conflicts: sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala
* [SPARK-5722] [SQL] [PySpark] infer int as LongType (Davies Liu, 2015-02-18; 3 files, -11/+33)
    Python's `int` is 64-bit on a 64-bit machine (very common now), so we should infer it as LongType in Spark SQL. Also, a LongType in SQL will come back as `int`.
    Author: Davies Liu <davies@databricks.com>
    Closes #4666 from davies/long and squashes the following commits:
      6bc6cc4 [Davies Liu] infer int as LongType
    (cherry picked from commit aa8f10e82a743d59ce87348af19c0177eb618a66)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
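    Illustration of the new inference (bigint is Spark SQL's name for LongType; the exact output strings are my assumption):

    ```python
    df = sqlContext.createDataFrame([(1,)], ["x"])
    print(df.dtypes)                 # [('x', 'bigint')] -- int inferred as LongType
    print(type(df.collect()[0].x))   # plain Python int on the way back
    ```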
* [SPARK-5878] fix DataFrame.repartition() in Python (Davies Liu, 2015-02-18; 1 file, -1/+7)
    Also adds tests for distinct().
    Author: Davies Liu <davies@databricks.com>
    Closes #4667 from davies/repartition and squashes the following commits:
      79059fd [Davies Liu] add test
      cb4915e [Davies Liu] fix repartition
    (cherry picked from commit c1b6fa9838f9d26d60fab3b05a96649882e3dd5b)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-5811] Added documentation for maven coordinates and added Spark Packages support (Burak Yavuz, 2015-02-17; 1 file, -4/+65)
    Documentation for maven coordinates + Spark Packages support. Added pyspark tests for `--packages`.
    Author: Burak Yavuz <brkyvz@gmail.com>
    Author: Davies Liu <davies@databricks.com>
    Closes #4662 from brkyvz/SPARK-5811 and squashes the following commits:
      56ccccd [Burak Yavuz] fixed broken test
      64cb8ee [Burak Yavuz] passed pep8 on local
      c07b81e [Burak Yavuz] fixed pep8
      a8bd6b7 [Burak Yavuz] submit PR
      4ef4046 [Burak Yavuz] ready for PR
      8fb02e5 [Burak Yavuz] merged master
      25c9b9f [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into python-jar
      560d13b [Burak Yavuz] before PR
      17d3f76 [Davies Liu] support .jar as python package
      a3eb717 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-5811
      c60156d [Burak Yavuz] [SPARK-5811] Added documentation for maven coordinates
    (cherry picked from commit ae6cfb3acdbc2721d25793698a4a440f0519dbec)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>
* [SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark (Davies Liu, 2015-02-17; 4 files, -22/+75)
    Currently, PySpark does not support a narrow dependency during cogroup/join when the two RDDs have the same partitioner, so an unnecessary extra shuffle stage comes in. The Python implementation of cogroup/join is different from the Scala one: it depends on union() and partitionBy(). This patch tries to use PartitionerAwareUnionRDD() in union() when all the RDDs have the same partitioner. It also fixes `preservesPartitioning` in all the map() and mapPartitions() calls, so partitionBy() can skip the unnecessary shuffle stage.
    Author: Davies Liu <davies@databricks.com>
    Closes #4629 from davies/narrow and squashes the following commits:
      dffe34e [Davies Liu] improve test, check number of stages for join/cogroup
      1ed3ba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into narrow
      4d29932 [Davies Liu] address comment
      cc28d97 [Davies Liu] add unit tests
      940245e [Davies Liu] address comments
      ff5a0a6 [Davies Liu] skip the partitionBy() on Python side
      eb26c62 [Davies Liu] narrow dependency in PySpark
    (cherry picked from commit c3d2b90bde2e11823909605d518167548df66bd8)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
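    Usage sketch of when the narrow dependency applies (partition count and keys are illustrative):

    ```python
    a = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(4)
    b = sc.parallelize(range(100)).map(lambda x: (x, x * x)).partitionBy(4)
    # both sides share the same partitioner, so the join can be a narrow
    # dependency and the extra shuffle stage is skipped
    joined = a.join(b)
    print(joined.getNumPartitions())  # 4
    ```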
* [SPARK-5872] [SQL] create a sqlCtx in pyspark shell (Davies Liu, 2015-02-17; 2 files, -3/+22)
    The sqlCtx will be a HiveContext if Hive is built into the assembly jar, or a SQLContext if not. It also skips the Hive tests in pyspark.sql.tests if Hive is not available.
    Author: Davies Liu <davies@databricks.com>
    Closes #4659 from davies/sqlctx and squashes the following commits:
      0e6629a [Davies Liu] sqlCtx in pyspark
    (cherry picked from commit 4d4cc760fa9687ce563320094557ef9144488676)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
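    A minimal sketch of the fallback logic (illustrative; the real shell startup code may differ):

    ```python
    from pyspark.sql import SQLContext, HiveContext

    try:
        # prefer HiveContext when Hive is built into the assembly jar
        sqlCtx = HiveContext(sc)
    except Exception:
        sqlCtx = SQLContext(sc)
    ```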
* [SPARK-5871] output explain in Python (Davies Liu, 2015-02-17; 1 file, -3/+20)
    Author: Davies Liu <davies@databricks.com>
    Closes #4658 from davies/explain and squashes the following commits:
      db87ea2 [Davies Liu] output explain in Python
    (cherry picked from commit 3df85dccbc8fd1ba19bbcdb8d359c073b1494d98)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
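    Usage, for reference:

    ```python
    df.explain()      # prints the physical plan
    df.explain(True)  # extended: parsed, analyzed, optimized and physical plans
    ```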