aboutsummaryrefslogtreecommitdiff
path: root/python
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-7969] [SQL] Added a DataFrame.drop function that accepts a Column ↵Mike Dusenberry2015-06-041-3/+18
| | | | | | | | | | | | | | | | | | reference. Added a `DataFrame.drop` function that accepts a `Column` reference rather than a `String`, and added associated unit tests. Basically iterates through the `DataFrame` to find a column with an expression that is equivalent to that of the `Column` argument supplied to the function. Author: Mike Dusenberry <dusenberrymw@gmail.com> Closes #6585 from dusenberrymw/SPARK-7969_Drop_method_on_Dataframes_should_handle_Column and squashes the following commits: 514727a [Mike Dusenberry] Updating the @since tag of the drop(Column) function doc to reflect version 1.4.1 instead of 1.4.0. 2f1bb4e [Mike Dusenberry] Adding an additional assert statement to the 'drop column after join' unit test in order to make sure the correct column was indeed left over. 6bf7c0e [Mike Dusenberry] Minor code formatting change. e583888 [Mike Dusenberry] Adding more Python doctests for the df.drop with column reference function to test joined datasets that have columns with the same name. 5f74401 [Mike Dusenberry] Updating DataFrame.drop with column reference function to use logicalPlan.output to prevent ambiguities resulting from columns with the same name. Also added associated unit tests for joined datasets with duplicate column names. 4b8bbe8 [Mike Dusenberry] Adding Python support for Dataframe.drop with a Column reference. 986129c [Mike Dusenberry] Added a DataFrame.drop function that accepts a Column reference rather than a String, and added associated unit tests. Basically iterates through the DataFrame to find a column with an expression that is equivalent to one supplied to the function.
* Update documentation for [SPARK-7980] [SQL] Support SQLContext.range(end)Reynold Xin2015-06-031-0/+2
|
* [SPARK-7980] [SQL] Support SQLContext.range(end)animesh2015-06-032-2/+12
| | | | | | | | | | | | 1. range() overloaded in SQLContext.scala 2. range() modified in python sql context.py 3. Tests added accordingly in DataFrameSuite.scala and python sql tests.py Author: animesh <animesh@apache.spark> Closes #6609 from animeshbaranawal/SPARK-7980 and squashes the following commits: 935899c [animesh] SPARK-7980:python+scala changes
* [SPARK-8060] Improve DataFrame Python test coverage and documentation.Reynold Xin2015-06-0319-227/+179
| | | | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #6601 from rxin/python-read-write-test-and-doc and squashes the following commits: baa8ad5 [Reynold Xin] Code review feedback. f081d47 [Reynold Xin] More documentation updates. c9902fa [Reynold Xin] [SPARK-8060] Improve DataFrame Python reader/writer interface doc and testing.
* [SPARK-8032] [PYSPARK] Make version checking for NumPy in MLlib more robustMechCoder2015-06-021-1/+3
| | | | | | | | | | | | | | | | The current checking does version `1.x' is less than `1.4' this will fail if x has greater than 1 digit, since x > 4, however `1.x` < `1.4` It fails in my system since I have version `1.10` :P Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6579 from MechCoder/np_ver and squashes the following commits: 15430f8 [MechCoder] fix syntax error 893fb7e [MechCoder] remove equal to e35f0d4 [MechCoder] minor e89376c [MechCoder] Better checking 22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust
* [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()Davies Liu2015-06-021-3/+28
| | | | | | | | | | | | Thanks ogirardot, closes #6580 cc rxin JoshRosen Author: Davies Liu <davies@databricks.com> Closes #6590 from davies/when and squashes the following commits: c0f2069 [Davies Liu] fix Column.when() and otherwise()
* [SPARK-7432] [MLLIB] fix flaky CrossValidator doctestXiangrui Meng2015-06-021-10/+9
| | | | | | | | | | The new test uses CV to compare `maxIter=0` and `maxIter=1`, and validate on the evaluation result. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #6572 from mengxr/SPARK-7432 and squashes the following commits: c236bb8 [Xiangrui Meng] fix flacky cv doctest
* [SPARK-8021] [SQL] [PYSPARK] make Python read/write API consistent with ScalaDavies Liu2015-06-021-27/+94
| | | | | | | | | | | | | | add schema()/format()/options() for reader, add mode()/format()/options()/partitionBy() for writer cc rxin yhuai pwendell Author: Davies Liu <davies@databricks.com> Closes #6578 from davies/readwrite and squashes the following commits: 720d293 [Davies Liu] address comments b65dfa2 [Davies Liu] Update readwriter.py 1299ab6 [Davies Liu] make Python API consistent with Scala
* [minor doc] Add exploratory data analysis warning for ↵Reynold Xin2015-06-011-0/+3
| | | | | | | | | | DataFrame.stat.freqItem API Author: Reynold Xin <rxin@databricks.com> Closes #6569 from rxin/freqItemsWarning and squashes the following commits: 7eec145 [Reynold Xin] [minor doc] Add exploratory data analysis warning for DataFrame.stat.freqItem API.
* [SPARK-7497] [PYSPARK] [STREAMING] fix streaming flaky testsDavies Liu2015-06-011-8/+8
| | | | | | | | | | | | Increase the duration and timeout in streaming python tests. Author: Davies Liu <davies@databricks.com> Closes #6239 from davies/flaky_tests and squashes the following commits: d6aee8f [Davies Liu] fix window tests 26317f7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into flaky_tests 7947db6 [Davies Liu] fix streaming flaky tests
* [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singletonDavies Liu2015-05-312-2/+25
| | | | | | | | | Author: Davies Liu <davies@databricks.com> Closes #6532 from davies/decimal and squashes the following commits: c7fcbce [Davies Liu] Update tests.py 1425359 [Davies Liu] DecimalType should not be singleton
* [MINOR] Enable PySpark SQL readerwriter and window testsJosh Rosen2015-05-311-0/+2
| | | | | | | | | | PySpark SQL's `readerwriter` and `window` doctests weren't being run by our test runner script; this patch re-enables them. Author: Josh Rosen <joshrosen@databricks.com> Closes #6542 from JoshRosen/enable-more-pyspark-sql-tests and squashes the following commits: 9f46ce4 [Josh Rosen] Enable PySpark SQL readerwriter and window tests.
* [SPARK-7918] [MLLIB] MLlib Python doc parity check for evaluation and featureYanbo Liang2015-05-302-39/+36
| | | | | | | | | | | Check then make the MLlib Python evaluation and feature doc to be as complete as the Scala doc. Author: Yanbo Liang <ybliang8@gmail.com> Closes #6461 from yanboliang/spark-7918 and squashes the following commits: 940e3f1 [Yanbo Liang] truncate too long line and remove extra sparse a80ae58 [Yanbo Liang] MLlib Python doc parity check for evaluation and feature
* [SPARK-7899] [PYSPARK] Fix Python 3 pyspark/sql/types module conflictMichael Nazario2015-05-296-58/+42
| | | | | | | | | | | | | | | This PR makes the types module in `pyspark/sql/types` work with pylint static analysis by removing the dynamic naming of the `pyspark/sql/_types` module to `pyspark/sql/types`. Tests are now loaded using `$PYSPARK_DRIVER_PYTHON -m module` rather than `$PYSPARK_DRIVER_PYTHON module.py`. The old method adds the location of `module.py` to `sys.path`, so this change prevents accidental use of relative paths in Python. Author: Michael Nazario <mnazario@palantir.com> Closes #6439 from mnazario/feature/SPARK-7899 and squashes the following commits: 366ef30 [Michael Nazario] Remove hack on random.py bb8b04d [Michael Nazario] Make doctests consistent with other tests 6ee4f75 [Michael Nazario] Change test scripts to use "-m" 673528f [Michael Nazario] Move _types back to types
* [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML ↵Xiangrui Meng2015-05-291-25/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | attributes and change includeFirst to dropLast This PR contains two major changes to `OneHotEncoder`: 1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index 2. change `includeFirst` to `dropLast` and leave the default to `true`. There are couple benefits: a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm) b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1. c. If users use `StringIndex`, the last element is the least frequent one. Sorry for including two changes in one PR! I'll update the user guide in another PR. jkbradley sryza Author: Xiangrui Meng <meng@databricks.com> Closes #6466 from mengxr/SPARK-7912 and squashes the following commits: a280dca [Xiangrui Meng] fix tests d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912 171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's 00dfd96 [Xiangrui Meng] update OneHotEncoder in Python 208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast
* [SPARK-7922] [MLLIB] use DataFrames for user/item factors in ALSModelXiangrui Meng2015-05-282-3/+32
| | | | | | | | | | | | | Expose user/item factors in DataFrames. This is to be more consistent with the pipeline API. It also helps maintain consistent APIs across languages. This PR also removed fitting params from `ALSModel`. coderxiang Author: Xiangrui Meng <meng@databricks.com> Closes #6468 from mengxr/SPARK-7922 and squashes the following commits: 7bfb1d5 [Xiangrui Meng] update ALSModel in PySpark 1ba5607 [Xiangrui Meng] use DataFrames for user/item factors in ALS
* [MINOR] fix RegressionEvaluator docXiangrui Meng2015-05-281-1/+1
| | | | | | | | | | | | | | | `make clean html` under `python/doc` returns ~~~ /Users/meng/src/spark/python/pyspark/ml/evaluation.py:docstring of pyspark.ml.evaluation.RegressionEvaluator.setParams:3: WARNING: Definition list ends without a blank line; unexpected unindent. ~~~ harsha2010 Author: Xiangrui Meng <meng@databricks.com> Closes #6469 from mengxr/fix-regression-evaluator-doc and squashes the following commits: 91e2dad [Xiangrui Meng] fix RegressionEvaluator doc
* [SPARK-7339] [PYSPARK] PySpark shuffle spill memory sometimes are not correctlinweizhong2015-05-261-4/+4
| | | | | | | | | | | | | | | | | | | | | | In PySpark we get memory used before and after spill, then use the difference of these two value as memorySpilled, but if the before value is small than after value, then we will get a negative value, but this scenario 0 value may be more reasonable. Below is the result in HistoryServer we have tested: Index ID Attempt Status Locality Level Executor ID / Host Launch Time Duration GC Time Input Size / Records Write Time Shuffle Write Size / Records Shuffle Spill (Memory) Shuffle Spill (Disk) Errors 0 0 0 SUCCESS NODE_LOCAL 3 / vm119 2015/05/04 17:31:06 21 s 0.1 s 128.1 MB (hadoop) / 3237 70 ms 10.1 MB / 2529 0.0 B 5.7 MB 2 2 0 SUCCESS NODE_LOCAL 1 / vm118 2015/05/04 17:31:06 22 s 89 ms 128.1 MB (hadoop) / 3205 0.1 s 10.1 MB / 2529 -1048576.0 B 5.9 MB 1 1 0 SUCCESS NODE_LOCAL 2 / vm117 2015/05/04 17:31:06 22 s 0.1 s 128.1 MB (hadoop) / 3271 68 ms 10.1 MB / 2529 -1048576.0 B 5.6 MB 4 4 0 SUCCESS NODE_LOCAL 2 / vm117 2015/05/04 17:31:06 22 s 0.1 s 128.1 MB (hadoop) / 3192 51 ms 10.1 MB / 2529 -1048576.0 B 5.9 MB 3 3 0 SUCCESS NODE_LOCAL 3 / vm119 2015/05/04 17:31:06 22 s 0.1 s 128.1 MB (hadoop) / 3262 51 ms 10.1 MB / 2529 1024.0 KB 5.8 MB 5 5 0 SUCCESS NODE_LOCAL 1 / vm118 2015/05/04 17:31:06 22 s 89 ms 128.1 MB (hadoop) / 3256 93 ms 10.1 MB / 2529 -1048576.0 B 5.7 MB /cc davies Author: linweizhong <linweizhong@huawei.com> Closes #5887 from Sephiroth-Lin/spark-7339 and squashes the following commits: 9186c81 [linweizhong] Use max function to get a nonnegative value d41672b [linweizhong] Update MemoryBytesSpilled when memorySpilled > 0
* [SPARK-7833] [ML] Add python wrapper for RegressionEvaluatorRam Sriharsha2015-05-241-2/+66
| | | | | | | | | | Author: Ram Sriharsha <rsriharsha@hw11853.local> Closes #6365 from harsha2010/SPARK-7833 and squashes the following commits: 923f288 [Ram Sriharsha] cleanup 7623b7d [Ram Sriharsha] python style fix 9743f83 [Ram Sriharsha] [SPARK-7833][ml] Add python wrapper for RegressionEvaluator
* [SPARK-7840] add insertInto() to WriterDavies Liu2015-05-232-8/+16
| | | | | | | | | | Add tests later. Author: Davies Liu <davies@databricks.com> Closes #6375 from davies/insertInto and squashes the following commits: 826423e [Davies Liu] add insertInto() to Writer
* [SPARK-7322, SPARK-7836, SPARK-7822][SQL] DataFrame window function related ↵Davies Liu2015-05-238-56/+365
| | | | | | | | | | | | | | | | | | | | | | | updates 1. ntile should take an integer as parameter. 2. Added Python API (based on #6364) 3. Update documentation of various DataFrame Python functions. Author: Davies Liu <davies@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #6374 from rxin/window-final and squashes the following commits: 69004c7 [Reynold Xin] Style fix. 288cea9 [Reynold Xin] Update documentaiton. 7cb8985 [Reynold Xin] Merge pull request #6364 from davies/window 66092b4 [Davies Liu] update docs ed73cb4 [Reynold Xin] [SPARK-7322][SQL] Improve DataFrame window function documentation. ef55132 [Davies Liu] Merge branch 'master' of github.com:apache/spark into window4 8936ade [Davies Liu] fix maxint in python 3 2649358 [Davies Liu] update docs 778e2c0 [Davies Liu] SPARK-7836 and SPARK-7822: Python API of window functions
* [SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4Xiangrui Meng2015-05-214-61/+64
| | | | | | | | | | | | | | | | | | | | | | | | Some changes to the pipeilne APIs: 1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does. 1. Move Evaluator to ml.evaluation. 1. Mention larger metric values are better. 1. PipelineModel doc. “compiled” -> “fitted” 1. Hide object PolynomialExpansion. 1. Hide object VectorAssembler. 1. Word2Vec.minCount (and other) -> group param 1. ParamValidators -> DeveloperApi 1. Hide MetadataUtils/SchemaUtils. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #6322 from mengxr/SPARK-7535.0 and squashes the following commits: 9e9c7da [Xiangrui Meng] move JavaEvaluator to ml.evaluation as well e179480 [Xiangrui Meng] move Evaluation to ml.evaluation in PySpark 08ef61f [Xiangrui Meng] update pipieline APIs
* [SPARK-7794] [MLLIB] update RegexTokenizer default settingsXiangrui Meng2015-05-211-21/+19
| | | | | | | | | | The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #6330 from mengxr/SPARK-7794 and squashes the following commits: 5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
* [SPARK-7783] [SQL] [PySpark] add DataFrame.rollup/cube in PythonDavies Liu2015-05-211-2/+46
| | | | | | | | | | | Author: Davies Liu <davies@databricks.com> Closes #6311 from davies/rollup and squashes the following commits: 0261db1 [Davies Liu] use @since a51ca6b [Davies Liu] Merge branch 'master' of github.com:apache/spark into rollup 8ad5af4 [Davies Liu] Update dataframe.py ade3841 [Davies Liu] add DataFrame.rollup/cube in Python
* [SPARK-7711] Add a startTime property to match the corresponding one in ScalaHolden Karau2015-05-212-0/+9
| | | | | | | | | | Author: Holden Karau <holden@pigscanfly.ca> Closes #6275 from holdenk/SPARK-771-startTime-is-missing-from-pyspark and squashes the following commits: 06662dc [Holden Karau] add mising blank line for style checks 7a87410 [Holden Karau] add back missing newline 7a7876b [Holden Karau] Add a startTime property to match the corresponding one in the Scala SparkContext
* [SPARK-7394][SQL] Add Pandas style cast (astype)kaka19922015-05-211-0/+2
| | | | | | | | | | Author: kaka1992 <kaka_1992@163.com> Closes #6313 from kaka1992/astype and squashes the following commits: 73dfd0b [kaka1992] [SPARK-7394] Add Pandas style cast (astype) ad8feb2 [kaka1992] [SPARK-7394] Add Pandas style cast (astype) 4f328b7 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
* [SPARK-6416] [DOCS] RDD.fold() requires the operator to be commutativeSean Owen2015-05-211-2/+10
| | | | | | | | | | | | | | Document current limitation of rdd.fold. This does not resolve SPARK-6416 but just documents the issue. CC JoshRosen Author: Sean Owen <sowen@cloudera.com> Closes #6231 from srowen/SPARK-6416 and squashes the following commits: 9fef39f [Sean Owen] Add comment to other languages; reword to highlight the difference from non-distributed collections and to not suggest it is a bug that is to be fixed da40d84 [Sean Owen] Document current limitation of rdd.fold.
* [SPARK-7606] [SQL] [PySpark] add version to Python SQL API docsDavies Liu2015-05-207-18/+170
| | | | | | | | | | | | | Add version info for public Python SQL API. cc rxin Author: Davies Liu <davies@databricks.com> Closes #6295 from davies/versions and squashes the following commits: cfd91e6 [Davies Liu] add more version for DataFrame API 600834d [Davies Liu] add version to SQL API docs
* [SPARK-7762] [MLLIB] set default value for outputColXiangrui Meng2015-05-202-2/+3
| | | | | | | | | | | | | Set a default value for `outputCol` instead of forcing users to name it. This is useful for intermediate transformers in the pipeline. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #6289 from mengxr/SPARK-7762 and squashes the following commits: 54edebc [Xiangrui Meng] merge master bff8667 [Xiangrui Meng] update unit test 171246b [Xiangrui Meng] add unit test for outputCol a4321bd [Xiangrui Meng] set default value for outputCol
* [SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 ↵Holden Karau2015-05-208-64/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | is quite funny but not very random Author: Holden Karau <holden@pigscanfly.ca> Closes #6139 from holdenk/SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random and squashes the following commits: 591f8e5 [Holden Karau] specify old seed for doc tests 2470004 [Holden Karau] Fix a bunch of seeds with default values to have None as the default which will then result in using the hash of the class name cbad96d [Holden Karau] Add the setParams function that is used in the real code 423b8d7 [Holden Karau] Switch the test code to behave slightly more like production code. also don't check the param map value only check for key existence 140d25d [Holden Karau] remove extra space 926165a [Holden Karau] Add some missing newlines for pep8 style 8616751 [Holden Karau] merge in master 58532e6 [Holden Karau] its the __name__ method, also treat None values as not set 56ef24a [Holden Karau] fix test and regenerate base afdaa5c [Holden Karau] make sure different classes have different results 68eb528 [Holden Karau] switch default seed to hash of type of self 89c4611 [Holden Karau] Merge branch 'master' into SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random 31cd96f [Holden Karau] specify the seed to randomforestregressor test e1b947f [Holden Karau] Style fixes ce90ec8 [Holden Karau] merge in master bcdf3c9 [Holden Karau] update docstring seeds to none and some other default seeds from 42 65eba21 [Holden Karau] pep8 fixes 0e3797e [Holden Karau] Make seed default to random in more places 213a543 [Holden Karau] Simplify the generated code to only include set default if there is a default rather than having None is note None in the generated code 1ff17c2 [Holden Karau] Make the seed random for HasSeed in python
* [SPARK-6094] [MLLIB] Add MultilabelMetrics in PySpark/MLlibYanbo Liang2015-05-201-0/+117
| | | | | | | | | | Add MultilabelMetrics in PySpark/MLlib Author: Yanbo Liang <ybliang8@gmail.com> Closes #6276 from yanboliang/spark-6094 and squashes the following commits: b8e3343 [Yanbo Liang] Add MultilabelMetrics in PySpark/MLlib
* [SPARK-7738] [SQL] [PySpark] add reader and writer API in PythonDavies Liu2015-05-195-90/+421
| | | | | | | | | | | | | | | cc rxin, please take a quick look, I'm working on tests. Author: Davies Liu <davies@databricks.com> Closes #6238 from davies/readwrite and squashes the following commits: c7200eb [Davies Liu] update tests 9cbf01b [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite f0c5a04 [Davies Liu] use sqlContext.read.load 5f68bc8 [Davies Liu] update tests 6437e9a [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite bcc6668 [Davies Liu] add reader amd writer API in Python
* [SPARK-7150] SparkContext.range() and SQLContext.range()Daoyuan Wang2015-05-184-0/+46
| | | | | | | | | | | | | | | | | | | | | | This PR is based on #6081, thanks adrian-wang. Closes #6081 Author: Daoyuan Wang <daoyuan.wang@intel.com> Author: Davies Liu <davies@databricks.com> Closes #6230 from davies/range and squashes the following commits: d3ce5fe [Davies Liu] add tests 789eda5 [Davies Liu] add range() in Python 4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range cbf5200 [Daoyuan Wang] let's add python support in a separate PR f45e3b2 [Daoyuan Wang] remove redundant toLong 617da76 [Daoyuan Wang] fix safe marge for corner cases 867c417 [Daoyuan Wang] fix 13dbe84 [Daoyuan Wang] update bd998ba [Daoyuan Wang] update comments d3a0c1b [Daoyuan Wang] add range api()
* [SPARK-6216] [PYSPARK] check python version of worker with driverDavies Liu2015-05-186-12/+16
| | | | | | | | | | | | This PR revert #5404, change to pass the version of python in driver into JVM, check it in worker before deserializing closure, then it can works with different major version of Python. Author: Davies Liu <davies@databricks.com> Closes #6203 from davies/py_version and squashes the following commits: b8fb76e [Davies Liu] fix test 6ce5096 [Davies Liu] use string for version 47c6278 [Davies Liu] check python version of worker with driver
* [SPARK-7380] [MLLIB] pipeline stages should be copyable in PythonXiangrui Meng2015-05-1813-254/+490
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes: 1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively. 2. Accept a list of param maps in `fit`. 3. Use parent uid and name to identify param. jkbradley Author: Xiangrui Meng <meng@databricks.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #6088 from mengxr/SPARK-7380 and squashes the following commits: 413c463 [Xiangrui Meng] remove unnecessary doc 4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380 611c719 [Xiangrui Meng] fix python style 68862b8 [Xiangrui Meng] update _java_obj initialization 927ad19 [Xiangrui Meng] fix ml/tests.py 0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer 9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params 7e0d27f [Xiangrui Meng] merge master 46840fb [Xiangrui Meng] update wrappers b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap 46cb6ed [Xiangrui Meng] merge master a163413 [Xiangrui Meng] fix style 1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380 9630eae [Xiangrui Meng] fix Identifiable._randomUID 13bd70a [Xiangrui Meng] update ml/tests.py 64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl 02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python 66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui 7431272 [Joseph K. Bradley] Rebased with master
* [SPARK-6657] [PYSPARK] Fix doc warningsXiangrui Meng2015-05-184-10/+11
| | | | | | | | | | | | | | | | | | | | | | | Fixed the following warnings in `make clean html` under `python/docs`: ~~~ /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:3: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:4: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:3: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:4: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.replace:16: WARNING: Field list ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:8: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:9: WARNING: Block quote ends without a blank line; unexpected unindent. ~~~ davies Author: Xiangrui Meng <meng@databricks.com> Closes #6221 from mengxr/SPARK-6657 and squashes the following commits: e3f83fe [Xiangrui Meng] fix sql and streaming doc warnings 2b4371e [Xiangrui Meng] fix mllib python doc warnings
* [SPARK-7543] [SQL] [PySpark] split dataframe.py into multiple filesDavies Liu2015-05-156-449/+552
| | | | | | | | | | | | | | | dataframe.py is splited into column.py, group.py and dataframe.py: ``` 360 column.py 1223 dataframe.py 183 group.py ``` Author: Davies Liu <davies@databricks.com> Closes #6201 from davies/split_df and squashes the following commits: fc8f5ab [Davies Liu] split dataframe.py into multiple files
* [SPARK-7073] [SQL] [PySpark] Clean up SQL data type hierarchy in PythonDavies Liu2015-05-151-30/+46
| | | | | | | | Author: Davies Liu <davies@databricks.com> Closes #6206 from davies/sql_type and squashes the following commits: 33d6860 [Davies Liu] [SPARK-7073] [SQL] [PySpark] Clean up SQL data type hierarchy in Python
* [SPARK-7651] [MLLIB] [PYSPARK] GMM predict, predictSoft should raise error ↵FlytxtRnD2015-05-151-0/+6
| | | | | | | | | | | | on bad input In the Python API for Gaussian Mixture Model, predict() and predictSoft() methods should raise an error when the input argument is not an RDD. Author: FlytxtRnD <meethu.mathew@flytxt.com> Closes #6180 from FlytxtRnD/GmmPredictException and squashes the following commits: 4b6aa11 [FlytxtRnD] Raise error if the input to predict()/predictSoft() is not an RDD
* [SPARK-6258] [MLLIB] GaussianMixture Python API parity checkYanbo Liang2015-05-151-14/+53
| | | | | | | | | | | | | | | | | | | Implement Python API for major disparities of GaussianMixture cluster algorithm between Scala & Python ```scala GaussianMixture setInitialModel GaussianMixtureModel k ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #6087 from yanboliang/spark-6258 and squashes the following commits: b3af21c [Yanbo Liang] fix typo 2b645c1 [Yanbo Liang] fix doc 638b4b7 [Yanbo Liang] address comments b5bcade [Yanbo Liang] GaussianMixture Python API parity check
* [SPARK-7548] [SQL] Add explode function for DataFramesMichael Armbrust2015-05-143-3/+44
| | | | | | | | | | | | | | | | | | | | | | | Add an `explode` function for dataframes and modify the analyzer so that single table generating functions can be present in a select clause along with other expressions. There are currently the following restrictions: - only top level TGFs are allowed (i.e. no `select(explode('list) + 1)`) - only one may be present in a single select to avoid potentially confusing implicit Cartesian products. TODO: - [ ] Python Author: Michael Armbrust <michael@databricks.com> Closes #6107 from marmbrus/explodeFunction and squashes the following commits: 7ee2c87 [Michael Armbrust] whitespace 6f80ba3 [Michael Armbrust] Update dataframe.py c176c89 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction 81b5da3 [Michael Armbrust] style d3faa05 [Michael Armbrust] fix self join case f9e1e3e [Michael Armbrust] fix python, add since 4f0d0a9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction e710fe4 [Michael Armbrust] add java and python 52ca0dc [Michael Armbrust] [SPARK-7548][SQL] Add explode function for dataframes.
* [SPARK-7619] [PYTHON] fix docstring signatureXiangrui Meng2015-05-145-55/+52
| | | | | | | | | | Just realized that we need `\` at the end of the docstring. brkyvz Author: Xiangrui Meng <meng@databricks.com> Closes #6161 from mengxr/SPARK-7619 and squashes the following commits: e44495f [Xiangrui Meng] fix docstring signature
* [SPARK-7648] [MLLIB] Add weights and intercept to GLM wrappers in spark.mlXiangrui Meng2015-05-143-1/+43
| | | | | | | | | | | Otherwise, users can only use `transform` on the models. brkyvz Author: Xiangrui Meng <meng@databricks.com> Closes #6156 from mengxr/SPARK-7647 and squashes the following commits: 1ae3d2d [Xiangrui Meng] add weights and intercept to LogisticRegression in Python f49eb46 [Xiangrui Meng] add weights and intercept to LinearRegressionModel
* [SPARK-7278] [PySpark] DateType should find datetime.datetime acceptableksonj2015-05-141-1/+1
| | | | | | | | | | DateType should not be restricted to `datetime.date` but accept `datetime.datetime` objects as well. Could someone with a little more insight verify this? Author: ksonj <kson@siberie.de> Closes #6057 from ksonj/dates and squashes the following commits: 68a158e [ksonj] DateType should find datetime.datetime acceptable too
* [SPARK-7382] [MLLIB] Feature Parity in PySpark for ml.classificationBurak Yavuz2015-05-133-10/+501
| | | | | | | | | | | | | The missing pieces in ml.classification for Python! cc mengxr Author: Burak Yavuz <brkyvz@gmail.com> Closes #6106 from brkyvz/ml-class and squashes the following commits: dd78237 [Burak Yavuz] fix style 1048e29 [Burak Yavuz] ready for PR
* [SPARK-7593] [ML] Python Api for ml.feature.BucketizerBurak Yavuz2015-05-131-0/+77
| | | | | | | | | | | | | Added `ml.feature.Bucketizer` to PySpark. cc mengxr Author: Burak Yavuz <brkyvz@gmail.com> Closes #6124 from brkyvz/ml-bucket and squashes the following commits: 05285be [Burak Yavuz] added sphinx doc 6abb6ed [Burak Yavuz] added support for Bucketizer
* [SPARK-7321][SQL] Add Column expression for conditional statements ↵Reynold Xin2015-05-123-2/+57
| | | | | | | | | | | | | | | | | | | | | | | | | (when/otherwise) This builds on https://github.com/apache/spark/pull/5932 and should close https://github.com/apache/spark/pull/5932 as well. As an example: ```python df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect() ``` Author: Reynold Xin <rxin@databricks.com> Author: kaka1992 <kaka_1992@163.com> Closes #6072 from rxin/when-expr and squashes the following commits: 8f49201 [Reynold Xin] Throw exception if otherwise is applied twice. 0455eda [Reynold Xin] Reset run-tests. bfb9d9f [Reynold Xin] Updated documentation and test cases. 762f6a5 [Reynold Xin] Merge pull request #5932 from kaka1992/IFCASE 95724c6 [kaka1992] Update 8218d0a [kaka1992] Update 801009e [kaka1992] Update 76d6346 [kaka1992] [SPARK-7321][SQL] Add Column expression for conditional statements (if, case)
* [SPARK-7572] [MLLIB] do not import Param/Params under pyspark.mlXiangrui Meng2015-05-123-7/+11
| | | | | | | | | | Remove `Param` and `Params` from `pyspark.ml` and add a section in the doc. brkyvz Author: Xiangrui Meng <meng@databricks.com> Closes #6094 from mengxr/SPARK-7572 and squashes the following commits: 022abd6 [Xiangrui Meng] do not import Param/Params under spark.ml
* [SPARK-7487] [ML] Feature Parity in PySpark for ml.regressionBurak Yavuz2015-05-126-8/+709
| | | | | | | | | | | | | | | | Added LinearRegression Python API Author: Burak Yavuz <brkyvz@gmail.com> Closes #6016 from brkyvz/ml-reg and squashes the following commits: 11c9ef9 [Burak Yavuz] address comments 1027a40 [Burak Yavuz] fix typo 4c699ad [Burak Yavuz] added tree regressor api 8afead2 [Burak Yavuz] made mixin for DT fa51c74 [Burak Yavuz] save additions 0640d48 [Burak Yavuz] added ml.regression 82aac48 [Burak Yavuz] added linear regression
* [SPARK-6876] [PySpark] [SQL] add DataFrame na.replace in pysparkDaoyuan Wang2015-05-122-0/+133
| | | | | | | | | | | | Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #6003 from adrian-wang/pynareplace and squashes the following commits: 672efba [Daoyuan Wang] remove py2.7 feature 4a148f7 [Daoyuan Wang] to_replace support dict, value support single value, and add full tests 9e232e7 [Daoyuan Wang] rename scala map af0268a [Daoyuan Wang] remove na 63ac579 [Daoyuan Wang] add na.replace in pyspark