| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Vinod K C <vinod.kc@huawei.com>
Closes #5974 from vinodkc/fix_countApproxDistinct_Validation and squashes the following commits:
3a3d59c [Vinod K C] Reverted removal of validation relativeSD<0.000017
799976e [Vinod K C] Removed testcase to assert IAE when relativeSD>3.7
8ddbfae [Vinod K C] Remove blank line
b1b00a3 [Vinod K C] Removed relativeSD validation from python API,RDD.scala will do validation
122d378 [Vinod K C] Fixed validation of relativeSD in countApproxDistinct
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds Python Api for `ALS` under `ml.recommendation` in PySpark. Also adds seed as a settable parameter in the Scala Implementation of ALS.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #6015 from brkyvz/ml-rec and squashes the following commits:
be6e931 [Burak Yavuz] addressed comments
eaed879 [Burak Yavuz] readd numFeatures
0bd66b1 [Burak Yavuz] fixed seed
7f6d964 [Burak Yavuz] merged master
52e2bda [Burak Yavuz] added ALS
|
|
|
|
|
|
|
|
|
|
|
| |
Add a Python API for mllib.feature.ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-5913
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #5939 from yanboliang/spark-5913 and squashes the following commits:
cdaac99 [Yanbo Liang] Python API for ChiSqSelector
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It's the first step: generalize UnresolvedGetField to support all map, struct, and array
TODO: add `apply` in Scala and `__getitem__` in Python, and unify the `getItem` and `getField` methods to one single API(or should we keep them for compatibility?).
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #5744 from cloud-fan/generalize and squashes the following commits:
715c589 [Wenchen Fan] address comments
7ea5b31 [Wenchen Fan] fix python test
4f0833a [Wenchen Fan] add python test
f515d69 [Wenchen Fan] add apply method and test cases
8df6199 [Wenchen Fan] fix python test
239730c [Wenchen Fan] fix test compile
2a70526 [Wenchen Fan] use _bin_op in dataframe.py
6bf72bc [Wenchen Fan] address comments
3f880c3 [Wenchen Fan] add java doc
ab35ab5 [Wenchen Fan] fix python test
b5961a9 [Wenchen Fan] fix style
c9d85f5 [Wenchen Fan] generalize UnresolvedGetField to support all map, struct, and array
|
|
|
|
|
|
|
|
|
|
|
|
| |
Multiline commands are properly handled in this PR. oefirouz
![screen shot 2015-05-07 at 10 53 25 pm](https://cloud.githubusercontent.com/assets/829644/7531290/02ad2fd4-f50c-11e4-8c04-e58d1a61ad69.png)
Author: Xiangrui Meng <meng@databricks.com>
Closes #6001 from mengxr/SPARK-7474 and squashes the following commits:
b94b11d [Xiangrui Meng] update ParamGridBuilder doctest
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implemented python wrappers for Scala functions that don't exist in `ml.features`
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5991 from brkyvz/ml-feat-PR and squashes the following commits:
adcca55 [Burak Yavuz] add regex tokenizer to __all__
b91cb44 [Burak Yavuz] addressed comments
bd39fd2 [Burak Yavuz] remove addition
b82bd7c [Burak Yavuz] Parity in PySpark for ml.features
|
|
|
|
|
|
|
|
|
|
|
| |
The compression is based on storage. brkyvz
Author: Xiangrui Meng <meng@databricks.com>
Closes #5985 from mengxr/SPARK-6948 and squashes the following commits:
df56a00 [Xiangrui Meng] update python tests
6d90d45 [Xiangrui Meng] compress vectors in VectorAssembler
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add
1. Class methods squared_dist
3. parse
4. norm
5. numNonzeros
6. copy
I made a few vectorizations wrt squared_dist and dot as well. I have added support for SparseMatrix serialization in a separate PR (https://github.com/apache/spark/pull/5775) and plan to complete support for Matrices in another PR.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #5872 from MechCoder/local_linalg_api and squashes the following commits:
a8ff1e0 [MechCoder] minor
ce3e53e [MechCoder] Add error message for parser
1bd3c04 [MechCoder] Robust parser and removed unnecessary methods
f779561 [MechCoder] [SPARK-7328] Pyspark.mllib.linalg.Vectors: Missing items
|
|
|
|
|
|
|
|
|
|
|
| |
https://issues.apache.org/jira/browse/SPARK-6093
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #5941 from yanboliang/spark-6093 and squashes the following commits:
6934af3 [Yanbo Liang] change to @property
aac3bc5 [Yanbo Liang] Add RegressionMetrics in PySpark/MLlib
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds a proxy call from PySpark to the Spark SQL coalesce function and this patch comes out of a discussion on devspark with rxin
This contribution is my original work and i license the work to the project under the project's open source license.
Olivier.
Author: Olivier Girardot <o.girardot@lateral-thoughts.com>
Closes #5698 from ogirardot/master and squashes the following commits:
d9a4439 [Olivier Girardot] SPARK-7118 Add the coalesce Spark SQL function available in PySpark
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The wrapper required the implementation of the `ArrayParam`, because `Array[T]` is hard to obtain from Python. `ArrayParam` has an extra function called `wCast` which is an internal function to obtain `Array[T]` from `Seq[T]`
Author: Burak Yavuz <brkyvz@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #5930 from brkyvz/ml-feat and squashes the following commits:
73e745f [Burak Yavuz] Merge pull request #3 from mengxr/SPARK-7388
c221db9 [Xiangrui Meng] overload StringArrayParam.w
c81072d [Burak Yavuz] addressed comments
99c2ebf [Burak Yavuz] add to python_shared_params
39ecb07 [Burak Yavuz] fix scalastyle
7f7ea2a [Burak Yavuz] [SPARK-7388][SPARK-7383] wrapper for VectorAssembler in Python
|
|
|
|
|
|
|
|
| |
Author: Shiti <ssaxena.ece@gmail.com>
Closes #5867 from Shiti/spark-7295 and squashes the following commits:
71a9913 [Shiti] implementation for bitwise and,or, not and xor on Column with tests and docs
|
|
|
|
|
|
|
|
|
|
| |
Temporarily disable flaky doctest for CrossValidator. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #5962 from mengxr/disable-pyspark-cv-test and squashes the following commits:
5db7e5b [Xiangrui Meng] disable cv doctest
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since CrossValidator is a meta algorithm, we copy the implementation in Python. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #5926 from mengxr/SPARK-6940 and squashes the following commits:
6af181f [Xiangrui Meng] add TODOs
8285134 [Xiangrui Meng] update doc
060f7c3 [Xiangrui Meng] update doctest
acac727 [Xiangrui Meng] add keyword args
cdddecd [Xiangrui Meng] add CrossValidator in Python
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
https://issues.apache.org/jira/browse/SPARK-6267
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #5890 from yanboliang/spark-6267 and squashes the following commits:
f20541d [Yanbo Liang] Merge pull request #3 from mengxr/SPARK-6267
7f202f9 [Xiangrui Meng] use Vector to have the best Python 2&3 compatibility
4bccfee [Yanbo Liang] fix doctest
ec09412 [Yanbo Liang] fix typos
8214bbb [Yanbo Liang] fix code style
5c8ebe5 [Yanbo Liang] Python API for IsotonicRegression
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After a discussion on the user mailing list, it was decided to put all UDF's under `o.a.s.sql.functions`
cc rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5923 from brkyvz/move-math-funcs and squashes the following commits:
a8dc3f7 [Burak Yavuz] address comments
cf7a7bb [Burak Yavuz] [SPARK-7358] Move DataFrame mathfunctions into functions
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: 云峤 <chensong.cs@alibaba-inc.com>
Author: kaka1992 <kaka_1992@163.com>
Closes #5839 from kaka1992/master and squashes the following commits:
b15360d [kaka1992] Fix python unit test in sql/test. =_= I forget to commit this file last time.
f928816 [kaka1992] Fix python style in sql/test.
d2e7f72 [kaka1992] Fix python style in sql/test.
c54d904 [kaka1992] Fix empty map bug.
7e64d1e [云峤] Update
7b9b858 [云峤] undo
f080f8d [云峤] update pep8
76f0c51 [云峤] Merge remote-tracking branch 'remotes/upstream/master'
7d62368 [云峤] [SPARK-7294] ADD BETWEEN
baf839b [云峤] [SPARK-7294] ADD BETWEEN
d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR adds `BinaryClassificationEvaluator` to Python ML Pipelines API, which is a simple wrapper of the Scala implementation. oefirouz
Author: Xiangrui Meng <meng@databricks.com>
Closes #5885 from mengxr/SPARK-7333 and squashes the following commits:
25d7451 [Xiangrui Meng] fix tests in python 3
babdde7 [Xiangrui Meng] fix doc
cb51e6a [Xiangrui Meng] add BinaryClassificationEvaluator in PySpark
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reduced take size from 1e8 to 1e6.
cc rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5900 from brkyvz/df-cont-followup and squashes the following commits:
c11e762 [Burak Yavuz] fix grammar
b30ace2 [Burak Yavuz] address comments
a417ba5 [Burak Yavuz] [SPARK-7243][SQL] Reduce size for Contingency Tables in DataFrames
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following items are added to Python kmeans:
kmeans - setEpsilon, setInitializationSteps
KMeansModel - computeCost, k
Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>
Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits:
b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
5fd3ced [Hrishikesh Subramonian] doc test corrections
20b3c68 [Hrishikesh Subramonian] python 3 fixes
4d4e695 [Hrishikesh Subramonian] added arguments in python tests
21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
|
|
|
|
|
|
|
|
|
|
| |
Utilities for pickling and unpickling SparseMatrices using SerDe
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #5775 from MechCoder/spark-7202 and squashes the following commits:
7e689dc [MechCoder] [SPARK-7202] Add SparseMatrixPickler to SerDe
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Computes a pair-wise frequency table of the given columns. Also known as cross-tabulation.
cc mengxr rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5842 from brkyvz/df-cont and squashes the following commits:
a07c01e [Burak Yavuz] addressed comments v4.1
ae9e01d [Burak Yavuz] fix test
9106585 [Burak Yavuz] addressed comments v4.0
bced829 [Burak Yavuz] fix merge conflicts
a63ad00 [Burak Yavuz] addressed comments v3.0
a0cad97 [Burak Yavuz] addressed comments v3.0
6805df8 [Burak Yavuz] addressed comments and fixed test
939b7c4 [Burak Yavuz] lint python
7f098bc [Burak Yavuz] add crosstab pyTest
fd53b00 [Burak Yavuz] added python support for crosstab
27a5a81 [Burak Yavuz] implemented crosstab
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: 云峤 <chensong.cs@alibaba-inc.com>
Closes #5865 from kaka1992/df.show and squashes the following commits:
c79204b [云峤] Update
a1338f6 [云峤] Update python dataFrame show test and add empty df unit test.
734369c [云峤] Update python dataFrame show test and add empty df unit test.
84aec3e [云峤] Update python dataFrame show test and add empty df unit test.
159b3d5 [云峤] update
03ef434 [云峤] update
7394fd5 [云峤] update test show
ced487a [云峤] update pep8
b6e690b [云峤] Merge remote-tracking branch 'upstream/master' into df.show
30ac311 [云峤] [SPARK-7294] ADD BETWEEN
7d62368 [云峤] [SPARK-7294] ADD BETWEEN
baf839b [云峤] [SPARK-7294] ADD BETWEEN
d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
submitting this PR from a phone, excuse the brevity.
adds Pearson correlation to Dataframes, reusing the covariance calculation code
cc mengxr rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5858 from brkyvz/df-corr and squashes the following commits:
285b838 [Burak Yavuz] addressed comments v2.0
d10babb [Burak Yavuz] addressed comments v0.2
4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr
4fe693b [Burak Yavuz] addressed comments v0.1
a682d06 [Burak Yavuz] ready for PR
|
|
|
|
|
|
|
|
|
|
|
| |
as suggested by justinuang on #5601.
Author: Xiangrui Meng <meng@databricks.com>
Closes #5873 from mengxr/SPARK-7329 and squashes the following commits:
d08f9cf [Xiangrui Meng] simplify tests
b7a7b9b [Xiangrui Meng] simplify grid build
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Omede Firouz <ofirouz@palantir.com>
Author: Omede <omedefirouz@gmail.com>
Closes #5601 from oefirouz/paramgrid and squashes the following commits:
c9e2481 [Omede Firouz] Make test a doctest
9a8ce22 [Omede] Fix linter issues
8b8a6d2 [Omede Firouz] [SPARK-7022][PySpark][ML] Add ML.Tuning.ParamGridBuilder to PySpark
|
|
|
|
|
|
|
|
| |
Author: Dean Chen <deanchen5@gmail.com>
Closes #5866 from deanchen/patch-1 and squashes the following commits:
0028bc4 [Dean Chen] Fix typo in Dataframes.py introduced in [SPARK-3444]
|
|
|
|
|
|
|
|
|
|
|
|
| |
The python api for DataFrame's plus addressed your comments from previous PR.
rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5859 from brkyvz/df-freq-py2 and squashes the following commits:
f9aa9ce [Burak Yavuz] addressed comments v0.1
4b25056 [Burak Yavuz] added python api for freqItems
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add support for changing the log level at run time through the SparkContext. Based on an earlier PR, #2433 includes CR feedback from pwendel & davies
Author: Holden Karau <holden@pigscanfly.ca>
Closes #5791 from holdenk/SPARK-3444-provide-an-easy-way-to-change-log-level-r2 and squashes the following commits:
3bf3be9 [Holden Karau] fix exception
42ba873 [Holden Karau] fix exception
9117244 [Holden Karau] Only allow valid log levels, throw exception if invalid log level.
338d7bf [Holden Karau] rename setLoggingLevel to setLogLevel
fac14a0 [Holden Karau] Fix style errors
d9d03f3 [Holden Karau] Add support for changing the log level at run time through the SparkContext. Based on an earlier PR, #2433 includes CR feedback from @pwendel & @davies
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
i don't think this should be merged until after 1.3.0 is final
Author: cody koeninger <cody@koeninger.org>
Author: Helena Edelson <helena.edelson@datastax.com>
Closes #4537 from koeninger/wip-2808-kafka-0.8.2-upgrade and squashes the following commits:
803aa2c [cody koeninger] [SPARK-2808][Streaming][Kafka] code cleanup per TD
e6dfaf6 [cody koeninger] [SPARK-2808][Streaming][Kafka] pointless whitespace change to trigger jenkins again
1770abc [cody koeninger] [SPARK-2808][Streaming][Kafka] make waitUntilLeaderOffset easier to call, call it from python tests as well
d4267e9 [cody koeninger] [SPARK-2808][Streaming][Kafka] fix stderr redirect in python test script
30d991d [cody koeninger] [SPARK-2808][Streaming][Kafka] remove stderr prints since it breaks python 3 syntax
1d896e2 [cody koeninger] [SPARK-2808][Streaming][Kafka] add even even more logging to python test
4c4557f [cody koeninger] [SPARK-2808][Streaming][Kafka] add even more logging to python test
115aeee [cody koeninger] Merge branch 'master' into wip-2808-kafka-0.8.2-upgrade
2712649 [cody koeninger] [SPARK-2808][Streaming][Kafka] add more logging to python test, see why its timing out in jenkins
2b92d3f [cody koeninger] [SPARK-2808][Streaming][Kafka] wait for leader offsets in the java test as well
3824ce3 [cody koeninger] [SPARK-2808][Streaming][Kafka] naming / comments per tdas
61b3464 [cody koeninger] [SPARK-2808][Streaming][Kafka] delay for second send in boundary condition test
af6f3ec [cody koeninger] [SPARK-2808][Streaming][Kafka] delay test until latest leader offset matches expected value
9edab4c [cody koeninger] [SPARK-2808][Streaming][Kafka] more shots in the dark on jenkins failing test
c70ee43 [cody koeninger] [SPARK-2808][Streaming][Kafka] add more asserts to test, try to figure out why it fails on jenkins but not locally
1d10751 [cody koeninger] Merge branch 'master' into wip-2808-kafka-0.8.2-upgrade
ed02d2c [cody koeninger] [SPARK-2808][Streaming][Kafka] move default argument for api version to overloaded method, for binary compat
407382e [cody koeninger] [SPARK-2808][Streaming][Kafka] update kafka to 0.8.2.1
77de6c2 [cody koeninger] Merge branch 'master' into wip-2808-kafka-0.8.2-upgrade
6953429 [cody koeninger] [SPARK-2808][Streaming][Kafka] update kafka to 0.8.2
2e67c66 [Helena Edelson] #SPARK-2808 Update to Kafka 0.8.2.0 GA from beta.
d9dc2bc [Helena Edelson] Merge remote-tracking branch 'upstream/master' into wip-2808-kafka-0.8.2-upgrade
e768164 [Helena Edelson] #2808 update kafka to version 0.8.2
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added the calculation of covariance between two columns to DataFrames.
cc mengxr rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5825 from brkyvz/df-cov and squashes the following commits:
cb18046 [Burak Yavuz] changed to sample covariance
f2e862b [Burak Yavuz] fixed failed test
51e39b8 [Burak Yavuz] moved implementation
0c6a759 [Burak Yavuz] addressed math comments
8456eca [Burak Yavuz] fix pyStyle3
aa2ad29 [Burak Yavuz] fix pyStyle2
4e97a50 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-cov
e3b0b85 [Burak Yavuz] addressed comments v0.1
a7115f1 [Burak Yavuz] fix python style
7dc6dbc [Burak Yavuz] reorder imports
408cb77 [Burak Yavuz] initial commit
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #5802 from rxin/SPARK-7274 and squashes the following commits:
19aecaa [Reynold Xin] Fixed unicode tests.
bfc1538 [Reynold Xin] Export all Python functions.
2517b8c [Reynold Xin] Code review.
23da335 [Reynold Xin] Fixed Python bug.
132002e [Reynold Xin] Fixed tests.
56fce26 [Reynold Xin] Added Python support.
b0d591a [Reynold Xin] Fixed debug error.
86926a6 [Reynold Xin] Added test suite.
7dbb9ab [Reynold Xin] Ok one more.
470e2f5 [Reynold Xin] One more MLlib ...
e2d14f0 [Reynold Xin] [SPARK-7274][SQL] Create Column expression for array/struct creation.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds
rank, recommendUsers and RecommendProducts to MatrixFactorizationModel in PySpark.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #5807 from MechCoder/spark-6257 and squashes the following commits:
09629c6 [MechCoder] doc
953b326 [MechCoder] [SPARK-6257] MLlib API missing items in Recommendation
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds the functions `rand` (Uniform Dist) and `randn` (Normal Dist.) as expressions to DataFrames.
cc mengxr rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5819 from brkyvz/df-rng and squashes the following commits:
50d69d4 [Burak Yavuz] add seed for test that failed
4234c3a [Burak Yavuz] fix Rand expression
13cad5c [Burak Yavuz] couple fixes
7d53953 [Burak Yavuz] waiting for hive tests
b453716 [Burak Yavuz] move radn with seed down
03637f0 [Burak Yavuz] fix broken hive func
c5909eb [Burak Yavuz] deleted old implementation of Rand
6d43895 [Burak Yavuz] implemented random generators
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
small fixes regarding comments in PR #5761
cc rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5795 from brkyvz/split-followup and squashes the following commits:
369c522 [Burak Yavuz] changed wording a little
1ea456f [Burak Yavuz] Addressed follow up comments
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is built on top of kaka1992 's PR #5711 using Logical plans.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5761 from brkyvz/random-sample and squashes the following commits:
a1fb0aa [Burak Yavuz] remove unrelated file
69669c3 [Burak Yavuz] fix broken test
1ddb3da [Burak Yavuz] copy base
6000328 [Burak Yavuz] added python api and fixed test
3c11d1b [Burak Yavuz] fixed broken test
f400ade [Burak Yavuz] fix build errors
2384266 [Burak Yavuz] addressed comments v0.1
e98ebac [Burak Yavuz] [SPARK-7156][SQL] support RandomSplit in DataFrames
|
|
|
|
|
|
|
|
|
|
| |
I believe column access via `__getattr__` is bad and shouldn't be implicitly encouraged by the error message when accessing a non-existing attribute on DataFrame. This patch changes the error message from 'no such column' to the more generic 'no such attribute', which is also what Pandas DFs will throw.
Author: ksonj <kson@siberie.de>
Closes #5771 from ksonj/master and squashes the following commits:
bcc2220 [ksonj] Better error message on access to non-existing attribute
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds SQL to the set of excluded libraries when
generating a callSite. This makes the callSite mechanism work
properly for the data frame API. I also added a small improvement for
JDBC queries where we just use the string "Spark JDBC Server Query"
instead of trying to give a callsite that doesn't make any sense
to the user.
Before (DF):
![screen shot 2015-04-28 at 1 29 26 pm](https://cloud.githubusercontent.com/assets/320616/7380170/ef63bfb0-edae-11e4-989c-f88a5ba6bbee.png)
After (DF):
![screen shot 2015-04-28 at 1 34 58 pm](https://cloud.githubusercontent.com/assets/320616/7380181/fa7f6d90-edae-11e4-9559-26f163ed63b8.png)
After (JDBC):
![screen shot 2015-04-28 at 2 00 10 pm](https://cloud.githubusercontent.com/assets/320616/7380185/02f5b2a4-edaf-11e4-8e5b-99bdc3df66dd.png)
Author: Patrick Wendell <patrick@databricks.com>
Closes #5757 from pwendell/dataframes and squashes the following commits:
0d931a4 [Patrick Wendell] Attempting to fix PySpark tests
85bf740 [Patrick Wendell] [SPARK-7204] Fix callsite for dataframe operations.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds support for the math functions for DataFrames in PySpark.
rxin I love Davies.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5750 from brkyvz/python-math-udfs and squashes the following commits:
7c4f563 [Burak Yavuz] removed is_math
3c4adde [Burak Yavuz] cleanup imports
d5dca3f [Burak Yavuz] moved math functions to mathfunctions
25e6534 [Burak Yavuz] addressed comments v2.0
d3f7e0f [Burak Yavuz] addressed comments and added tests
7b7d7c4 [Burak Yavuz] remove tests for removed methods
33c2c15 [Burak Yavuz] fixed python style
3ee0c05 [Burak Yavuz] added python functions
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
linalg.py
Added Matrix, SparseMatrix to __all__ list in linalg.py
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #5759 from jkbradley/SPARK-7208 and squashes the following commits:
deb51a2 [Joseph K. Bradley] Added Matrix, SparseMatrix to __all__ list in linalg.py
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #5709 from rxin/inc-id and squashes the following commits:
7853611 [Reynold Xin] private sql.
a9fda0d [Reynold Xin] Missed a few numbers.
343d896 [Reynold Xin] Self review feedback.
a7136cb [Reynold Xin] [SPARK-7135][SQL] DataFrame expression for monotonically increasing IDs.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently only added `createDirectStream` API, I'm not sure if `createRDD` is also needed, since some Java object needs to be wrapped in Python. Please help to review, thanks a lot.
Author: jerryshao <saisai.shao@intel.com>
Author: Saisai Shao <saisai.shao@intel.com>
Closes #4723 from jerryshao/direct-kafka-python-api and squashes the following commits:
a1fe97c [jerryshao] Fix rebase issue
eebf333 [jerryshao] Address the comments
da40f4e [jerryshao] Fix Python 2.6 Syntax error issue
5c0ee85 [jerryshao] Style fix
4aeac18 [jerryshao] Fix bug in example code
7146d86 [jerryshao] Add unit test
bf3bdd6 [jerryshao] Add more APIs and address the comments
f5b3801 [jerryshao] Small style fix
8641835 [Saisai Shao] Rebase and update the code
589c05b [Saisai Shao] Fix the style
d6fcb6a [Saisai Shao] Address the comments
dfda902 [Saisai Shao] Style fix
0f7d168 [Saisai Shao] Add the doc and fix some style issues
67e6880 [Saisai Shao] Fix test bug
917b0db [Saisai Shao] Add Python createRDD API for Kakfa direct stream
c3fc11d [jerryshao] Modify the docs
2c00936 [Saisai Shao] address the comments
3360f44 [jerryshao] Fix code style
e0e0f0d [jerryshao] Code clean and bug fix
338c41f [Saisai Shao] Add python API and example for direct kafka stream
|
|
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #5705 from rxin/df-pid and squashes the following commits:
401018f [Reynold Xin] [SPARK-7152][SQL] Add a Column expression for partition ID.
|
|
|
|
|
|
|
|
|
|
| |
This pr tries to provide a way to let python users workaround https://issues.apache.org/jira/browse/SPARK-6231.
Author: Yin Huai <yhuai@databricks.com>
Closes #5634 from yhuai/pythonDFAlias and squashes the following commits:
8465acd [Yin Huai] Add an alias to a Python DF.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
with Java API
Make PySpark ```FPGrowthModel.freqItemsets``` consistent with Java/Scala API like ```MatrixFactorizationModel.userFeatures```
It return a RDD with each tuple is composed of an array and a long value.
I think it's difficult to implement namedtuples to wrap the output because items of freqItemsets can be any type with arbitrary length which is tedious to impelement corresponding SerDe function.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #5614 from yanboliang/spark-6827 and squashes the following commits:
da8c404 [Yanbo Liang] use namedtuple
5532e78 [Yanbo Liang] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API
|
|
|
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #5638 from rxin/joinUsing and squashes the following commits:
13e9cc9 [Reynold Xin] Code review + Python.
b1bd914 [Reynold Xin] [SPARK-7059][SQL] Create a DataFrame join API to facilitate equijoin and self join.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR try to speed up some python tests:
```
tests.py 144s -> 103s -41s
mllib/classification.py 24s -> 17s -7s
mllib/regression.py 27s -> 15s -12s
mllib/tree.py 27s -> 13s -14s
mllib/tests.py 64s -> 31s -33s
streaming/tests.py 185s -> 84s -101s
```
Considering python3, the total saving will be 558s (almost 10 minutes) (core, and streaming run three times, mllib runs twice).
During testing, it will show used time for each test file:
```
Run core tests ...
Running test: pyspark/rdd.py ... ok (22s)
Running test: pyspark/context.py ... ok (16s)
Running test: pyspark/conf.py ... ok (4s)
Running test: pyspark/broadcast.py ... ok (4s)
Running test: pyspark/accumulators.py ... ok (4s)
Running test: pyspark/serializers.py ... ok (6s)
Running test: pyspark/profiler.py ... ok (5s)
Running test: pyspark/shuffle.py ... ok (1s)
Running test: pyspark/tests.py ... ok (103s) 144s
```
Author: Reynold Xin <rxin@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #5605 from rxin/python-tests-speed and squashes the following commits:
d08542d [Reynold Xin] Merge pull request #14 from mengxr/SPARK-6953
89321ee [Xiangrui Meng] fix seed in tests
3ad2387 [Reynold Xin] Merge pull request #5427 from davies/python_tests
|
|
|
|
|
|
|
|
|
|
| |
SchemaRDD works with ALS.train in 1.2, so we should continue support DataFrames for compatibility. coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes #5619 from mengxr/SPARK-7036 and squashes the following commits:
dfcaf5a [Xiangrui Meng] ALS.train should support DataFrames in PySpark
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since sparse matrices now support a isTransposed flag for row major data, DenseMatrices should do the same.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #5455 from MechCoder/spark-6845 and squashes the following commits:
525c370 [MechCoder] minor
004a37f [MechCoder] Cast boolean to int
151f3b6 [MechCoder] [WIP] Add isTransposed to pickle DenseMatrix
cc0b90a [MechCoder] [SPARK-6845] Add isTranposed flag to DenseMatrix
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR enable auto_convert in JavaGateway, then we could register a converter for a given types, for example, date and datetime.
There are two bugs related to auto_convert, see [1] and [2], we workaround it in this PR.
[1] https://github.com/bartdag/py4j/issues/160
[2] https://github.com/bartdag/py4j/issues/161
cc rxin JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes #5570 from davies/py4j_date and squashes the following commits:
eb4fa53 [Davies Liu] fix tests in python 3
d17d634 [Davies Liu] rollback changes in mllib
2e7566d [Davies Liu] convert tuple into ArrayList
ceb3779 [Davies Liu] Update rdd.py
3c373f3 [Davies Liu] support date and datetime by auto_convert
cb094ff [Davies Liu] enable auto convert
|