path: root/python
Format: commit message (author, date; files changed, lines -removed/+added)
* [MINOR][ML] Use coefficients instead of weights (Yanbo Liang, 2015-12-03; 2 files, -2/+2)
  Use `coefficients` instead of `weights`; I hope these are the last two occurrences. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #10065 from yanboliang/coefficients.
* [SPARK-12090][PYSPARK] Consider shuffle in coalesce() (Davies Liu, 2015-12-01; 1 file, -1/+1)
  Author: Davies Liu <davies@databricks.com> Closes #10090 from davies/fix_coalesce.
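  The fix concerns how `coalesce()` accounts for its optional shuffle. A minimal sketch of the RDD API involved (hypothetical data; `shuffle` defaults to False):

  ```python
  from pyspark import SparkContext

  sc = SparkContext("local[4]", "coalesce-example")
  rdd = sc.parallelize(range(100), 8)

  # Without a shuffle, coalesce can only merge existing partitions downward.
  narrowed = rdd.coalesce(2)
  # With shuffle=True the data is redistributed, so the count may also grow.
  reshuffled = rdd.coalesce(16, shuffle=True)

  print(narrowed.getNumPartitions(), reshuffled.getNumPartitions())  # 2 16
  ```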
* [SPARK-12002][STREAMING][PYSPARK] Fix Python direct stream checkpoint recovery issue (jerryshao, 2015-12-01; 2 files, -6/+56)
  Fixed a minor race condition in #10017. Closes #10017. Author: jerryshao <sshao@hortonworks.com> Author: Shixiong Zhu <shixiong@databricks.com> Closes #10074 from zsxwing/review-pr10017.
* [SPARK-12058][HOTFIX] Disable KinesisStreamTests (Shixiong Zhu, 2015-11-30; 1 file, -0/+1)
  KinesisStreamTests in test.py is broken because of #9403; see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46896/testReport/(root)/KinesisStreamTests/test_kinesis_stream/. Because the Python streaming tests were not working when https://github.com/apache/spark/pull/9403 was merged, the PR build did not actually report the Python test failure. This PR just disables the test to unblock #10039. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10047 from zsxwing/disable-python-kinesis-test.
* [SPARK-11917][PYSPARK] Add SQLContext#dropTempTable to PySpark (Jeff Zhang, 2015-11-26; 1 file, -0/+9)
  Author: Jeff Zhang <zjffdu@apache.org> Closes #9903 from zjffdu/SPARK-11917.
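  A minimal usage sketch of the newly exposed method (hypothetical table name; mirrors the existing Scala `SQLContext.dropTempTable`):

  ```python
  from pyspark import SparkContext
  from pyspark.sql import SQLContext

  sc = SparkContext("local", "droptemptable-example")
  sqlContext = SQLContext(sc)

  df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
  sqlContext.registerDataFrameAsTable(df, "temp_table")
  sqlContext.sql("SELECT COUNT(*) FROM temp_table").show()

  # After dropping, the temp table is no longer queryable.
  sqlContext.dropTempTable("temp_table")
  ```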
* [SPARK-11980][SPARK-10621][SQL] Fix json_tuple and add test cases (gatorsmile, 2015-11-25; 1 file, -10/+34)
  Added Python test cases for the functions `isnan`, `isnull`, `nanvl` and `json_tuple`, and fixed a bug in `json_tuple`. rxin, could you help me review my changes? Please let me know if anything is missing. Thank you! Have a good Thanksgiving day! Author: gatorsmile <gatorsmile@gmail.com> Closes #9977 from gatorsmile/json_tuple.
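  A short sketch of `json_tuple`, the function being fixed (hypothetical JSON column; extracts several top-level fields in one pass):

  ```python
  from pyspark.sql.functions import json_tuple

  # assumes `sqlContext` is available, as in the PySpark shell
  df = sqlContext.createDataFrame(
      [('{"f1": "value1", "f2": "value2"}',)], ["jstring"])
  df.select(json_tuple(df.jstring, "f1", "f2")).show()
  ```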
* [SPARK-11935][PYSPARK] Send the Python exceptions in TransformFunction and TransformFunctionSerializer to Java (Shixiong Zhu, 2015-11-25; 2 files, -10/+101)
  The Python exception traceback in TransformFunction and TransformFunctionSerializer was not sent back to Java; Py4J just threw a very general exception, which is hard to debug. This PR adds a `getFailure` method to get the failure message on the Java side. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9922 from zsxwing/SPARK-11935.
* [SPARK-11969][SQL][PYSPARK] Visualization of SQL query for PySpark (Davies Liu, 2015-11-25; 1 file, -1/+1)
  Currently we do not have visualization for SQL queries issued from Python; this PR fixes that. cc zsxwing Author: Davies Liu <davies@databricks.com> Closes #9949 from davies/pyspark_sql_ui.
* [SPARK-11984][SQL][PYTHON] Fix typos in doc for pivot for Scala and Python (felixcheung, 2015-11-25; 1 file, -3/+3)
  Author: felixcheung <felixcheung_m@hotmail.com> Closes #9967 from felixcheung/pypivotdoc.
* [SPARK-11860][PYSPARK][DOCUMENTATION] Invalid argument specification for registerFunction (Jeff Zhang, 2015-11-25; 1 file, -2/+3)
  Straightforward change to the Python doc for `registerFunction`. Author: Jeff Zhang <zjffdu@apache.org> Closes #9901 from zjffdu/SPARK-11860.
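  For context, a sketch of the API whose doc was corrected (assumes a `sqlContext`, as in the PySpark shell; the return type argument is optional and defaults to StringType):

  ```python
  from pyspark.sql.types import IntegerType

  sqlContext.registerFunction("strLen", lambda s: len(s), IntegerType())
  sqlContext.sql("SELECT strLen('test')").show()  # -> 4
  ```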
* [SPARK-10621][SQL] Consistent naming for functions in SQL, Python, Scala (Reynold Xin, 2015-11-24; 1 file, -17/+94)
  Author: Reynold Xin <rxin@databricks.com> Closes #9948 from rxin/SPARK-10621.
* [SPARK-11967][SQL] Consistent use of varargs for multiple paths in DataFrameReader (Reynold Xin, 2015-11-24; 1 file, -7/+12)
  This patch makes it consistent to use varargs in all DataFrameReader methods, including Parquet, JSON, text, and the generic load function. Also added a few more API tests for the Java API. Author: Reynold Xin <rxin@databricks.com> Closes #9945 from rxin/SPARK-11967.
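  On the Python side, a sketch of reading several paths in one call (hypothetical paths; `parquet` takes varargs, while `text` also accepts a list of paths):

  ```python
  # assumes `sqlContext` is available, as in the PySpark shell
  df1 = sqlContext.read.parquet("data/day1.parquet", "data/day2.parquet")
  df2 = sqlContext.read.text(["logs/a.txt", "logs/b.txt"])
  ```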
* [SPARK-11946][SQL] Audit pivot API for 1.6 (Reynold Xin, 2015-11-24; 1 file, -4/+8)
  Currently pivot's signature looks like:

  ```scala
  scala.annotation.varargs
  def pivot(pivotColumn: Column, values: Column*): GroupedData

  scala.annotation.varargs
  def pivot(pivotColumn: String, values: Any*): GroupedData
  ```

  I think we can remove the one that takes "Column" types, since callers should always be passing in literals. It'd also be more clear if the values are not varargs, but rather Seq or java.util.List. I also made similar changes for Python. Author: Reynold Xin <rxin@databricks.com> Closes #9929 from rxin/SPARK-11946.
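  The resulting Python call, sketched with the hypothetical course/earnings example from the PySpark docs:

  ```python
  # assumes `df` has columns "year", "course", and "earnings"
  df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings")
  # Omitting the values list also works, at the cost of an extra pass
  # to compute the distinct values of "course":
  df.groupBy("year").pivot("course").sum("earnings")
  ```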
* [SPARK-10560][PYSPARK][MLLIB][DOCS] Make StreamingLogisticRegressionWithSGD Python API equal to Scala one (Bryan Cutler, 2015-11-23; 2 files, -23/+46)
  This brings the API documentation of StreamingLogisticRegressionWithSGD and StreamingLinearRegressionWithSGD in line with the Scala versions: fixed the algorithm descriptions, added default values to parameter descriptions, and changed the StreamingLogisticRegressionWithSGD regParam default to 0, as in the Scala version. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9141 from BryanCutler/StreamingLogisticRegressionWithSGD-python-api-sync.
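  A minimal sketch of the class in question (assumes `trainingStream` and `featureStream` are existing DStreams of LabeledPoint and feature vectors, respectively):

  ```python
  from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD
  from pyspark.mllib.linalg import Vectors

  model = StreamingLogisticRegressionWithSGD(stepSize=0.1, numIterations=50)
  model.setInitialWeights(Vectors.dense([0.0, 0.0]))
  model.trainOn(trainingStream)            # update the model on each batch
  model.predictOn(featureStream).pprint()  # emit predictions per batch
  ```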
* [SPARK-11836][SQL] udf/cast should not create new SQLContext (Davies Liu, 2015-11-23; 2 files, -6/+8)
  They should use the existing SQLContext. Author: Davies Liu <davies@databricks.com> Closes #9914 from davies/create_udf.
* [SPARK-11870][STREAMING][PYSPARK] Rethrow the exceptions in TransformFunction and TransformFunctionSerializer (Shixiong Zhu, 2015-11-20; 2 files, -0/+19)
  TransformFunction and TransformFunctionSerializer don't rethrow exceptions, so when any exception happens they just return None. This causes weird NPEs and confuses people. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9847 from zsxwing/pyspark-streaming-exception.
* [SPARK-11875][ML][PYSPARK] Update doc for PySpark HasCheckpointInterval (Yanbo Liang, 2015-11-19; 2 files, -9/+11)
  Update the doc for PySpark `HasCheckpointInterval` so that users understand how to disable checkpointing, and update the doc for `cacheNodeIds` of `DecisionTreeParams` to note its relationship with `checkpointInterval`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9856 from yanboliang/spark-11875.
* [SPARK-11812][PYSPARK] invFunc=None works properly with Python's reduceByKeyAndWindow (David Tolpin, 2015-11-19; 2 files, -3/+14)
  invFunc is optional and can be None. Instead of invFunc (the parameter), invReduceFunc (a local function) was checked for truthiness; a local function is never None, so the case of invFunc=None (a common one when inverse reduction is not defined) was treated incorrectly, resulting in loss of data. In addition, the docstring used wrong parameter names, also fixed. Author: David Tolpin <david.tolpin@gmail.com> Closes #9775 from dtolpin/master.
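  A sketch of both call styles (assumes `pairs` is an existing DStream of (key, count) tuples; durations are in seconds):

  ```python
  # With an inverse function, values leaving the window are subtracted out.
  sums = pairs.reduceByKeyAndWindow(
      lambda a, b: a + b,   # reduce
      lambda a, b: a - b,   # inverse reduce
      windowDuration=30, slideDuration=10)

  # invFunc=None is the case this fix repairs: each window is recomputed in full.
  sums_full = pairs.reduceByKeyAndWindow(
      lambda a, b: a + b, None, windowDuration=30, slideDuration=10)
  ```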
* [SPARK-11820][ML][PYSPARK] PySpark LiR & LoR should support weightCol (Yanbo Liang, 2015-11-18; 2 files, -16/+17)
  [SPARK-7685](https://issues.apache.org/jira/browse/SPARK-7685) and [SPARK-9642](https://issues.apache.org/jira/browse/SPARK-9642) already support setting a weight column for `LogisticRegression` and `LinearRegression`. It's a very important feature that PySpark should also support. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9811 from yanboliang/spark-11820.
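  A minimal sketch of the new parameter (assumes `train_df` is an existing DataFrame with "features", "label", and a per-row "weight" column):

  ```python
  from pyspark.ml.classification import LogisticRegression

  lr = LogisticRegression(maxIter=10, weightCol="weight")
  model = lr.fit(train_df)
  print(model.coefficients, model.intercept)
  ```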
* [SPARK-11720][SQL][ML] Handle edge cases when count = 0 or 1 for Stats function (JihongMa, 2015-11-18; 1 file, -1/+1)
  Return Double.NaN for mean/average when count == 0 for all numeric types that are converted to Double; the Decimal type continues to return null. Author: JihongMa <linlin200605@gmail.com> Closes #9705 from JihongMA/SPARK-11720.
* [SPARK-11804][PYSPARK] Exception raised when using JDBC predicates option in PySpark (Jeff Zhang, 2015-11-18; 2 files, -5/+18)
  Author: Jeff Zhang <zjffdu@apache.org> Closes #9791 from zjffdu/SPARK-11804.
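  A sketch of the `predicates` option involved in the fix (hypothetical connection settings; each predicate becomes one partition of the result):

  ```python
  # assumes `sqlContext` is available, as in the PySpark shell
  df = sqlContext.read.jdbc(
      url="jdbc:postgresql://localhost/testdb",  # hypothetical URL
      table="orders",
      predicates=["id < 1000", "id >= 1000"],    # one partition per predicate
      properties={"user": "test", "password": "test"})
  ```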
* [SPARK-9065][STREAMING][PYSPARK] Add MessageHandler for Kafka Python API (jerryshao, 2015-11-17; 2 files, -12/+134)
  Fixed the merge conflicts in #7410. Closes #7410. Author: Shixiong Zhu <shixiong@databricks.com> Author: jerryshao <saisai.shao@intel.com> Author: jerryshao <sshao@hortonworks.com> Closes #9742 from zsxwing/pr7410.
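  A sketch of the new hook (hypothetical topic and broker; the handler receives the full message with metadata rather than just the key/value pair):

  ```python
  from pyspark.streaming.kafka import KafkaUtils

  # assumes `ssc` is an existing StreamingContext
  stream = KafkaUtils.createDirectStream(
      ssc, ["events"], {"metadata.broker.list": "localhost:9092"},
      messageHandler=lambda m: (m.topic, m.partition, m.message))
  ```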
* [SPARK-11740][STREAMING] Fix the race condition of two checkpoints in a batch (Shixiong Zhu, 2015-11-17; 1 file, -5/+4)
  We checkpoint both when generating a batch and when completing a batch. When the processing time of a batch is greater than the batch interval, checkpointing for completing an old batch may run after checkpointing for generating a new batch. If this happens, the checkpoint of the old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we can always recover from the latest checkpoint file. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9707 from zsxwing/fix-checkpoint.
* [SPARK-6328][PYTHON] Python API for StreamingListener (Daniel Jalova, 2015-11-16; 4 files, -2/+210)
  Author: Daniel Jalova <djalova@us.ibm.com> Closes #9186 from djalova/SPARK-6328.
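  A minimal sketch of the new listener API (callback names mirror the Scala StreamingListener; the event objects are thin wrappers over their Java counterparts):

  ```python
  from pyspark.streaming.listener import StreamingListener

  class BatchLogger(StreamingListener):
      def onBatchCompleted(self, batchCompleted):
          # batchInfo() exposes the underlying info for the finished batch
          print("records in batch:", batchCompleted.batchInfo().numRecords())

  # assumes `ssc` is an existing StreamingContext
  ssc.addStreamingListener(BatchLogger())
  ```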
* [SPARK-11745][SQL] Enable more JSON parsing options (Reynold Xin, 2015-11-16; 1 file, -0/+10)
  This patch adds the following options to the JSON data source, for dealing with non-standard JSON files:
  * `allowComments` (default `false`): ignores Java/C++ style comments in JSON records
  * `allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names
  * `allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
  * `allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012)
  To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options. Also updated documentation to explain these options. (Screenshots of the updated Scala and Python docs omitted.) Author: Reynold Xin <rxin@databricks.com> Closes #9724 from rxin/SPARK-11745.
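  A sketch of reading non-standard JSON with the new options (hypothetical path):

  ```python
  # assumes `sqlContext` is available, as in the PySpark shell
  df = (sqlContext.read
        .option("allowComments", "true")
        .option("allowSingleQuotes", "true")
        .option("allowNumericLeadingZeros", "true")
        .json("data/messy.json"))
  ```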
* [SPARK-11690][PYSPARK] Add pivot to Python API (Andrew Ray, 2015-11-13; 1 file, -1/+23)
  This PR adds pivot to the Python API of GroupedData, with the same syntax as Scala/Java. Author: Andrew Ray <ray.andrew@gmail.com> Closes #9653 from aray/sql-pivot-python.
* [SPARK-11706][STREAMING] Fix the bug that Streaming Python tests cannot report failures (Shixiong Zhu, 2015-11-13; 1 file, -10/+20)
  This PR just checks the test results and returns 1 if a test fails, so that `run-tests.py` can mark it as failed. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9669 from zsxwing/streaming-python-tests.
* [SPARK-11658] Simplify documentation for PySpark combineByKey (Chris Snow, 2015-11-12; 1 file, -1/+0)
  Author: Chris Snow <chsnow123@gmail.com> Closes #9640 from snowch/patch-3.
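  For reference, a sketch of the documented operation, computing a per-key average (assumes `sc` is a SparkContext):

  ```python
  rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
  sum_count = rdd.combineByKey(
      lambda v: (v, 1),                          # createCombiner
      lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue
      lambda a, b: (a[0] + b[0], a[1] + b[1]))   # mergeCombiners
  averages = sum_count.mapValues(lambda p: p[0] / float(p[1]))
  print(sorted(averages.collect()))  # [('a', 2.0), ('b', 2.0)]
  ```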
* [SPARK-11671] Fix documentation code example typo (Chris Snow, 2015-11-12; 1 file, -1/+1)
  The example for `sqlContext.createDataFrame` from a pandas.DataFrame had a typo (`createDataDrame`). Author: Chris Snow <chsnow123@gmail.com> Closes #9639 from snowch/patch-2.
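  The corrected call, sketched (requires pandas; assumes `sqlContext` as in the PySpark shell):

  ```python
  import pandas

  pdf = pandas.DataFrame({"id": [1, 2], "value": ["a", "b"]})
  df = sqlContext.createDataFrame(pdf)
  df.show()
  ```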
* [SPARK-11420] Updating Stddev support via Imperative Aggregate (JihongMa, 2015-11-12; 1 file, -1/+1)
  Switched stddev support from DeclarativeAggregate to ImperativeAggregate. Author: JihongMa <linlin200605@gmail.com> Closes #9380 from JihongMA/SPARK-11420.
* [SPARK-11463][PYSPARK] Only install signal in main thread (Davies Liu, 2015-11-10; 1 file, -1/+4)
  Only install the signal handler in the main thread; otherwise creating a context in a non-main thread fails. Author: Davies Liu <davies@databricks.com> Closes #9574 from davies/python_signal.
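  A sketch of the guard pattern in plain Python (`signal.signal` raises ValueError when called from any thread other than the main one):

  ```python
  import signal
  import threading

  # Install the handler only on the main thread; other threads skip it.
  if isinstance(threading.current_thread(), threading._MainThread):
      signal.signal(signal.SIGINT, signal.default_int_handler)
  ```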
* [SPARK-11566][MLLIB][PYTHON] Refactoring GaussianMixtureModel.gaussians in Python (Yu ISHIKAWA, 2015-11-10; 1 file, -1/+1)
  cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9534 from yu-iskw/SPARK-11566.
* [SPARK-11567][PYTHON] Add Python API for corr aggregate function (felixcheung, 2015-11-10; 1 file, -0/+16)
  Like `df.agg(corr("col1", "col2"))`. davies Author: felixcheung <felixcheung_m@hotmail.com> Closes #9536 from felixcheung/pyfunc.
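  A runnable sketch of the new aggregate (hypothetical two-column DataFrame; assumes `sqlContext` as in the PySpark shell):

  ```python
  from pyspark.sql.functions import corr

  df = sqlContext.createDataFrame(
      [(1.0, 2.0), (2.0, 4.1), (3.0, 6.2)], ["col1", "col2"])
  df.agg(corr("col1", "col2").alias("pearson")).show()
  ```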
* [SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate operator used to evaluate AggregateExpression1s (Yin Huai, 2015-11-10; 3 files, -3/+3)
  https://issues.apache.org/jira/browse/SPARK-9830 This PR contains the following main changes:
  * Removing `AggregateExpression1`.
  * Removing the `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
  * Removing the planner rule used to plan `Aggregate`.
  * Linking `MultipleDistinctRewriter` to the analyzer.
  * Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
  * Updating places where we create aggregate expressions. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
  * Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create an aggregate expression in the DataFrame API, children of an aggregate function can be unresolved).
  Author: Yin Huai <yhuai@databricks.com> Closes #9556 from yhuai/removeAgg1.
* [SPARK-11610][MLLIB][PYTHON][DOCS] Make the docs of LDAModel.describeTopics in Python more specific (Yu ISHIKAWA, 2015-11-09; 1 file, -0/+6)
  cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9577 from yu-iskw/SPARK-11610.
* [SPARK-9301][SQL] Add collect_set and collect_list aggregate functions (Nick Buroojy, 2015-11-09; 2 files, -11/+31)
  For now they are thin wrappers around the corresponding Hive UDAFs. One limitation with these in Hive 0.13.0 is that they only support aggregating primitive types. I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word functions. Do we also want to add these to `functions.py`? This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089 marmbrus rxin Author: Nick Buroojy <nick.buroojy@civitaslearning.com> Closes #9526 from nburoojy/nick/udaf-alias. (cherry picked from commit a6ee4f989d020420dd08b97abb24802200ff23b2) Signed-off-by: Michael Armbrust <michael@databricks.com>
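  A sketch of the two aggregates (since they wrap Hive UDAFs at this point, they may require a HiveContext; column names are hypothetical):

  ```python
  from pyspark.sql.functions import collect_list, collect_set

  # assumes `sqlContext` is a HiveContext, as in the PySpark shell
  df = sqlContext.createDataFrame(
      [("a", 1), ("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
  df.groupBy("key").agg(
      collect_list("value"),  # keeps duplicates, e.g. [1, 1, 2] for "a"
      collect_set("value"),   # de-duplicates,    e.g. [1, 2]    for "a"
  ).show()
  ```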
* [SPARK-10280][MLLIB][PYSPARK][DOCS] Add @since annotation to pyspark.ml.classification (Yu ISHIKAWA, 2015-11-09; 1 file, -0/+56)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8690 from yu-iskw/SPARK-10280.
* [SPARK-8467][MLLIB][PYSPARK] Add LDAModel.describeTopics() in Python (Yu ISHIKAWA, 2015-11-06; 1 file, -15/+18)
  Could jkbradley and davies review it?
  * Create a wrapper class `LDAModelWrapper` for `LDAModel`, because we can't deal with the return value of `describeTopics` in Scala from PySpark directly; `Array[(Array[Int], Array[Double])]` is too complicated to convert.
  * Add `loadLDAModel` in `PythonMLlibAPI`, since `LDAModel` in Scala is an abstract class and we need to call `load` on `DistributedLDAModel`.
  [[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8643 from yu-iskw/SPARK-8467-2.
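  A minimal sketch of the new method (tiny hypothetical corpus of (docId, termCounts) pairs; assumes `sc` is a SparkContext):

  ```python
  from pyspark.mllib.clustering import LDA
  from pyspark.mllib.linalg import Vectors

  corpus = sc.parallelize([
      [0, Vectors.dense([1.0, 2.0, 0.0])],
      [1, Vectors.dense([0.0, 1.0, 3.0])],
  ])
  model = LDA.train(corpus, k=2)
  # Each topic is a (term indices, term weights) pair, strongest terms first.
  for terms, weights in model.describeTopics(maxTermsPerTopic=2):
      print(terms, weights)
  ```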
* [HOTFIX] Fix Python tests after #9527 (Michael Armbrust, 2015-11-06; 1 file, -1/+1)
  #9527 missed updating the Python tests. Author: Michael Armbrust <michael@databricks.com> Closes #9533 from marmbrus/hotfixTextValue.
* [SPARK-11410][PYSPARK] Add Python bindings for repartition and sortWithinPartitions (Nong Li, 2015-11-06; 1 file, -16/+101)
  Author: Nong Li <nong@databricks.com> Closes #9504 from nongli/spark-11410.
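  A sketch of the two new DataFrame methods (assumes `df` has columns "name" and "age"):

  ```python
  # Hash-partition by a column, with an explicit partition count.
  by_name = df.repartition(10, "name")
  # Sort rows within each partition, avoiding a full global sort.
  sorted_parts = by_name.sortWithinPartitions("age", ascending=False)
  ```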
* [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits (Imran Rashid, 2015-11-06; 4 files, -18/+18)
  https://issues.apache.org/jira/browse/SPARK-10116 This is really trivial, just happened to notice it: if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod Author: Imran Rashid <irashid@cloudera.com> Closes #8314 from squito/SPARK-10116.
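  The idea, sketched in Python rather than the Scala source (a hypothetical re-implementation, not Spark's code): hash all eight bytes of the seed so the high 32 bits of the result are as random as the low 32.

  ```python
  import hashlib
  import struct

  def hash_seed(seed):
      # Hash the full 8-byte seed; take 4 bytes for each half of the result.
      digest = hashlib.md5(struct.pack(">q", seed)).digest()
      high, low = struct.unpack(">ii", digest[:8])
      return (high << 32) | (low & 0xFFFFFFFF)
  ```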
* [SPARK-11473][ML] R-like summary statistics with intercept for OLS via normal equation solver (Yanbo Liang, 2015-11-05; 1 file, -8/+8)
  Following up [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for the `intercept`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9485 from yanboliang/spark-11473.
* [SPARK-11527][ML][PYSPARK] PySpark AFTSurvivalRegressionModel should expose coefficients/intercept/scale (Yanbo Liang, 2015-11-05; 1 file, -0/+24)
  PySpark `AFTSurvivalRegressionModel` should expose coefficients/intercept/scale. mengxr vectorijk Author: Yanbo Liang <ybliang8@gmail.com> Closes #9492 from yanboliang/spark-11527.
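  A sketch of the newly exposed attributes (assumes `train_df` is an existing DataFrame with "features", "label", and "censor" columns):

  ```python
  from pyspark.ml.regression import AFTSurvivalRegression

  model = AFTSurvivalRegression().fit(train_df)
  print(model.coefficients, model.intercept, model.scale)
  ```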
* [SPARK-11378][STREAMING] Make StreamingContext.awaitTerminationOrTimeout return properly (Nick Evans, 2015-11-05; 2 files, -1/+8)
  This adds a failing test checking that `awaitTerminationOrTimeout` returns the expected value, and then fixes that failing test with the addition of a `return`. tdas zsxwing Author: Nick Evans <me@nicolasevans.org> Closes #9336 from manygrams/fix_await_termination_or_timeout.
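  A sketch of the now-meaningful return value (assumes `ssc` is a started StreamingContext; before the fix the Python wrapper always returned None):

  ```python
  stopped = ssc.awaitTerminationOrTimeout(10)  # timeout in seconds
  if not stopped:
      # Context is still running after the timeout; shut it down ourselves.
      ssc.stop(stopSparkContext=True)
  ```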
* [SPARK-10028][MLLIB][PYTHON] Add Python API for PrefixSpan (Yu ISHIKAWA, 2015-11-04; 1 file, -1/+68)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9469 from yu-iskw/SPARK-10028.
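  A minimal sketch of the new API (assumes `sc` is a SparkContext; each element is a sequence of itemsets):

  ```python
  from pyspark.mllib.fpm import PrefixSpan

  data = [
      [["a", "b"], ["c"]],
      [["a"], ["c", "b"], ["a", "b"]],
      [["a", "b"], ["e"]],
      [["f"]],
  ]
  model = PrefixSpan.train(sc.parallelize(data, 2),
                           minSupport=0.5, maxPatternLength=5)
  for fs in model.freqSequences().collect():
      print(fs.sequence, fs.freq)
  ```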
* [SPARK-11489][SQL] Only include common first order statistics in GroupedData (Reynold Xin, 2015-11-03; 1 file, -88/+0)
  We added a bunch of higher order statistics such as skewness and kurtosis to GroupedData. I don't think they are common enough to justify being listed, since users can always use the normal statistics aggregate functions. That is to say, after this change, we won't support

  ```scala
  df.groupBy("key").kurtosis("colA", "colB")
  ```

  However, we will still support

  ```scala
  df.groupBy("key").agg(kurtosis(col("colA")), kurtosis(col("colB")))
  ```

  Author: Reynold Xin <rxin@databricks.com> Closes #9446 from rxin/SPARK-11489.
* [SPARK-11467][SQL] Add Python API for stddev/variance (Davies Liu, 2015-11-03; 2 files, -0/+105)
  Add Python API for stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis. Author: Davies Liu <davies@databricks.com> Closes #9424 from davies/py_var.
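  A runnable sketch of a few of the new aggregates (hypothetical one-column DataFrame; assumes `sqlContext` as in the PySpark shell):

  ```python
  from pyspark.sql.functions import kurtosis, skewness, stddev, variance

  df = sqlContext.createDataFrame([(1.0,), (2.0,), (4.0,), (8.0,)], ["x"])
  df.agg(stddev("x"), variance("x"), skewness("x"), kurtosis("x")).show()
  ```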
* [SPARK-10592][ML][PYSPARK] Deprecate weights and use coefficients instead in ML models (vectorijk, 2015-11-02; 2 files, -0/+25)
  Deprecated in `LogisticRegression` and `LinearRegression`. Author: vectorijk <jiangkai@gmail.com> Closes #9311 from vectorijk/spark-10592.
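  A sketch of the renamed accessor (assumes `train_df` is an existing DataFrame with "features" and "label" columns):

  ```python
  from pyspark.ml.regression import LinearRegression

  model = LinearRegression(maxIter=10).fit(train_df)
  print(model.coefficients)  # preferred accessor
  print(model.weights)       # still works, but emits a deprecation warning
  ```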
* [SPARK-10286][ML][PYSPARK][DOCS] Add @since annotation to pyspark.ml.param and pyspark.ml.* (lihao, 2015-11-02; 4 files, -0/+230)
  Author: lihao <lihaowhu@gmail.com> Closes #9275 from lidinghao/SPARK-10286.
* [SPARK-11358][MLLIB] Deprecate runs in k-means (Xiangrui Meng, 2015-11-02; 1 file, -0/+4)
  This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation; I haven't seen much usage with `runs` not equal to 1, and we don't have a unit test for it either. We can deprecate this parameter in 1.6 and make it a no-op in 1.7, which helps us simplify the implementation. cc: srowen Author: Xiangrui Meng <meng@databricks.com> Closes #9322 from mengxr/SPARK-11358.
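  A sketch of calling k-means without the deprecated parameter (assumes `sc` is a SparkContext):

  ```python
  from pyspark.mllib.clustering import KMeans
  from pyspark.mllib.linalg import Vectors

  points = sc.parallelize([
      Vectors.dense([0.0, 0.0]), Vectors.dense([1.0, 1.0]),
      Vectors.dense([9.0, 8.0]), Vectors.dense([8.0, 9.0]),
  ])
  # `runs` is deprecated; leave it at its default of 1.
  model = KMeans.train(points, k=2, maxIterations=10)
  print(model.clusterCenters)
  ```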