| Commit message | Author | Age | Files | Lines |
... | |
| |
The current implementation of statistics for UnaryNode does not consider the output (for example, Project may produce far fewer columns than its child); we should take the output into account to make a better guess.
We usually join with only a few columns from a Parquet table, so the size of the projected plan could be much smaller than the original Parquet files. Having a better guess of the size helps us choose between broadcast join and sort merge join.
After this PR, I saw a few queries choose broadcast join instead of sort merge join without tuning spark.sql.autoBroadcastJoinThreshold for every query, ending up with about 6-8X improvements in end-to-end time.
We use the `defaultSize` of DataType to estimate the size of a column. Currently, for DecimalType/StringType/BinaryType and UDTs, we over-estimate far too much (4096 bytes), so this PR changes them to more reasonable values. Here are the new defaultSize values:
DecimalType: 8 or 16 bytes, based on the precision
StringType: 20 bytes
BinaryType: 100 bytes
UDT: the default size of its underlying SQL type
These numbers are not perfect (it is hard to pick a perfect number for them), but they should be better than 4096.
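As an illustrative sketch only (the paths, column names, and the `sqlCtx` handle are hypothetical, not part of this patch): with a tighter size estimate for the projected columns, a join like the one below can fall under spark.sql.autoBroadcastJoinThreshold and be planned as a broadcast join.
```python
# hypothetical example: project only the join columns from the dimension table
dim = sqlCtx.read.parquet("/path/to/dim_table").select("id", "name")
fact = sqlCtx.read.parquet("/path/to/fact_table")
fact.join(dim, "id").explain()  # inspect which join strategy the planner picked
```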
Author: Davies Liu <davies@databricks.com>
Closes #11210 from davies/statics.
| |
MLlib
## What changes were proposed in this pull request?
In order to provide better and more consistent results, let's change the default value of MLlib ```LogisticRegressionWithLBFGS convergenceTol``` from ```1E-4``` to ```1E-6```, which will be equal to ML ```LogisticRegression```.
cc dbtsai
## How was this patch tested?
unit tests
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #11299 from yanboliang/spark-13429.
| |
format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the fpm and recommendation modules.
Closes #10602
Closes #10897
Author: Bryan Cutler <cutlerb@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes #11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632.
| |
in other comments
## What changes were proposed in this pull request?
This PR tries to fix all typos in all markdown files under the `docs` module,
and fixes similar typos in other comments, too.
## How was this patch tested?
Manual tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11300 from dongjoon-hyun/minor_fix_typos.
| |
and other tuning for Word2Vec
Add support for arbitrary-length sentences by using the natural representation of sentences in the input.
Add new similarity functions and a normalization option for distances in synonym finding.
Add new accessors for internal structures (the vocabulary and word index) for convenience.
Need instructions on how to set the value of the Since annotation for newly added public functions. 1.5.3?
jira link: https://issues.apache.org/jira/browse/SPARK-12153
Author: Yong Gang Cao <ygcao@amazon.com>
Author: Yong-Gang Cao <ygcao@users.noreply.github.com>
Closes #10152 from ygcao/improvementForSentenceBoundary.
| |
## What changes were proposed in this pull request?
This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.
This was previously causing `"AnalysisException: u"unresolved operator 'Union;""` when trying to unionAll two DataFrames with UDT columns, as below.
```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types
schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
c = a.unionAll(b)
```
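A minimal sketch (not the actual patch; the class name is hypothetical) of the idea behind the fix: two UDT instances of the same class should compare equal, so the dataType equality check during union succeeds.
```python
from pyspark.sql.types import UserDefinedType

class ExampleUDT(UserDefinedType):  # hypothetical UDT subclass, for illustration only
    def __eq__(self, other):
        # two instances of the same UDT class describe the same data type
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))
```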
## How was this patch tested?
Tested using two unit tests in sql/test.py and the DataFrameSuite.
Additional information here : https://issues.apache.org/jira/browse/SPARK-13410
Author: Franklyn D'souza <franklynd@gmail.com>
Closes #11279 from damnMeddlingKid/udt-union-all.
| |
This PR introduces several major changes:
1. Replacing `Expression.prettyString` with `Expression.sql`
The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
1. Using a SQL-like representation as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)
Before, we were using `prettyString` as column names when possible, and sometimes the resulting column names could be weird. Here are several examples:
Expression | `prettyString` | `sql` | Note
------------------ | -------------- | ---------- | ---------------
`a && b` | `a && b` | `a AND b` |
`a.getField("f")` | `a[f]` | `a.f` | `a` is a struct
1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
`NonSQLExpression.sql` may return an arbitrary user-facing string representation of the expression.
Author: Cheng Lian <lian@databricks.com>
Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
| |
outside of the doctests
Some of the new doctests in ml/clustering.py have a lot of setup code; move the setup code into the general test init to keep the doctests more example-style looking.
In part this is a follow up to https://github.com/apache/spark/pull/10999
Note that the same pattern is followed in regression & recommendation - might as well clean up all three at the same time.
Author: Holden Karau <holden@us.ibm.com>
Closes #11197 from holdenk/SPARK-13302-cleanup-doctests-in-ml-clustering.
| |
This reverts commit 4f9a66481849dc867cf6592d53e0e9782361d20a.
| |
Author: Kai Jiang <jiangkai@gmail.com>
Closes #10527 from vectorijk/spark-12567.
| |
for reduce, fold
Clarify that reduce functions need to be commutative, and fold functions do not
See https://github.com/apache/spark/pull/11091
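A small illustration (assuming an existing SparkContext `sc`); string concatenation is associative but not commutative, which per this clarification is acceptable for fold but not for reduce:
```python
sc.parallelize([1, 2, 3, 4]).reduce(lambda a, b: a + b)            # commutative and associative: safe for reduce
sc.parallelize(["a", "b", "c"], 2).fold("", lambda a, b: a + b)    # associative only: fine for fold
```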
Author: Sean Owen <sowen@cloudera.com>
Closes #11217 from srowen/SPARK-13339.
| |
There's a small typo in the SparseVector.parse docstring: it says that it returns a DenseVector rather than a SparseVector, which is incorrect.
Author: Miles Yucht <miles@databricks.com>
Closes #11213 from mgyucht/fix-sparsevector-docs.
| |
This pull request has the following changes:
1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
2. Moved UserDefinedPythonFunction into the execution.python package, so we don't have a random private class in the top-level sql package.
3. Moved everything in execution/python.scala into the newly created execution.python package.
Most of the diffs are just straight copy-paste.
Author: Reynold Xin <rxin@databricks.com>
Closes #11181 from rxin/SPARK-13296.
| |
JIRA: https://issues.apache.org/jira/browse/SPARK-12363
This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not the correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #10539 from viirya/fix-poweriter.
| |
consistent format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module.
Author: vijaykiran <mail@vijaykiran.com>
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.
| |
Add PySpark support for ```covar_samp``` and ```covar_pop```.
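A usage sketch of the new aggregate functions (the DataFrame and column names are hypothetical):
```python
from pyspark.sql import functions as F

df.agg(F.covar_pop("x", "y"), F.covar_samp("x", "y")).show()
```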
cc rxin davies marmbrus
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10876 from yanboliang/spark-12962.
| |
We should have lint rules using sphinx to automatically catch the pydoc issues that are sometimes introduced.
Right now ./dev/lint-python skips building the docs if Sphinx isn't present, but it might make sense to fail hard; it's just a matter of whether we want to insist that all PySpark developers have Sphinx present.
Author: Holden Karau <holden@us.ibm.com>
Closes #11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.
| |
Add Python API for spark.ml bisecting k-means.
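A usage sketch (the training DataFrame `df` with a `features` column is assumed, not part of this patch):
```python
from pyspark.ml.clustering import BisectingKMeans

bkm = BisectingKMeans(k=2, featuresCol="features")
model = bkm.fit(df)
centers = model.clusterCenters()
```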
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10889 from yanboliang/spark-12974.
| |
parameter
Fix this defect by checking whether the default value exists or not.
yanboliang, please help to review.
Author: Tommy YU <tummyyu@163.com>
Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
| |
Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is currently no way of getting a Boolean to indicate whether a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.
In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false
cc holdenk
Author: sethah <seth.hendrickson16@gmail.com>
Closes #10962 from sethah/SPARK-13047.
| |
PySpark ml.clustering now supports export/import.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10999 from yanboliang/spark-13035.
| |
Test cases should be removed from the annotations of the ```setXXX``` functions; otherwise they become part of the [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode).
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10975 from yanboliang/clustering-cleanup.
| |
PySpark ml.recommendation now supports export/import.
Author: Kai Jiang <jiangkai@gmail.com>
Closes #11044 from vectorijk/spark-13037.
| |
grouping() returns whether a column is aggregated or not, and grouping_id() returns the aggregation level.
grouping()/grouping_id() can be used with window functions, but do not work in the having/sort clause; that will be fixed by another PR.
The GROUPING__ID/grouping_id() in Hive is wrong (according to its docs), and we also did it wrongly; this PR changes that to match the behavior in most databases (and the Hive docs).
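A usage sketch under the assumption of a hypothetical `employees` table and an existing `sqlCtx`, showing how grouping_id() distinguishes the aggregation levels produced by CUBE:
```python
# hypothetical table and columns; grouping_id() labels each aggregation level
sqlCtx.sql("""
    SELECT dept, role, grouping_id() AS gid, sum(salary) AS total
    FROM employees
    GROUP BY dept, role WITH CUBE
""").show()
```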
Author: Davies Liu <davies@databricks.com>
Closes #10677 from davies/grouping.
| |
I have fixed the warnings that show up when running "make html" under "python/docs/". They are caused by not having blank lines around indented paragraphs.
Author: Nam Pham <phamducnam@gmail.com>
Closes #11025 from nampham2/SPARK-12986.
| |
structures
rxin srowen
I worked out a note message for the rdd.take function; please help to review.
If it's fine, I can apply it to all the other functions later.
Author: Tommy YU <tummyyu@163.com>
Closes #10874 from Wenpei/spark-5865-add-warning-for-localdatastructure.
| |
`rpcEnv.awaitTermination()` was not added in #10854 because some Streaming Python tests hung forever.
This patch fixed the hang issue and added rpcEnv.awaitTermination() back to SparkEnv.
Previously, the Streaming Kafka Python tests shut down the ZooKeeper server before stopping the StreamingContext. Then, when stopping the StreamingContext, KafkaReceiver may hang due to https://issues.apache.org/jira/browse/KAFKA-601; hence, some threads of RpcEnv's Dispatcher cannot exit and rpcEnv.awaitTermination hangs. The patch just changes the shutdown order to fix it.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #11031 from zsxwing/awaitTermination.
| |
format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the clustering module.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.
| |
This PR adds the ability to specify the ```ignoreNulls``` option to the functions dsl, e.g:
```df.select($"id", last($"value", ignoreNulls = true).over(Window.partitionBy($"id").orderBy($"other")))```
This PR is somewhere between a bug fix (see the JIRA) and a new feature. I am not sure if we should backport to 1.6.
cc yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10957 from hvanhovell/SPARK-13049.
| |
LinearRegression as example
* Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
* Make ```LinearRegression``` support ```save/load``` as an example. After this is merged, the work for other transformers/estimators will be easy; then we can list and distribute the tasks to the community.
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #10469 from yanboliang/spark-11939.
| |
I tried to add this via Jackson's `USE_BIG_DECIMAL_FOR_FLOATS` option, with no success.
Added test for non-complex types. Should I add a test for complex types?
Author: Brandon Bradley <bradleytastic@gmail.com>
Closes #10936 from blbradley/spark-12749.
| |
`None` triggers cryptic failure
The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works.
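A minimal sketch of the now-working case (the field name, metadata key, and `sqlCtx` handle are hypothetical):
```python
from pyspark.sql.types import StructType, StructField, StringType

# metadata containing None is now handled correctly when the schema is used
schema = StructType([StructField("a", StringType(), True, metadata={"comment": None})])
df = sqlCtx.createDataFrame([("x",)], schema)
```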
Author: Jason Lee <cjlee@us.ibm.com>
Closes #8969 from jasoncl/SPARK-10847.
| |
https://issues.apache.org/jira/browse/SPARK-12780
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10724 from yinxusen/SPARK-12780.
| |
The current Python ML params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible source of errors & simplify the use of custom params by adding a ```_copy_new_parent``` method to Param, so as to avoid cut-and-pasting (and cut-and-pasting at different indentation levels, ugh).
Author: Holden Karau <holden@us.ibm.com>
Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
| |
The environment variable ADD_FILES was created for adding Python files to the Spark context so that they are distributed to executors (SPARK-865); this is deprecated now. Users are encouraged to use --py-files for adding Python files.
Author: Jeff Zhang <zjffdu@apache.org>
Closes #10913 from zjffdu/SPARK-12993.
| |
https://issues.apache.org/jira/browse/SPARK-11923
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10186 from yinxusen/SPARK-11923.
| |
PySpark for now
I saw several failures from recent PR builds, e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull. This PR marks the test as ignored and we will fix the flakiness in SPARK-10086.
gliptak Do you know why the test failure didn't show up in the Jenkins "Test Result"?
cc: jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #10909 from mengxr/SPARK-10086.
| |
Add Python API for ml.feature.QuantileDiscretizer.
One open question: do we want to re-use the Java model, create a new model, or use a different wrapper around the Java model?
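A usage sketch of the new Python wrapper (the DataFrame and column names are hypothetical):
```python
from pyspark.ml.feature import QuantileDiscretizer

qd = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="hour_bucket")
bucketizer = qd.fit(df)            # fitting produces a Bucketizer model
discretized = bucketizer.transform(df)
```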
cc brkyvz & mengxr
Author: Holden Karau <holden@us.ibm.com>
Closes #10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
| |
```PCAModel``` can now output ```explainedVariance``` on the Python side.
cc mengxr srowen
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10830 from yanboliang/spark-12905.
| |
Python rows
When the actual row length doesn't conform to the specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`.
Author: Cheng Lian <lian@databricks.com>
Closes #10886 from liancheng/spark-12624.
| |
…ialize HiveContext in PySpark
davies Mind to review?
This is the error message after this PR
```
15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
/Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
warnings.warn("You must build Spark with Hive. "
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read
return DataFrameReader(self)
File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__
self._jreader = sqlContext._ssql_ctx.read()
File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx
raise e
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
```
Author: Jeff Zhang <zjffdu@apache.org>
Closes #10126 from zjffdu/SPARK-12120.
| |
This is #9263 from gliptak (improving grouping/display of test case results) with a small fix of bisecting k-means unit test.
Author: Gábor Lipták <gliptak@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #10850 from mengxr/SPARK-11295.
| |
This reverts commit c6f971b4aeca7265ab374fa46c5c452461d9b6a7.
| |
prediction column
This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10472 from BenFradet/SPARK-9716.
| |
SPARK-11295 Add packages to JUnit output for Python tests
This improves grouping/display of test case results.
Author: Gábor Lipták <gliptak@gmail.com>
Closes #9263 from gliptak/SPARK-11295.
| |
From the coverage issues for 1.6: add a Python API for mllib.clustering.BisectingKMeans.
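A usage sketch of the new wrapper (assuming an existing SparkContext `sc`; the data is made up):
```python
from pyspark.mllib.clustering import BisectingKMeans

data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = BisectingKMeans.train(data, k=2)
model.predict([0.5, 0.5])  # cluster index of a new point
```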
Author: Holden Karau <holden@us.ibm.com>
Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
| |
Fix the order of arguments that PySpark's RDD.fold passes to its op; it should be (acc, obj) like the other implementations.
Obviously, this is a potentially breaking change, so it can only happen for 2.x.
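A quick sketch of the corrected behavior (assuming an existing SparkContext `sc`): the op now receives the accumulator first, then the element.
```python
# returns 10; the lambda is called as op(acc, obj), matching Scala's RDD.fold
sc.parallelize([1, 2, 3, 4]).fold(0, lambda acc, x: acc + x)
```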
CC davies
Author: Sean Owen <sowen@cloudera.com>
Closes #10771 from srowen/SPARK-7683.
| |
Spark 1.6 QA
Add missing PySpark methods and params for ml.feature:
* ```RegexTokenizer``` should support setting ```toLowercase```.
* ```MinMaxScalerModel``` should support output ```originalMin``` and ```originalMax```.
* ```PCAModel``` should support output ```pc```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9908 from yanboliang/spark-11925.
| |
In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base.
Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:
- The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. In order to make this work we needed to hardcode approximate operators in the parser, or we would have to create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain. So, this PR **removes** this keyword.
- The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is **not supported** anymore. See https://github.com/apache/spark/pull/10689 for the rationale for this.
- Hive supports a charset name / charset literal combination; for instance, the expression ```_ISO-8859-1 0x4341464562616265``` would yield this string: ```CAFEbabe```. Hive will only allow charset names that start with an underscore. This is quite annoying in Spark because as soon as you use a tuple, names will start with an underscore. In this PR we **remove** this feature from the parser. It would be quite easy to implement such a feature as an Expression later on.
- Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed.
cc rxin viirya marmbrus yhuai cloud-fan
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10745 from hvanhovell/SPARK-12575-2.
| |
This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is the same between shuffle and the bucketed data source, which enables us to shuffle only one side when joining a bucketed table with a normal one.
This PR also fixes the tests that are broken by the new hash behaviour in shuffle.
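An illustrative sketch only (the bucketed writer API shown is as it appears in Spark 2.x releases; the table, column names, and `df` are hypothetical): a table bucketed by the join key can then be joined on that key without shuffling the bucketed side.
```python
# hypothetical: write a table bucketed (and sorted) by the join key "id"
df.write.bucketBy(8, "id").sortBy("id").saveAsTable("events_bucketed")
```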
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.