| Commit message | Author | Age | Files | Lines |
Author: guowei2 <guowei@growingio.com>
Closes #7254 from guowei2/spark-8865 and squashes the following commits:
48ca17a [guowei2] fix contains key
serialization for Python DataFrame
This PR fixes the long-standing serialization issue between Python RDDs and DataFrames by switching to a customized Pickler for InternalRow, which enables customized unpickling (type conversion, especially for UDTs). We can now support UDTs in UDFs. cc mengxr
There is no generated `Row` anymore.
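For orientation, a minimal PySpark round trip that exercises this serialization path (my sketch, assuming an existing `SQLContext` named `sqlContext`):
```python
# Sketch only: DataFrame -> Python RDD of Rows -> DataFrame again,
# which goes through the Python/JVM pickling path this PR refactors.
from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(name="a", age=1), Row(name="b", age=2)])
rows = df.rdd                           # DataFrame to an RDD of Row objects
df2 = sqlContext.createDataFrame(rows)  # and back again
print(df2.collect())
```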
Author: Davies Liu <davies@databricks.com>
Closes #7301 from davies/sql_ser and squashes the following commits:
81bef71 [Davies Liu] address comments
e9217bd [Davies Liu] add regression tests
db34167 [Davies Liu] Refactor of serialization for Python DataFrame
This PR proposes a simple way to expose OffsetRange in Python code, and the usage of offsetRanges is similar to the Scala/Java way. In Python we can get an OffsetRange like:
```
dstream.foreachRDD(lambda r: KafkaUtils.offsetRanges(r))
```
The reason I didn't follow the approach SPARK-8389 suggested is that the Python Kafka API has one more step than Scala/Java: it decodes the message, which makes the Python API return a transformed RDD/DStream rather than a directly wrapped JavaKafkaRDD, so it is hard to backtrack to the original RDD to get the offsetRange.
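For reference, a minimal sketch of how this could look in a PySpark streaming job; the topic, broker address, and the `offsetRanges()` accessor on the Kafka RDD are illustrative assumptions, not the exact code from this PR:
```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="OffsetRangesSketch")
ssc = StreamingContext(sc, 5)

# Placeholder topic and broker list.
stream = KafkaUtils.createDirectStream(
    ssc, ["test-topic"], {"metadata.broker.list": "localhost:9092"})

def report_offsets(rdd):
    # The direct stream's RDD carries its Kafka offset ranges.
    for o in rdd.offsetRanges():
        print("%s partition %d: [%d, %d)" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

stream.foreachRDD(report_offsets)
ssc.start()
ssc.awaitTermination()
```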
Author: jerryshao <saisai.shao@intel.com>
Closes #7185 from jerryshao/SPARK-8389 and squashes the following commits:
4c6d320 [jerryshao] Another way to fix subclass deserialization issue
e6a8011 [jerryshao] Address the comments
fd13937 [jerryshao] Fix serialization bug
7debf1c [jerryshao] bug fix
cff3893 [jerryshao] refactor the code according to the comments
2aabf9e [jerryshao] Style fix
848c708 [jerryshao] Add HasOffsetRanges for Python
This PR adds `metadata` to `InputInfo`. `InputDStream` can report its metadata for a batch and it will be shown in the batch page.
For example,
![screen shot](https://cloud.githubusercontent.com/assets/1000778/8403741/d6ffc7e2-1e79-11e5-9888-c78c1575123a.png)
FileInputDStream will display the new files for a batch, and DirectKafkaInputDStream will display its offset ranges.
Author: zsxwing <zsxwing@gmail.com>
Closes #7081 from zsxwing/input-metadata and squashes the following commits:
f7abd9b [zsxwing] Revert the space changes in project/MimaExcludes.scala
d906209 [zsxwing] Merge branch 'master' into input-metadata
74762da [zsxwing] Fix MiMa tests
7903e33 [zsxwing] Merge branch 'master' into input-metadata
450a46c [zsxwing] Address comments
1d94582 [zsxwing] Raname InputInfo to StreamInputInfo and change "metadata" to Map[String, Any]
d496ae9 [zsxwing] Add input metadata in the batch page
scheduler
This is largely based on extracting the dynamic allocation parts from tnachen's #3861.
Author: Iulian Dragos <jaguarul@gmail.com>
Closes #4984 from dragos/issue/mesos-coarse-dynamicAllocation and squashes the following commits:
39df8cd [Iulian Dragos] Update tests to latest changes in core.
9d2c9fa [Iulian Dragos] Remove adjustment of executorLimitOption in doKillExecutors.
8b00f52 [Iulian Dragos] Latest round of reviews.
0cd00e0 [Iulian Dragos] Add persistent shuffle directory
15c45c1 [Iulian Dragos] Add dynamic allocation to the Spark coarse-grained scheduler.
(This reopens a patch that was closed in the past: #6248)
When you view the stage page while running the following:
```
sc.parallelize(1 to X, 10000).count()
```
The page never loads, the job is stalled, and you end up running into an OOM:
```
HTTP ERROR 500
Problem accessing /stages/stage/. Reason:
Server Error
Caused by:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
```
This patch compresses Jetty responses with gzip. The correct long-term fix is to add pagination.
Author: Andrew Or <andrew@databricks.com>
Closes #7296 from andrewor14/gzip-jetty and squashes the following commits:
a051c64 [Andrew Or] Use GZIP to compress Jetty responses
input stream
Fix this failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2886/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_with_file_input_stream/
To reproduce this failure, you can add `Thread.sleep(2000)` before this line
https://github.com/apache/spark/blob/a9c4e29950a14e32acaac547e9a0e8879fd37fc9/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala#L477
Author: zsxwing <zsxwing@gmail.com>
Closes #7323 from zsxwing/SPARK-7419 and squashes the following commits:
b3caf58 [zsxwing] Fix CheckpointSuite.recovery with file input stream
The configuration `SPARK_EXECUTOR_CORES` is not put into `SparkConf`, so it has no effect on dynamic executor allocation.
Author: xutingjun <xutingjun@huawei.com>
Closes #7322 from XuTingjun/SPARK_EXECUTOR_CORES and squashes the following commits:
2cafa89 [xutingjun] make SPARK_EXECUTOR_CORES has effect to dynamicAllocation
Log statements incorrectly showed that the executor was being stopped when the receiver was being stopped.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #7328 from tdas/fix-log and squashes the following commits:
9cc6e99 [Tathagata Das] Fix log statements.
[SPARK-8258] [SPARK-8259] [SPARK-8261] [SPARK-8262] [SPARK-8253] [SPARK-8260] [SPARK-8267] [SQL] Add String Expressions
Author: Cheng Hao <hao.cheng@intel.com>
Closes #6762 from chenghao-intel/str_funcs and squashes the following commits:
b09a909 [Cheng Hao] update the code as feedback
7ebbf4c [Cheng Hao] Add more string expressions
document to words count vector
jira: https://issues.apache.org/jira/browse/SPARK-8703
Converts a text document to a sparse vector of token counts.
I can further add an estimator to extract the vocabulary from the corpus if that's appropriate.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #7084 from hhbyyh/countVectorization and squashes the following commits:
5f3f655 [Yuhao Yang] text change
24728e4 [Yuhao Yang] style improvement
576728a [Yuhao Yang] rename to model and some fix
1deca28 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization
99b0c14 [Yuhao Yang] undo extension from HashingTF
12c2dc8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization
7ee1c31 [Yuhao Yang] extends HashingTF
809fb59 [Yuhao Yang] minor fix for ut
7c61fb3 [Yuhao Yang] add countVectorizer
boto config
'spark_ec2.py' uses boto to control EC2.
boto supports '~/.aws/credentials', which is the AWS CLI's default configuration file.
We can check this information in the boto reference:
"A boto config file is a text file formatted like an .ini configuration file that specifies values for options that control the behavior of the boto library. In Unix/Linux systems, on startup, the boto library looks for configuration files in the following locations and in the following order:
/etc/boto.cfg - for site-wide settings that all users on this machine will use
(if profile is given) ~/.aws/credentials - for credentials shared between SDKs
(if profile is given) ~/.boto - for user-specific settings
~/.aws/credentials - for credentials shared between SDKs
~/.boto - for user-specific settings"
* ref of boto: http://boto.readthedocs.org/en/latest/boto_config_tut.html
* ref of aws cli : http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
However, 'spark_ec2.py' only checks the boto config and environment variables, so even if '~/.aws/credentials' exists, 'spark_ec2.py' terminates.
So I changed it to also check '~/.aws/credentials'.
cc rxin
Jira : https://issues.apache.org/jira/browse/SPARK-8863
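A minimal sketch of the fallback idea (not the exact patch; assumes the standard INI layout of '~/.aws/credentials' and the Python 2 `ConfigParser` that spark_ec2.py used at the time):
```python
import os
from ConfigParser import ConfigParser

def aws_keys_from_credentials(profile="default"):
    """Return (access_key, secret_key) from ~/.aws/credentials, or None if absent."""
    path = os.path.expanduser("~/.aws/credentials")
    if not os.path.isfile(path):
        return None
    parser = ConfigParser()
    parser.read(path)
    if not parser.has_section(profile):
        return None
    return (parser.get(profile, "aws_access_key_id"),
            parser.get(profile, "aws_secret_access_key"))
```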
Author: JPark <JPark@JPark.me>
Closes #7252 from JuhongPark/master and squashes the following commits:
23c5792 [JPark] Check aws access key from aws credentials if there is no boto config
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #7315 from cloud-fan/toString and squashes the following commits:
4fc8d80 [Wenchen Fan] Implement toString for Interval data type
I merged https://github.com/apache/spark/pull/7303 so it unblocks another PR. This addresses my own code review comment for that PR.
Author: Reynold Xin <rxin@databricks.com>
Closes #7313 from rxin/adt and squashes the following commits:
7ade82b [Reynold Xin] Fixed unit tests.
f8d5533 [Reynold Xin] [SPARK-8926][SQL] Code review followup.
Also added more documentation for the file.
Author: Reynold Xin <rxin@databricks.com>
Closes #7316 from rxin/extract-value and squashes the following commits:
069cb7e [Reynold Xin] Removed ExtractValueWithOrdinal.
621b705 [Reynold Xin] Reverted a line.
11ebd6c [Reynold Xin] [Minor][SQL] Improve documentation for complex type extractors.
JIRA: https://issues.apache.org/jira/browse/SPARK-8940
The given `schema` parameter will be overwritten in `createDataFrame` now. If it is not null, we shouldn't overwrite it.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #7311 from viirya/df_not_overwrite_schema and squashes the following commits:
2385139 [Liang-Chi Hsieh] Don't overwrite given schema if it is not null.
Jira: https://issues.apache.org/jira/browse/SPARK-8830
rxin and HuJiayin, can you have a look at it?
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes #7236 from tarekauel/native-levenshtein-distance and squashes the following commits:
ee4c4de [Tarek Auel] [SPARK-8830] implemented improvement proposals
c252e71 [Tarek Auel] [SPARK-8830] removed chartAt; use unsafe method for byte array comparison
ddf2222 [Tarek Auel] Merge branch 'master' into native-levenshtein-distance
179920a [Tarek Auel] [SPARK-8830] added description
5e9ed54 [Tarek Auel] [SPARK-8830] removed StringUtils import
dce4308 [Tarek Auel] [SPARK-8830] native levenshtein distance
in codegen
Exceptions will not be caught during tests.
cc marmbrus rxin
Author: Davies Liu <davies@databricks.com>
Closes #7309 from davies/fallback and squashes the following commits:
969a612 [Davies Liu] throw exception during tests
f844f77 [Davies Liu] fallback
a3091bc [Davies Liu] Merge branch 'master' of github.com:apache/spark into fallback
364a0d6 [Davies Liu] fallback to interpret mode if failed to compile
Write the missing pydocs for the `SparseVector` attributes.
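The attributes in question, for a quick look (assumes pyspark.mllib is available):
```python
from pyspark.mllib.linalg import SparseVector

sv = SparseVector(4, [1, 3], [2.0, 5.0])
print(sv.size)     # dimension of the vector: 4
print(sv.indices)  # positions of the non-zero entries
print(sv.values)   # the corresponding non-zero values
```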
Author: lewuathe <lewuathe@me.com>
Closes #7290 from Lewuathe/SPARK-6266 and squashes the following commits:
51d9895 [lewuathe] Update docs
0480d35 [lewuathe] Merge branch 'master' into SPARK-6266
ba42cf3 [lewuathe] [SPARK-6266] PySpark SparseVector missing doc for size, indices, values
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #7312 from cloud-fan/minor and squashes the following commits:
a4589fa [Wenchen Fan] use double not decimal when cast double and float to timestamp
handling Parquet LISTs in compatible mode
This PR is based on #7209 authored by Sephiroth-Lin.
Author: Weizhong Lin <linweizhong@huawei.com>
Closes #7314 from liancheng/spark-8928 and squashes the following commits:
75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
when handling Parquet LISTs in compatible mode"
This reverts commit 3dab0da42940a46f0c4aa4853bdb5c64c4cb2613.
handling Parquet LISTs in compatible mode
This PR is based on #7209 authored by Sephiroth-Lin.
Author: Weizhong Lin <linweizhong@huawei.com>
Closes #7304 from liancheng/spark-8928 and squashes the following commits:
75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
For example: `cannot resolve 'testfunction(null)' due to data type mismatch: argument 1 is expected to be of type int, however, null is of type datetype.`
Author: Michael Armbrust <michael@databricks.com>
Closes #7303 from marmbrus/expectsTypeErrors and squashes the following commits:
c654a0e [Michael Armbrust] fix udts and make errors pretty
137160d [Michael Armbrust] style
5428fda [Michael Armbrust] style
10fac82 [Michael Armbrust] [SPARK-8926][SQL] Good errors for ExpectsInputType expressions
missing in ScalaTest config.
`spark.unsafe.exceptionOnMemoryLeak` is present in the surefire config:
```
<!-- Surefire runs all Java tests -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.18.1</version>
<!-- Note config is repeated in scalatest config -->
...
<spark.unsafe.exceptionOnMemoryLeak>true</spark.unsafe.exceptionOnMemoryLeak>
</systemProperties>
...
```
but it is absent from the ScalaTest config.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #7308 from sarutak/add-setting-for-memory-leak and squashes the following commits:
95644e7 [Kousuke Saruta] Added a setting for memory leak
Due to the way MiMa works, we currently start a `SQLContext` pretty early on. This causes us to start a `SparkUI` that attempts to bind to port 4040. Because many tests run in parallel on the Jenkins machines, this causes port contention sometimes and fails the MiMa tests.
Note that we already disabled the SparkUI for scalatests. However, the MiMa test is run before we even have a chance to load the default scalatest settings, so we need to explicitly disable the UI ourselves.
Author: Andrew Or <andrew@databricks.com>
Closes #7300 from andrewor14/mima-flaky and squashes the following commits:
b55a547 [Andrew Or] Do not enable SparkUI during tests
We call Row.copy() in many places throughout SQL but UnsafeRow currently throws UnsupportedOperationException when copy() is called.
Supporting copying when ObjectPool is used may be difficult, since we may need to handle deep-copying of objects in the pool. In addition, this copy() method needs to produce a self-contained row object which may be passed around / buffered by downstream code which does not understand the UnsafeRow format.
In the long run, we'll need to figure out how to handle the ObjectPool corner cases, but this may be unnecessary if other changes are made. Therefore, in order to unblock my sort patch (#6444) I propose that we support copy() for the cases where UnsafeRow does not use an ObjectPool and continue to throw UnsupportedOperationException when an ObjectPool is used.
This patch accomplishes this by modifying UnsafeRow so that it knows the size of the row's backing data in order to be able to copy it into a byte array.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #7306 from JoshRosen/SPARK-8932 and squashes the following commits:
338e6bf [Josh Rosen] Support copy for UnsafeRows that do not use ObjectPools.
JIRA: https://issues.apache.org/jira/browse/SPARK-8866
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #7283 from yijieshen/micro_timestamp and squashes the following commits:
dc735df [Yijie Shen] update CastSuite to avoid round error
714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision lose
c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp
8d4aa6b [Yijie Shen] use 1us precision for timestamp type
A couple of descriptions were not inside `<td></td>` and were being displayed immediately under the section title instead of in their rows.
Author: Jonathan Alter <jonalter@users.noreply.github.com>
Closes #7292 from jonalter/docs-config and squashes the following commits:
5ce1570 [Jonathan Alter] [DOCS] Format wrong for some config descriptions
This PR fixes the converter for Python DataFrame, especially for DecimalType
Closes #7106
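A small sketch of the kind of round trip this converter has to handle (my example, assuming an existing `SQLContext` named `sqlContext`):
```python
from decimal import Decimal
from pyspark.sql.types import StructType, StructField, DecimalType

schema = StructType([StructField("amount", DecimalType(10, 2), True)])
df = sqlContext.createDataFrame([(Decimal("3.14"),)], schema)
print(df.collect())  # the value should come back as a decimal.Decimal
```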
Author: Davies Liu <davies@databricks.com>
Closes #7131 from davies/decimal_python and squashes the following commits:
4d3c234 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
20531d6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
7d73168 [Davies Liu] fix conflit
6cdd86a [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
7104e97 [Davies Liu] improve type infer
9cd5a21 [Davies Liu] run python tests with SPARK_PREPEND_CLASSES
829a05b [Davies Liu] fix UDT in python
c99e8c5 [Davies Liu] fix mima
c46814a [Davies Liu] convert decimal for Python DataFrames
As rxin suggested in #7298, we should consider removing `RDDApi`.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #7302 from sarutak/remove-rddapi and squashes the following commits:
e495d35 [Kousuke Saruta] Fixed mima
cb7ebb9 [Kousuke Saruta] Removed overriding RDDApi
Distribute expensive portions of the computation for Gaussian mixture components (in particular, pre-computation of `MultivariateGaussian.rootSigmaInv`, the inverse covariance matrix and covariance determinant) across executors. Repost of PR #4654.
Notes for reviewers:
* What should the policy be for when to distribute the computation? Always? When numClusters > threshold? A user-specified param?
TODO:
* Performance testing and comparison for large number of clusters
Author: Feynman Liang <fliang@databricks.com>
Closes #7166 from feynmanliang/GMM_parallel_mixtures and squashes the following commits:
4f351fa [Feynman Liang] Update heuristic and scaladoc
5ea947e [Feynman Liang] Fix parallelization logic
00eb7db [Feynman Liang] Add helper method for GMM's M step, remove distributeGaussians flag
e7c8127 [Feynman Liang] Add distributeGaussians flag and tests
1da3c7f [Feynman Liang] Distribute mixtures
Adds FPGrowth.generateAssociationRules to public API for generating association rules after mining frequent itemsets.
Author: Feynman Liang <fliang@databricks.com>
Closes #7271 from feynmanliang/SPARK-8877 and squashes the following commits:
83b8baf [Feynman Liang] Add API Doc
867abff [Feynman Liang] Add FPGrowth.generateAssociationRules and change access modifiers for AssociationRules
in pyspark/mllib
Add a confusionMatrix method to the MulticlassMetrics class in pyspark/mllib.
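A hedged usage sketch of the new method (assumes an existing SparkContext `sc`; the label values are made up):
```python
from pyspark.mllib.evaluation import MulticlassMetrics

# (prediction, label) pairs
predictionAndLabels = sc.parallelize(
    [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
metrics = MulticlassMetrics(predictionAndLabels)
print(metrics.confusionMatrix().toArray())  # rows are actual labels, columns are predictions
```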
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #7286 from yanboliang/spark-8068 and squashes the following commits:
6109fe1 [Yanbo Liang] Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
interoperability and backwards-compatibility
This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support. And this one fixes the read path. Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]).
### Major changes
1. `CatalystConverter` class hierarchy refactoring
- Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`.
Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container: for example, appending array elements to a parent array buffer, appending key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. The root converter doesn't have a parent and thus uses a `NoopUpdater`.
This simplifies the design since converters don't need to care about details of their parent converters anymore.
- Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter`
Specifically, now all row objects are represented by `SpecificMutableRow` during conversion.
- Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`
`CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.
The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way.
- Implements backwards-compatibility rules in `CatalystArrayConverter`
When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`.
2. Requested columns handling
When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` which contains all requested columns. This is not preferable when taking compatibility and interoperability into consideration, because the actual Parquet file may have a different physical structure from the converted schema.
In this PR, the schema for requested columns is constructed using the following method:
- For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column.
- For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`.
- Unions all single-field `MessageType`s into a full schema containing all requested fields
With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-file.
### Testing
This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in.
[1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
[2]: https://issues.apache.org/jira/browse/SPARK-6774
[3]: https://issues.apache.org/jira/browse/SPARK-6123
[4]: https://issues.apache.org/jira/browse/SPARK-8848
Author: Cheng Lian <lian@databricks.com>
Closes #7231 from liancheng/spark-6776 and squashes the following commits:
360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite
c6fbc06 [Cheng Lian] Removes WIP file committed by mistake
b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa
598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift
926af87 [Cheng Lian] Simplifies Parquet compatibility test suites
7946ee1 [Cheng Lian] Fixes Scala styling issues
3d7ab36 [Cheng Lian] Fixes .rat-excludes
a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests
f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation
1d390aa [Cheng Lian] Adds parquet-thrift compatibility test
440f7b3 [Cheng Lian] Adds generated files to .rat-excludes
13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite
06cfe9d [Cheng Lian] Adds comments about TimestampType handling
a099d3e [Cheng Lian] More comments
0cc1b37 [Cheng Lian] Fixes MiMa checks
884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes
802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns
38fe1e7 [Cheng Lian] Adds explicit return type
7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change
1781dff [Cheng Lian] Adds test case for SPARK-8811
6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema
bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals
a74fb2c [Cheng Lian] More comments
0525346 [Cheng Lian] Removes old Parquet record converters
03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
With "+" the strings are separate expressions, and format() is called on the last string before concatenation. (So substitution does not happen.) Without "+" the string literals are merged first by the parser, so format() is called on the complete string.
Should I make a JIRA for this?
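A tiny illustration of the difference (my example, not the Spark code itself):
```python
host = "worker-1"

# With "+", format() binds only to the last literal, so {0} is never substituted:
print("Cannot connect to {0}, " + "giving up.".format(host))
# -> Cannot connect to {0}, giving up.

# Without "+", adjacent literals are merged by the parser before format() runs:
print(("Cannot connect to {0}, "
       "giving up.").format(host))
# -> Cannot connect to worker-1, giving up.
```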
Author: Daniel Darabos <darabos.daniel@gmail.com>
Closes #7288 from darabos/patch-2 and squashes the following commits:
be0d3b7 [Daniel Darabos] Correctly print hostname in error
Turning feature scaling off gives better results for all compressed-sensing applications and for some regression use cases. However, implementing this naively by training on the dataset without any standardization leads to a poor convergence rate. Instead, we can still standardize the training dataset but penalize each component differently, which yields effectively the same objective function but a better-conditioned numerical problem. As a result, columns with high variance are penalized less, and vice versa. Without this, all features are standardized and therefore penalized equally.
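As a sketch of why this works (my reasoning, assuming an L1 penalty and features divided by their standard deviations): with $\tilde{x}_j = x_j/\sigma_j$ the fitted coefficients satisfy $\tilde{\beta}_j = \sigma_j \beta_j$, so the unstandardized penalty rewrites as
$$\lambda \sum_j |\beta_j| \;=\; \lambda \sum_j \frac{|\tilde{\beta}_j|}{\sigma_j},$$
i.e. training on standardized features with per-component weights $\lambda/\sigma_j$ reproduces the unstandardized objective, and high-variance columns end up penalized less.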
In R, there is an option for this.
`standardize`
Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".
+cc holdenk mengxr jkbradley
Author: DB Tsai <dbt@netflix.com>
Closes #7080 from dbtsai/lors and squashes the following commits:
877e6c7 [DB Tsai] repahse the doc
7cf45f2 [DB Tsai] address feedback
78d75c9 [DB Tsai] small change
c2c9e60 [DB Tsai] style
6e1a8e0 [DB Tsai] first commit
Adding `()` to the definition of `distinct` in DataFrame allows distinct to be called with parentheses, which is consistent with `dropDuplicates`.
Author: Cheolsoo Park <cheolsoop@netflix.com>
Closes #7298 from piaozhexiu/SPARK-8908 and squashes the following commits:
7f0d923 [Cheolsoo Park] Add () to distinct definition in dataframe
…ng-guide#Manually Specifying Options to be in sync with java,python, R version
Author: Alok Singh <“singhal@us.ibm.com”>
Closes #7299 from aloknsingh/aloknsingh_SPARK-8909 and squashes the following commits:
d3c20ba [Alok Singh] fix the file to .parquet from .json
d476140 [Alok Singh] [SPARK-8909][Documentation] Change the scala example in sql-programming-guide#Manually Specifying Options to be in sync with java,python, R version
Add documentation for NGram feature transformer.
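A short PySpark usage sketch of the transformer the new docs cover (my example; assumes an existing `SQLContext` named `sqlContext`):
```python
from pyspark.ml.feature import NGram

wordsDF = sqlContext.createDataFrame(
    [(["Hi", "I", "heard", "about", "Spark"],)], ["words"])
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngram.transform(wordsDF).select("ngrams").show()
```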
Author: Feynman Liang <fliang@databricks.com>
Closes #7244 from feynmanliang/SPARK-8457 and squashes the following commits:
5aface9 [Feynman Liang] Pretty print Scala output and add API doc to each codetab
60d5ac0 [Feynman Liang] Inline API doc and fix indentation
736ccbc [Feynman Liang] NGram feature transformer documentation
Currently, CTESubstitution only handles the case where WITH is at the top of the plan.
I think it SHOULD handle the case where WITH is a child of CTAS.
This patch simply changes 'match' to 'transform' so that it searches the plan recursively for WITH.
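For illustration, the shape of query this enables (a sketch; assumes a `HiveContext` named `sqlContext` and an existing table `src`):
```python
sqlContext.sql("""
  CREATE TABLE ctas_with_cte AS
  WITH q AS (SELECT key FROM src WHERE key < 10)
  SELECT * FROM q
""")
```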
Author: Keuntae Park <sirpkt@apache.org>
Closes #7180 from sirpkt/SPARK-8783 and squashes the following commits:
e4428f0 [Keuntae Park] Merge remote-tracking branch 'upstream/master' into CTASwithWITH
1671c77 [Keuntae Park] WITH clause can be inside CTAS
Adding __str__ and __repr__ to DenseMatrix and SparseMatrix
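A quick look at the readable output this adds (my sketch):
```python
from pyspark.mllib.linalg import DenseMatrix, SparseMatrix

dm = DenseMatrix(2, 2, [1.0, 2.0, 3.0, 4.0])
sm = SparseMatrix(2, 2, [0, 1, 2], [0, 1], [9.0, 7.0])
print(str(dm))   # human-readable, row-by-row layout
print(repr(sm))  # constructor-style representation
```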
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #6342 from MechCoder/spark-7785 and squashes the following commits:
7b9a82c [MechCoder] Add tests for greater than 16 elements
b88e9dd [MechCoder] Increment limit to 16
1425a01 [MechCoder] Change tests
36bd166 [MechCoder] Change str and repr representation
97f0da9 [MechCoder] zip is same as izip in python3
94ca4b2 [MechCoder] Added doctests and iterate over values instead of colPtrs
b26fa89 [MechCoder] minor
394dde9 [MechCoder] [SPARK-7785] Add __str__ and __repr__ to Matrices
cc pwendell
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #7293 from shivaram/sparkr-packages-doc and squashes the following commits:
c91471d [Shivaram Venkataraman] Fix sparkPackages in init documentation
Fail to upload resource to viewfs in spark-1.4
JIRA Link: https://issues.apache.org/jira/browse/SPARK-8657
Author: Tao Li <litao@sogou-inc.com>
Closes #7125 from litao-buptsse/SPARK-8657-for-master and squashes the following commits:
65b13f4 [Tao Li] [SPARK-8657] [YARN] Fail to upload resource to viewfs
Just a baby step towards making it more efficient.
Author: Reynold Xin <rxin@databricks.com>
Closes #7282 from rxin/SPARK-8888 and squashes the following commits:
3da51ae [Reynold Xin] [SPARK-8888][SQL] Use java.util.HashMap in DynamicPartitionWriterContainer.
We need a new data type to represent time intervals. Because we can't determine how many days are in a month, we need two values for an interval: an int `months` and a long `microseconds`.
The interval literal syntax looks like:
`interval 3 years -4 month 4 weeks 3 second`
Because we use a count of 100ns units as the value of `TimestampType`, it may not make sense to support a nanosecond unit.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #7226 from cloud-fan/interval and squashes the following commits:
632062d [Wenchen Fan] address comments
ac348c3 [Wenchen Fan] use case class
0342d2e [Wenchen Fan] use array byte
df9256c [Wenchen Fan] fix style
fd6f18a [Wenchen Fan] address comments
1856af3 [Wenchen Fan] support interval type
Author: Davies Liu <davies@databricks.com>
Closes #7272 from davies/fix_projection and squashes the following commits:
075ef76 [Davies Liu] fix codegen with BroadcastHashJion
types used in Hive UDF
To help UDF developers understand the failure, throw an exception when unsupported Map<K,V> types are used in a Hive UDF. This fix is the same as #7248.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes #7257 from maropu/ThrowExceptionWhenMapUsed and squashes the following commits:
916099a [Takeshi YAMAMURO] Fix style errors
7886dcc [Takeshi YAMAMURO] Throw an exception when Map<> used in Hive UDF
JIRA: https://issues.apache.org/jira/browse/SPARK-8785
Currently, Parquet schema merging (`ParquetRelation2.readSchema`) may spend a lot of time merging duplicate schemas. We can select only the distinct schemas and merge those.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #7182 from viirya/improve_parquet_merging and squashes the following commits:
5cf934f [Liang-Chi Hsieh] Refactor it to make it faster.
f3411ea [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into improve_parquet_merging
a63c3ff [Liang-Chi Hsieh] Improve Parquet schema merging.