aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-8865] [STREAMING] FIX BUG: check key in kafka paramsguowei22015-07-091-1/+1
| | | | | | | | Author: guowei2 <guowei@growingio.com> Closes #7254 from guowei2/spark-8865 and squashes the following commits: 48ca17a [guowei2] fix contains key
* [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [PYSPARK] Refactor of ↵Davies Liu2015-07-097-315/+292
| | | | | | | | | | | | | | | | serialization for Python DataFrame This PR fix the long standing issue of serialization between Python RDD and DataFrame, it change to using a customized Pickler for InternalRow to enable customized unpickling (type conversion, especially for UDT), now we can support UDT for UDF, cc mengxr . There is no generated `Row` anymore. Author: Davies Liu <davies@databricks.com> Closes #7301 from davies/sql_ser and squashes the following commits: 81bef71 [Davies Liu] address comments e9217bd [Davies Liu] add regression tests db34167 [Davies Liu] Refactor of serialization for Python DataFrame
* [SPARK-8389] [STREAMING] [PYSPARK] Expose KafkaRDDs offsetRange in Pythonjerryshao2015-07-094-11/+196
| | | | | | | | | | | | | | | | | | | | | | This PR propose a simple way to expose OffsetRange in Python code, also the usage of offsetRanges is similar to Scala/Java way, here in Python we could get OffsetRange like: ``` dstream.foreachRDD(lambda r: KafkaUtils.offsetRanges(r)) ``` Reason I didn't follow the way what SPARK-8389 suggested is that: Python Kafka API has one more step to decode the message compared to Scala/Java, Which makes Python API return a transformed RDD/DStream, not directly wrapped so-called JavaKafkaRDD, so it is hard to backtrack to the original RDD to get the offsetRange. Author: jerryshao <saisai.shao@intel.com> Closes #7185 from jerryshao/SPARK-8389 and squashes the following commits: 4c6d320 [jerryshao] Another way to fix subclass deserialization issue e6a8011 [jerryshao] Address the comments fd13937 [jerryshao] Fix serialization bug 7debf1c [jerryshao] bug fix cff3893 [jerryshao] refactor the code according to the comments 2aabf9e [jerryshao] Style fix 848c708 [jerryshao] Add HasOffsetRanges for Python
* [SPARK-8701] [STREAMING] [WEBUI] Add input metadata in the batch pagezsxwing2015-07-0916-51/+148
| | | | | | | | | | | | | | | | | | | | | | This PR adds `metadata` to `InputInfo`. `InputDStream` can report its metadata for a batch and it will be shown in the batch page. For example, ![screen shot](https://cloud.githubusercontent.com/assets/1000778/8403741/d6ffc7e2-1e79-11e5-9888-c78c1575123a.png) FileInputDStream will display the new files for a batch, and DirectKafkaInputDStream will display its offset ranges. Author: zsxwing <zsxwing@gmail.com> Closes #7081 from zsxwing/input-metadata and squashes the following commits: f7abd9b [zsxwing] Revert the space changes in project/MimaExcludes.scala d906209 [zsxwing] Merge branch 'master' into input-metadata 74762da [zsxwing] Fix MiMa tests 7903e33 [zsxwing] Merge branch 'master' into input-metadata 450a46c [zsxwing] Address comments 1d94582 [zsxwing] Raname InputInfo to StreamInputInfo and change "metadata" to Map[String, Any] d496ae9 [zsxwing] Add input metadata in the batch page
* [SPARK-6287] [MESOS] Add dynamic allocation to the coarse-grained Mesos ↵Iulian Dragos2015-07-096-56/+331
| | | | | | | | | | | | | | | | scheduler This is largely based on extracting the dynamic allocation parts from tnachen's #3861. Author: Iulian Dragos <jaguarul@gmail.com> Closes #4984 from dragos/issue/mesos-coarse-dynamicAllocation and squashes the following commits: 39df8cd [Iulian Dragos] Update tests to latest changes in core. 9d2c9fa [Iulian Dragos] Remove adjustment of executorLimitOption in doKillExecutors. 8b00f52 [Iulian Dragos] Latest round of reviews. 0cd00e0 [Iulian Dragos] Add persistent shuffle directory 15c45c1 [Iulian Dragos] Add dynamic allocation to the Spark coarse-grained scheduler.
* [SPARK-2017] [UI] Stage page hangs with many tasksAndrew Or2015-07-091-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | (This reopens a patch that was closed in the past: #6248) When you view the stage page while running the following: ``` sc.parallelize(1 to X, 10000).count() ``` The page never loads, the job is stalled, and you end up running into an OOM: ``` HTTP ERROR 500 Problem accessing /stages/stage/. Reason: Server Error Caused by: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2367) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130) ``` This patch compresses Jetty responses in gzip. The correct long-term fix is to add pagination. Author: Andrew Or <andrew@databricks.com> Closes #7296 from andrewor14/gzip-jetty and squashes the following commits: a051c64 [Andrew Or] Use GZIP to compress Jetty responses
* [SPARK-7419] [STREAMING] [TESTS] Fix CheckpointSuite.recovery with file ↵zsxwing2015-07-091-8/+10
| | | | | | | | | | | | | | | input stream Fix this failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2886/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_with_file_input_stream/ To reproduce this failure, you can add `Thread.sleep(2000)` before this line https://github.com/apache/spark/blob/a9c4e29950a14e32acaac547e9a0e8879fd37fc9/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala#L477 Author: zsxwing <zsxwing@gmail.com> Closes #7323 from zsxwing/SPARK-7419 and squashes the following commits: b3caf58 [zsxwing] Fix CheckpointSuite.recovery with file input stream
* [SPARK-8953] SPARK_EXECUTOR_CORES is not read in SparkSubmitxutingjun2015-07-091-0/+1
| | | | | | | | | | The configuration ```SPARK_EXECUTOR_CORES``` won't put into ```SparkConf```, so it has no effect to the dynamic executor allocation. Author: xutingjun <xutingjun@huawei.com> Closes #7322 from XuTingjun/SPARK_EXECUTOR_CORES and squashes the following commits: 2cafa89 [xutingjun] make SPARK_EXECUTOR_CORES has effect to dynamicAllocation
* [MINOR] [STREAMING] Fix log statements in ReceiverSupervisorImplTathagata Das2015-07-091-3/+3
| | | | | | | | | | Log statements incorrectly showed that the executor was being stopped when receiver was being stopped. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #7328 from tdas/fix-log and squashes the following commits: 9cc6e99 [Tathagata Das] Fix log statements.
* [SPARK-8247] [SPARK-8249] [SPARK-8252] [SPARK-8254] [SPARK-8257] ↵Cheng Hao2015-07-097-24/+1202
| | | | | | | | | | | [SPARK-8258] [SPARK-8259] [SPARK-8261] [SPARK-8262] [SPARK-8253] [SPARK-8260] [SPARK-8267] [SQL] Add String Expressions Author: Cheng Hao <hao.cheng@intel.com> Closes #6762 from chenghao-intel/str_funcs and squashes the following commits: b09a909 [Cheng Hao] update the code as feedback 7ebbf4c [Cheng Hao] Add more string expressions
* [SPARK-8703] [ML] Add CountVectorizer as a ml transformer to convert ↵Yuhao Yang2015-07-092-0/+155
| | | | | | | | | | | | | | | | | | | | | | | | document to words count vector jira: https://issues.apache.org/jira/browse/SPARK-8703 Converts a text document to a sparse vector of token counts. I can further add an estimator to extract vocabulary from corpus if that's appropriate. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #7084 from hhbyyh/countVectorization and squashes the following commits: 5f3f655 [Yuhao Yang] text change 24728e4 [Yuhao Yang] style improvement 576728a [Yuhao Yang] rename to model and some fix 1deca28 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization 99b0c14 [Yuhao Yang] undo extension from HashingTF 12c2dc8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization 7ee1c31 [Yuhao Yang] extends HashingTF 809fb59 [Yuhao Yang] minor fix for ut 7c61fb3 [Yuhao Yang] add countVectorizer
* [SPARK-8863] [EC2] Check aws access key from aws credentials if there is no ↵JPark2015-07-091-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | boto config 'spark_ec2.py' use boto to control ec2. And boto can support '~/.aws/credentials' which is AWS CLI default configuration file. We can check this information from ref of boto. "A boto config file is a text file formatted like an .ini configuration file that specifies values for options that control the behavior of the boto library. In Unix/Linux systems, on startup, the boto library looks for configuration files in the following locations and in the following order: /etc/boto.cfg - for site-wide settings that all users on this machine will use (if profile is given) ~/.aws/credentials - for credentials shared between SDKs (if profile is given) ~/.boto - for user-specific settings ~/.aws/credentials - for credentials shared between SDKs ~/.boto - for user-specific settings" * ref of boto: http://boto.readthedocs.org/en/latest/boto_config_tut.html * ref of aws cli : http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html However 'spark_ec2.py' only check boto config & environment variable even if there is '~/.aws/credentials', and 'spark_ec2.py' is terminated. So I changed to check '~/.aws/credentials'. cc rxin Jira : https://issues.apache.org/jira/browse/SPARK-8863 Author: JPark <JPark@JPark.me> Closes #7252 from JuhongPark/master and squashes the following commits: 23c5792 [JPark] Check aws access key from aws credentials if there is no boto config
* [SPARK-8938][SQL] Implement toString for Interval data typeWenchen Fan2015-07-093-6/+119
| | | | | | | | Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7315 from cloud-fan/toString and squashes the following commits: 4fc8d80 [Wenchen Fan] Implement toString for Interval data type
* [SPARK-8926][SQL] Code review followup.Reynold Xin2015-07-094-6/+23
| | | | | | | | | | | I merged https://github.com/apache/spark/pull/7303 so it unblocks another PR. This addresses my own code review comment for that PR. Author: Reynold Xin <rxin@databricks.com> Closes #7313 from rxin/adt and squashes the following commits: 7ade82b [Reynold Xin] Fixed unit tests. f8d5533 [Reynold Xin] [SPARK-8926][SQL] Code review followup.
* [SPARK-8948][SQL] Remove ExtractValueWithOrdinal abstract classReynold Xin2015-07-091-20/+34
| | | | | | | | | | | | Also added more documentation for the file. Author: Reynold Xin <rxin@databricks.com> Closes #7316 from rxin/extract-value and squashes the following commits: 069cb7e [Reynold Xin] Removed ExtractValueWithOrdinal. 621b705 [Reynold Xin] Reverted a line. 11ebd6c [Reynold Xin] [Minor][SQL] Improve documentation for complex type extractors.
* [SPARK-8940] [SPARKR] Don't overwrite given schema in createDataFrameLiang-Chi Hsieh2015-07-091-1/+3
| | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8940 The given `schema` parameter will be overwritten in `createDataFrame` now. If it is not null, we shouldn't overwrite it. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #7311 from viirya/df_not_overwrite_schema and squashes the following commits: 2385139 [Liang-Chi Hsieh] Don't overwrite given schema if it is not null.
* [SPARK-8830] [SQL] native levenshtein distanceTarek Auel2015-07-094-7/+97
| | | | | | | | | | | | | | | | | Jira: https://issues.apache.org/jira/browse/SPARK-8830 rxin and HuJiayin can you have a look on it. Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7236 from tarekauel/native-levenshtein-distance and squashes the following commits: ee4c4de [Tarek Auel] [SPARK-8830] implemented improvement proposals c252e71 [Tarek Auel] [SPARK-8830] removed chartAt; use unsafe method for byte array comparison ddf2222 [Tarek Auel] Merge branch 'master' into native-levenshtein-distance 179920a [Tarek Auel] [SPARK-8830] added description 5e9ed54 [Tarek Auel] [SPARK-8830] removed StringUtils import dce4308 [Tarek Auel] [SPARK-8830] native levenshtein distance
* [SPARK-8931] [SQL] Fallback to interpreted evaluation if failed to compile ↵Davies Liu2015-07-092-6/+58
| | | | | | | | | | | | | | | | | in codegen Exception will not be catched during tests. cc marmbrus rxin Author: Davies Liu <davies@databricks.com> Closes #7309 from davies/fallback and squashes the following commits: 969a612 [Davies Liu] throw exception during tests f844f77 [Davies Liu] fallback a3091bc [Davies Liu] Merge branch 'master' of github.com:apache/spark into fallback 364a0d6 [Davies Liu] fallback to interpret mode if failed to compile
* [SPARK-6266] [MLLIB] PySpark SparseVector missing doc for size, indices, valueslewuathe2015-07-091-2/+7
| | | | | | | | | | | | Write missing pydocs in `SparseVector` attributes. Author: lewuathe <lewuathe@me.com> Closes #7290 from Lewuathe/SPARK-6266 and squashes the following commits: 51d9895 [lewuathe] Update docs 0480d35 [lewuathe] Merge branch 'master' into SPARK-6266 ba42cf3 [lewuathe] [SPARK-6266] PySpark SparseVector missing doc for size, indices, values
* [SPARK-8942][SQL] use double not decimal when cast double and float to timestampWenchen Fan2015-07-091-12/+6
| | | | | | | | Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7312 from cloud-fan/minor and squashes the following commits: a4589fa [Wenchen Fan] use double not decimal when cast double and float to timestamp
* [SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when ↵Weizhong Lin2015-07-082-7/+9
| | | | | | | | | | | | handling Parquet LISTs in compatible mode This PR is based on #7209 authored by Sephiroth-Lin. Author: Weizhong Lin <linweizhong@huawei.com> Closes #7314 from liancheng/spark-8928 and squashes the following commits: 75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
* Revert "[SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- ↵Cheng Lian2015-07-082-9/+7
| | | | | | when handling Parquet LISTs in compatible mode" This reverts commit 3dab0da42940a46f0c4aa4853bdb5c64c4cb2613.
* [SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when ↵Cheng Lian2015-07-082-7/+9
| | | | | | | | | | | | handling Parquet LISTs in compatible mode This PR is based on #7209 authored by Sephiroth-Lin. Author: Weizhong Lin <linweizhong@huawei.com> Closes #7304 from liancheng/spark-8928 and squashes the following commits: 75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
* Closes #7310.Reynold Xin2015-07-080-0/+0
|
* [SPARK-8926][SQL] Good errors for ExpectsInputType expressionsMichael Armbrust2015-07-0813-143/+256
| | | | | | | | | | | | | For example: `cannot resolve 'testfunction(null)' due to data type mismatch: argument 1 is expected to be of type int, however, null is of type datetype.` Author: Michael Armbrust <michael@databricks.com> Closes #7303 from marmbrus/expectsTypeErrors and squashes the following commits: c654a0e [Michael Armbrust] fix udts and make errors pretty 137160d [Michael Armbrust] style 5428fda [Michael Armbrust] style 10fac82 [Michael Armbrust] [SPARK-8926][SQL] Good errors for ExpectsInputType expressions
* [SPARK-8937] [TEST] A setting `spark.unsafe.exceptionOnMemoryLeak ` is ↵Kousuke Saruta2015-07-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | missing in ScalaTest config. `spark.unsafe.exceptionOnMemoryLeak` is present in the config of surefire. ``` <!-- Surefire runs all Java tests --> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-surefire-plugin</artifactId> <version>2.18.1</version> <!-- Note config is repeated in scalatest config --> ... <spark.unsafe.exceptionOnMemoryLeak>true</spark.unsafe.exceptionOnMemoryLeak> </systemProperties> ... ``` but is absent in the config ScalaTest. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #7308 from sarutak/add-setting-for-memory-leak and squashes the following commits: 95644e7 [Kousuke Saruta] Added a setting for memory leak
* [SPARK-8910] Fix MiMa flaky due to port contention issueAndrew Or2015-07-082-7/+8
| | | | | | | | | | | | Due to the way MiMa works, we currently start a `SQLContext` pretty early on. This causes us to start a `SparkUI` that attempts to bind to port 4040. Because many tests run in parallel on the Jenkins machines, this causes port contention sometimes and fails the MiMa tests. Note that we already disabled the SparkUI for scalatests. However, the MiMa test is run before we even have a chance to load the default scalatest settings, so we need to explicitly disable the UI ourselves. Author: Andrew Or <andrew@databricks.com> Closes #7300 from andrewor14/mima-flaky and squashes the following commits: b55a547 [Andrew Or] Do not enable SparkUI during tests
* [SPARK-8932] Support copy() for UnsafeRows that do not use ObjectPoolsJosh Rosen2015-07-084-19/+87
| | | | | | | | | | | | | | | | We call Row.copy() in many places throughout SQL but UnsafeRow currently throws UnsupportedOperationException when copy() is called. Supporting copying when ObjectPool is used may be difficult, since we may need to handle deep-copying of objects in the pool. In addition, this copy() method needs to produce a self-contained row object which may be passed around / buffered by downstream code which does not understand the UnsafeRow format. In the long run, we'll need to figure out how to handle the ObjectPool corner cases, but this may be unnecessary if other changes are made. Therefore, in order to unblock my sort patch (#6444) I propose that we support copy() for the cases where UnsafeRow does not use an ObjectPool and continue to throw UnsupportedOperationException when an ObjectPool is used. This patch accomplishes this by modifying UnsafeRow so that it knows the size of the row's backing data in order to be able to copy it into a byte array. Author: Josh Rosen <joshrosen@databricks.com> Closes #7306 from JoshRosen/SPARK-8932 and squashes the following commits: 338e6bf [Josh Rosen] Support copy for UnsafeRows that do not use ObjectPools.
* [SPARK-8866][SQL] use 1us precision for timestamp typeYijie Shen2015-07-0811-50/+50
| | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8866 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7283 from yijieshen/micro_timestamp and squashes the following commits: dc735df [Yijie Shen] update CastSuite to avoid round error 714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision lose c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp 8d4aa6b [Yijie Shen] use 1us precision for timestamp type
* [SPARK-8927] [DOCS] Format wrong for some config descriptionsJonathan Alter2015-07-091-2/+2
| | | | | | | | | | A couple descriptions were not inside `<td></td>` and were being displayed immediately under the section title instead of in their row. Author: Jonathan Alter <jonalter@users.noreply.github.com> Closes #7292 from jonalter/docs-config and squashes the following commits: 5ce1570 [Jonathan Alter] [DOCS] Format wrong for some config descriptions
* [SPARK-8450] [SQL] [PYSARK] cleanup type converter for Python DataFrameDavies Liu2015-07-089-94/+84
| | | | | | | | | | | | | | | | | | | | This PR fixes the converter for Python DataFrame, especially for DecimalType Closes #7106 Author: Davies Liu <davies@databricks.com> Closes #7131 from davies/decimal_python and squashes the following commits: 4d3c234 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python 20531d6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python 7d73168 [Davies Liu] fix conflit 6cdd86a [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python 7104e97 [Davies Liu] improve type infer 9cd5a21 [Davies Liu] run python tests with SPARK_PREPEND_CLASSES 829a05b [Davies Liu] fix UDT in python c99e8c5 [Davies Liu] fix mima c46814a [Davies Liu] convert decimal for Python DataFrames
* [SPARK-8914][SQL] Remove RDDApiKousuke Saruta2015-07-083-87/+24
| | | | | | | | | | | As rxin suggested in #7298 , we should consider to remove `RDDApi`. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #7302 from sarutak/remove-rddapi and squashes the following commits: e495d35 [Kousuke Saruta] Fixed mima cb7ebb9 [Kousuke Saruta] Removed overriding RDDApi
* [SPARK-5016] [MLLIB] Distribute GMM mixture components to executorsFeynman Liang2015-07-081-8/+36
| | | | | | | | | | | | | | | | | | | | Distribute expensive portions of computation for Gaussian mixture components (in particular, pre-computation of `MultivariateGaussian.rootSigmaInv`, the inverse covariance matrix and covariance determinant) across executors. Repost of PR#4654. Notes for reviewers: * What should be the policy for when to distribute computation. Always? When numClusters > threshold? User-specified param? TODO: * Performance testing and comparison for large number of clusters Author: Feynman Liang <fliang@databricks.com> Closes #7166 from feynmanliang/GMM_parallel_mixtures and squashes the following commits: 4f351fa [Feynman Liang] Update heuristic and scaladoc 5ea947e [Feynman Liang] Fix parallelization logic 00eb7db [Feynman Liang] Add helper method for GMM's M step, remove distributeGaussians flag e7c8127 [Feynman Liang] Add distributeGaussians flag and tests 1da3c7f [Feynman Liang] Distribute mixtures
* [SPARK-8877] [MLLIB] Public API for association rule generationFeynman Liang2015-07-083-3/+55
| | | | | | | | | | | Adds FPGrowth.generateAssociationRules to public API for generating association rules after mining frequent itemsets. Author: Feynman Liang <fliang@databricks.com> Closes #7271 from feynmanliang/SPARK-8877 and squashes the following commits: 83b8baf [Feynman Liang] Add API Doc 867abff [Feynman Liang] Add FPGrowth.generateAssociationRules and change access modifiers for AssociationRules
* [SPARK-8068] [MLLIB] Add confusionMatrix method at class MulticlassMetrics ↵Yanbo Liang2015-07-081-0/+11
| | | | | | | | | | | | in pyspark/mllib Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib Author: Yanbo Liang <ybliang8@gmail.com> Closes #7286 from yanboliang/spark-8068 and squashes the following commits: 6109fe1 [Yanbo Liang] Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
* [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for ↵Cheng Lian2015-07-0827-919/+5984
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | interoperability and backwards-compatibility This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support. And this one fixes the read path. Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]). ### Major changes 1. `CatalystConverter` class hierarchy refactoring - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`. Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending a key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. Root converter doesn't have a parent and thus uses a `NoopUpdater`. This simplifies the design since converters don't need to care about details of their parent converters anymore. - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter` Specifically, now all row objects are represented by `SpecificMutableRow` during conversion. - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter` `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal. The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way. - Implements backwards-compatibility rules in `CatalystArrayConverter` When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`. 2. Requested columns handling When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` which contains all requested columns. This is not preferable when taking compatibility and interoperability into consideration. Because the actual Parquet file may have different physical structure from the converted schema. In this PR, the schema for requested columns is constructed using the following method: - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column. - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`. - Unions all single-field `MessageType`s into a full schema containing all requested fields With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-files. ### Testing This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in. [1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1 [2]: https://issues.apache.org/jira/browse/SPARK-6774 [3]: https://issues.apache.org/jira/browse/SPARK-6123 [4]: https://issues.apache.org/jira/browse/SPARK-8848 Author: Cheng Lian <lian@databricks.com> Closes #7231 from liancheng/spark-6776 and squashes the following commits: 360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite c6fbc06 [Cheng Lian] Removes WIP file committed by mistake b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa 598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift 926af87 [Cheng Lian] Simplifies Parquet compatibility test suites 7946ee1 [Cheng Lian] Fixes Scala styling issues 3d7ab36 [Cheng Lian] Fixes .rat-excludes a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation 1d390aa [Cheng Lian] Adds parquet-thrift compatibility test 440f7b3 [Cheng Lian] Adds generated files to .rat-excludes 13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite 06cfe9d [Cheng Lian] Adds comments about TimestampType handling a099d3e [Cheng Lian] More comments 0cc1b37 [Cheng Lian] Fixes MiMa checks 884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes 802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns 38fe1e7 [Cheng Lian] Adds explicit return type 7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change 1781dff [Cheng Lian] Adds test case for SPARK-8811 6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals a74fb2c [Cheng Lian] More comments 0525346 [Cheng Lian] Removes old Parquet record converters 03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
* [SPARK-8902] Correctly print hostname in errorDaniel Darabos2015-07-091-2/+2
| | | | | | | | | | | | With "+" the strings are separate expressions, and format() is called on the last string before concatenation. (So substitution does not happen.) Without "+" the string literals are merged first by the parser, so format() is called on the complete string. Should I make a JIRA for this? Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #7288 from darabos/patch-2 and squashes the following commits: be0d3b7 [Daniel Darabos] Correctly print hostname in error
* [SPARK-8700][ML] Disable feature scaling in Logistic RegressionDB Tsai2015-07-085-117/+384
| | | | | | | | | | | | | | | | | | | | All compressed sensing applications, and some of the regression use-cases will have better result by turning the feature scaling off. However, if we implement this naively by training the dataset without doing any standardization, the rate of convergency will not be good. This can be implemented by still standardizing the training dataset but we penalize each component differently to get effectively the same objective function but a better numerical problem. As a result, for those columns with high variances, they will be penalized less, and vice versa. Without this, since all the features are standardized, so they will be penalized the same. In R, there is an option for this. `standardize` Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian". +cc holdenk mengxr jkbradley Author: DB Tsai <dbt@netflix.com> Closes #7080 from dbtsai/lors and squashes the following commits: 877e6c7 [DB Tsai] repahse the doc 7cf45f2 [DB Tsai] address feedback 78d75c9 [DB Tsai] small change c2c9e60 [DB Tsai] style 6e1a8e0 [DB Tsai] first commit
* [SPARK-8908] [SQL] Add () to distinct definition in dataframeCheolsoo Park2015-07-081-1/+1
| | | | | | | | | | Adding `()` to the definition of `distinct` in DataFrame allows distinct to be called with parentheses, which is consistent with `dropDuplicates`. Author: Cheolsoo Park <cheolsoop@netflix.com> Closes #7298 from piaozhexiu/SPARK-8908 and squashes the following commits: 7f0d923 [Cheolsoo Park] Add () to distinct definition in dataframe
* [SPARK-8909][Documentation] Change the scala example in sql-programmi…Alok Singh2015-07-081-1/+1
| | | | | | | | | | | …ng-guide#Manually Specifying Options to be in sync with java,python, R version Author: Alok Singh <“singhal@us.ibm.com”> Closes #7299 from aloknsingh/aloknsingh_SPARK-8909 and squashes the following commits: d3c20ba [Alok Singh] fix the file to .parquet from .json d476140 [Alok Singh] [SPARK-8909][Documentation] Change the scala example in sql-programming-guide#Manually Specifying Options to be in sync with java,python, R version
* [SPARK-8457] [ML] NGram DocumentationFeynman Liang2015-07-081-0/+88
| | | | | | | | | | | | Add documentation for NGram feature transformer. Author: Feynman Liang <fliang@databricks.com> Closes #7244 from feynmanliang/SPARK-8457 and squashes the following commits: 5aface9 [Feynman Liang] Pretty print Scala output and add API doc to each codetab 60d5ac0 [Feynman Liang] Inline API doc and fix indentation 736ccbc [Feynman Liang] NGram feature transformer documentation
* [SPARK-8783] [SQL] CTAS with WITH clause does not workKeuntae Park2015-07-082-1/+19
| | | | | | | | | | | | | Currently, CTESubstitution only handles the case that WITH is on the top of the plan. I think it SHOULD handle the case that WITH is child of CTAS. This patch simply changes 'match' to 'transform' for recursive search of WITH in the plan. Author: Keuntae Park <sirpkt@apache.org> Closes #7180 from sirpkt/SPARK-8783 and squashes the following commits: e4428f0 [Keuntae Park] Merge remote-tracking branch 'upstream/master' into CTASwithWITH 1671c77 [Keuntae Park] WITH clause can be inside CTAS
* [SPARK-7785] [MLLIB] [PYSPARK] Add __str__ and __repr__ to MatricesMechCoder2015-07-082-1/+178
| | | | | | | | | | | | | | | | | Adding __str__ and __repr__ to DenseMatrix and SparseMatrix Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6342 from MechCoder/spark-7785 and squashes the following commits: 7b9a82c [MechCoder] Add tests for greater than 16 elements b88e9dd [MechCoder] Increment limit to 16 1425a01 [MechCoder] Change tests 36bd166 [MechCoder] Change str and repr representation 97f0da9 [MechCoder] zip is same as izip in python3 94ca4b2 [MechCoder] Added doctests and iterate over values instead of colPtrs b26fa89 [MechCoder] minor 394dde9 [MechCoder] [SPARK-7785] Add __str__ and __repr__ to Matrices
* [SPARK-8900] [SPARKR] Fix sparkPackages in init documentationShivaram Venkataraman2015-07-081-1/+1
| | | | | | | | | | cc pwendell Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #7293 from shivaram/sparkr-packages-doc and squashes the following commits: c91471d [Shivaram Venkataraman] Fix sparkPackages in init documentation
* [SPARK-8657] [YARN] Fail to upload resource to viewfsTao Li2015-07-081-1/+2
| | | | | | | | | | | Fail to upload resource to viewfs in spark-1.4 JIRA Link: https://issues.apache.org/jira/browse/SPARK-8657 Author: Tao Li <litao@sogou-inc.com> Closes #7125 from litao-buptsse/SPARK-8657-for-master and squashes the following commits: 65b13f4 [Tao Li] [SPARK-8657] [YARN] Fail to upload resource to viewfs
* [SPARK-8888][SQL] Use java.util.HashMap in DynamicPartitionWriterContainer.Reynold Xin2015-07-081-13/+23
| | | | | | | | | | Just a baby step towards making it more efficient. Author: Reynold Xin <rxin@databricks.com> Closes #7282 from rxin/SPARK-8888 and squashes the following commits: 3da51ae [Reynold Xin] [SPARK-8888][SQL] Use java.util.HashMap in DynamicPartitionWriterContainer.
* [SPARK-8753][SQL] Create an IntervalType data typeWenchen Fan2015-07-087-20/+185
| | | | | | | | | | | | | | | | | | | | We need a new data type to represent time intervals. Because we can't determine how many days in a month, so we need 2 values for interval: a int `months`, a long `microseconds`. The interval literal syntax looks like: `interval 3 years -4 month 4 weeks 3 second` Because we use number of 100ns as value of `TimestampType`, so it may not makes sense to support nano second unit. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7226 from cloud-fan/interval and squashes the following commits: 632062d [Wenchen Fan] address comments ac348c3 [Wenchen Fan] use case class 0342d2e [Wenchen Fan] use array byte df9256c [Wenchen Fan] fix style fd6f18a [Wenchen Fan] address comments 1856af3 [Wenchen Fan] support interval type
* [SPARK-5707] [SQL] fix serialization of generated projectionDavies Liu2015-07-083-4/+3
| | | | | | | | Author: Davies Liu <davies@databricks.com> Closes #7272 from davies/fix_projection and squashes the following commits: 075ef76 [Davies Liu] fix codegen with BroadcastHashJion
* [SPARK-6912] [SQL] Throw an AnalysisException when unsupported Java Map<K,V> ↵Takeshi YAMAMURO2015-07-084-0/+108
| | | | | | | | | | | | | types used in Hive UDF To make UDF developers understood, throw an exception when unsupported Map<K,V> types used in Hive UDF. This fix is the same with #7248. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #7257 from maropu/ThrowExceptionWhenMapUsed and squashes the following commits: 916099a [Takeshi YAMAMURO] Fix style errors 7886dcc [Takeshi YAMAMURO] Throw an exception when Map<> used in Hive UDF
* [SPARK-8785] [SQL] Improve Parquet schema mergingLiang-Chi Hsieh2015-07-081-34/+48
| | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8785 Currently, the parquet schema merging (`ParquetRelation2.readSchema`) may spend much time to merge duplicate schema. We can select only non duplicate schema and merge them later. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #7182 from viirya/improve_parquet_merging and squashes the following commits: 5cf934f [Liang-Chi Hsieh] Refactor it to make it faster. f3411ea [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into improve_parquet_merging a63c3ff [Liang-Chi Hsieh] Improve Parquet schema merging.