Commit message | Author | Date | Files | Lines
* [SPARK-12269][STREAMING][KINESIS] Update aws-java-sdk version | BrianLondon | 2016-01-11 | 5 | -19/+19
  The current Spark Streaming Kinesis connector references a quite old version (1.9.40) of the AWS Java SDK (1.10.40 is current). Numerous AWS features, including Kinesis Firehose, are unavailable in 1.9. Those two versions of the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 respectively), such that one cannot include the current AWS SDK in a project that also uses the Spark Streaming Kinesis ASL.
  Author: BrianLondon <brian@seatgeek.com>
  Closes #10256 from BrianLondon/master.
* Removed lambda from sortByKey() | Udo Klein | 2016-01-11 | 1 | -1/+1
  According to the documentation, the sortByKey method does not take a lambda as an argument, so the example is flawed. Removed the argument completely, as this will default to an ascending sort.
  Author: Udo Klein <git@blinkenlight.net>
  Closes #10640 from udoklein/patch-1.
* [SPARK-12539][FOLLOW-UP] Always sort in partitioning writer | Wenchen Fan | 2016-01-11 | 2 | -147/+48
  Addresses comments in #10498, especially https://github.com/apache/spark/pull/10498#discussion_r49021259.
  Author: Wenchen Fan <wenchen@databricks.com>
  This patch had conflicts when merged, resolved by Committer: Reynold Xin <rxin@databricks.com>
  Closes #10638 from cloud-fan/bucket-write.
* [SPARK-12734][HOTFIX][TEST-MAVEN] Fix bug in Netty exclusions | Josh Rosen | 2016-01-11 | 6 | -47/+11
  This is a hotfix for a build bug introduced by the Netty exclusion changes in #10672. We can't exclude `io.netty:netty` because Akka depends on it. There's no direct conflict between `io.netty:netty` and `io.netty:netty-all`, because the former puts classes in the `org.jboss.netty` namespace while the latter uses the `io.netty` namespace. However, there still is a conflict between `org.jboss.netty:netty` and `io.netty:netty`, so we need to continue to exclude the JBoss version of that artifact.
  While the diff here looks somewhat large, note that this is only a revert of some of the changes from #10672. You can see the net changes in pom.xml at https://github.com/apache/spark/compare/3119206b7188c23055621dfeaf6874f21c711a82...5211ab8#diff-600376dffeb79835ede4a0b285078036
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10693 from JoshRosen/netty-hotfix.
* [SPARK-4628][BUILD] Add a resolver to MiMaBuild.scala for mqttv3(1.0.1) | Kousuke Saruta | 2016-01-10 | 1 | -1/+6
  #10659 removed the repository `https://repo.eclipse.org/content/repositories/paho-releases`, but it's needed by MiMa because `spark-streaming-mqtt(1.6.0)` depends on `mqttv3(1.0.1)`, which is provided only by the removed repository; Maven Central provides only `mqttv3(1.0.2)` for now. Without it, if `mqttv3(1.0.1)` is absent from the local repository, dev/mima fails.
  JoshRosen Do you have any other better idea?
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
  Closes #10688 from sarutak/SPARK-4628-followup.
* [SPARK-3873][BUILD] Enable import ordering error checking | Marcelo Vanzin | 2016-01-10 | 36 | -64/+62
  Turn import ordering violations into build errors, plus a few adjustments to account for how the checker behaves. I'm a little on the fence about whether the existing code is right, but it's easier to appease the checker than to discuss what's the more correct order here. Plus a few fixes to imports that crept in since my recent cleanups.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #10612 from vanzin/SPARK-3873-enable.
* [SPARK-12734][BUILD] Fix Netty exclusion and use Maven Enforcer to prevent future bugs | Josh Rosen | 2016-01-10 | 7 | -18/+64
  Netty classes are published under multiple artifacts with different names, so our build needs to exclude the `io.netty:netty` and `org.jboss.netty:netty` versions of the Netty artifact. However, our existing exclusions were incomplete, leading to situations where duplicate Netty classes would wind up on the classpath and cause compile errors (or worse).
  This patch fixes the exclusion issue by adding more exclusions and uses Maven Enforcer's [banned dependencies](https://maven.apache.org/enforcer/enforcer-rules/bannedDependencies.html) rule to prevent these classes from accidentally being reintroduced. I also updated `dev/test-dependencies.sh` to run `mvn validate` so that the enforcer rules can run as part of pull request builds.
  /cc rxin srowen pwendell. I'd like to backport at least the exclusion portion of this fix to `branch-1.5` in order to fix the documentation publishing job, which fails nondeterministically due to incompatible versions of Netty classes taking precedence on the compile-time classpath.
  Author: Josh Rosen <rosenville@gmail.com>
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10672 from JoshRosen/enforce-netty-exclusions.
* [SPARK-12692][BUILD][GRAPHX] Scala style: Fix the style violation (space before "," or ":") | Kousuke Saruta | 2016-01-10 | 5 | -9/+8
  Fix the style violation (space before `,` and `:`). This PR is a followup for #10643.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
  Closes #10683 from sarutak/SPARK-12692-followup-graphx.
* [SPARK-12692][BUILD][MLLIB] Scala style: Fix the style violation (space before "," or ":") | Kousuke Saruta | 2016-01-10 | 17 | -22/+22
  Fix the style violation (space before `,` and `:`). This PR is a followup for #10643.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
  Closes #10684 from sarutak/SPARK-12692-followup-mllib.
* [SPARK-12736][CORE][DEPLOY] Standalone Master cannot be started due to NoClassDefFoundError: org/spark-project/guava/collect/Maps | Jacek Laskowski | 2016-01-10 | 1 | -0/+1
  /cc srowen rxin
  Author: Jacek Laskowski <jacek@japila.pl>
  Closes #10674 from jaceklaskowski/SPARK-12736.
* [SPARK-12735] Consolidate & move spark-ec2 to AMPLab managed repository | Reynold Xin | 2016-01-09 | 14 | -1808/+3
  Author: Reynold Xin <rxin@databricks.com>
  Closes #10673 from rxin/SPARK-12735.
* Close #10665 | Reynold Xin | 2016-01-09 | 0 | -0/+0
* [SPARK-12340] Fix overflow in various take functions | Reynold Xin | 2016-01-09 | 6 | -22/+19
  This is a follow-up for the original patch #10562.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #10670 from rxin/SPARK-12340.
* [SPARK-12645][SPARKR] SparkR support hash function | Yanbo Liang | 2016-01-09 | 4 | -1/+26
  Add `hash` function for SparkR `DataFrame`.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #10597 from yanboliang/spark-12645.
* [SPARK-12577][SQL] Better support of parentheses in partition by and order by clause of window function's over clause | Liang-Chi Hsieh | 2016-01-08 | 2 | -11/+32
  JIRA: https://issues.apache.org/jira/browse/SPARK-12577
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #10620 from viirya/fix-parentheses.
* [SPARK-4628][BUILD] Remove all non-Maven-Central repositories from build | Josh Rosen | 2016-01-08 | 4 | -95/+7
  This patch removes all non-Maven-Central repositories from Spark's build, thereby avoiding any risk of future build-breaks due to us accidentally depending on an artifact which is not present in an immutable public Maven repository.
  I tested this by running
  ```
  build/mvn \
    -Phive \
    -Phive-thriftserver \
    -Pkinesis-asl \
    -Pspark-ganglia-lgpl \
    -Pyarn \
    dependency:go-offline
  ```
  inside of a fresh Ubuntu Docker container with no Ivy or Maven caches (I did a similar test for SBT).
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10659 from JoshRosen/SPARK-4628.
* [SPARK-12730][TESTS] De-duplicate some test code in BlockManagerSuite | Josh Rosen | 2016-01-08 | 1 | -63/+25
  This patch deduplicates some test code in BlockManagerSuite. I'm splitting this change off from a larger PR in order to make things easier to review.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10667 from JoshRosen/block-mgr-tests-cleanup.
* [SPARK-12593][SQL] Converts resolved logical plan back to SQL | Cheng Lian | 2016-01-08 | 47 | -146/+1087
  This PR tries to enable Spark SQL to convert resolved logical plans back to SQL query strings. For now, the major use case is to canonicalize Spark SQL native view support. The major entry point is `SQLBuilder.toSQL`, which returns an `Option[String]` if the logical plan is recognized.
  The current version is still in WIP status, and is quite limited. Known limitations include:
  1. The logical plan must be analyzed but not optimized. The optimizer erases `Subquery` operators, which contain necessary scope information for SQL generation. Future versions should be able to recover erased scope information by inserting subqueries when necessary.
  2. The logical plan must be created using a HiveQL query string. Query plans generated by composing arbitrary DataFrame API combinations are not supported yet. Operators within these query plans need to be rearranged into a canonical form that is more suitable for direct SQL generation. For example, the following query plan
     ```
     Filter (a#1 < 10)
     +- MetastoreRelation default, src, None
     ```
     needs to be canonicalized into the following form before SQL generation:
     ```
     Project [a#1, b#2, c#3]
     +- Filter (a#1 < 10)
        +- MetastoreRelation default, src, None
     ```
     Otherwise, the SQL generation process will have to handle a large number of special cases.
  3. Only a fraction of expressions and basic logical plan operators are supported in this PR. Currently, 95.7% (1720 out of 1798) query plans in `HiveCompatibilitySuite` can be successfully converted to SQL query strings. Known unsupported components are:
     - Expressions
       - Part of math expressions
       - Part of string expressions (buggy?)
       - Null expressions
       - Calendar interval literal
       - Part of date time expressions
       - Complex type creators
       - Special `NOT` expressions, e.g. `NOT LIKE` and `NOT IN`
     - Logical plan operators/patterns
       - Cube, rollup, and grouping set
       - Script transformation
       - Generator
       - Distinct aggregation patterns that fit the `DistinctAggregationRewriter` analysis rule
       - Window functions
  Support for window functions, generators, and cubes etc. will be added in follow-up PRs.
  This PR leverages `HiveCompatibilitySuite` for testing SQL generation in a "round-trip" manner:
  - For all select queries, we try to convert them back to SQL.
  - If the query plan is convertible, we parse the generated SQL into a new logical plan.
  - We run the new logical plan instead of the original one.
  If the query plan is inconvertible, the test case simply falls back to the original logic.
  TODO
  - [x] Fix failed test cases
  - [x] Support for more basic expressions and logical plan operators (e.g. distinct aggregation etc.)
  - [x] Comments and documentation
  Author: Cheng Lian <lian@databricks.com>
  Closes #10541 from liancheng/sql-generation.
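  A hedged usage sketch of the entry point described above. Only the `Option[String]` return type of `SQLBuilder.toSQL` is stated in the commit; the constructor shape and the surrounding `sqlContext` are assumed for illustration:
  ```scala
  // Assumes a HiveContext named sqlContext and the new SQLBuilder in scope.
  val analyzedPlan = sqlContext.sql("SELECT a FROM src WHERE a < 10").queryExecution.analyzed
  val maybeSql: Option[String] = new SQLBuilder(analyzedPlan, sqlContext).toSQL
  maybeSql match {
    case Some(sql) => println(s"Canonical SQL: $sql") // plan was convertible
    case None      => println("Plan not convertible") // fall back to the original plan
  }
  ```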
* [SPARK-4819] Remove Guava's "Optional" from public API | Sean Owen | 2016-01-08 | 19 | -85/+333
  Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`).
  See also https://github.com/apache/spark/pull/10512
  Author: Sean Owen <sowen@cloudera.com>
  Closes #10513 from srowen/SPARK-4819.
* [SPARK-12654] sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop | Thomas Graves | 2016-01-08 | 1 | -1/+8
  https://issues.apache.org/jira/browse/SPARK-12654
  The bug here is that WholeTextFileRDD.getPartitions has `val conf = getConf`. In getConf, if cloneConf=true it creates a new Hadoop Configuration, and then uses that to create a new newJobContext. The newJobContext will copy credentials around, but credentials are only present in a JobConf, not in a Hadoop Configuration. So basically, when it clones the Hadoop configuration it changes it from a JobConf to a Configuration and drops the credentials that were there. NewHadoopRDD just uses the conf passed in for getPartitions (not getConf), which is why it works.
  Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
  Closes #10651 from tgravescs/SPARK-12654.
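  A minimal sketch of the mechanism described above, using only standard Hadoop classes; this illustrates the problem, not the Spark fix itself:
  ```scala
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.mapred.JobConf

  // On a secure cluster the JobConf carries delegation tokens via its Credentials.
  val jobConf = new JobConf()
  val tokensBefore = jobConf.getCredentials.numberOfTokens

  // Copying it into a plain Configuration keeps the key/value entries, but a plain
  // Configuration has no Credentials at all, so anything built from the clone starts
  // without the tokens; that is why the cloneConf=true path fails on secure Hadoop.
  val cloned = new Configuration(jobConf)
  ```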
* Fixed numVertices in transitive closure example | Udo Klein | 2016-01-08 | 1 | -2/+2
  Author: Udo Klein <git@blinkenlight.net>
  Closes #10642 from udoklein/patch-2.
* [DOCUMENTATION] Doc fix of job scheduling | Jeff Zhang | 2016-01-08 | 1 | -1/+1
  spark.shuffle.service.enabled is a Spark application-related configuration; it is not necessary to set it in yarn-site.xml.
  Author: Jeff Zhang <zjffdu@apache.org>
  Closes #10657 from zjffdu/doc-fix.
* [SPARK-12701][CORE] FileAppender should use join to ensure writing thread completion | Bryan Cutler | 2016-01-08 | 1 | -10/+1
  Changed the logging FileAppender to use join in `awaitTermination` to ensure that the thread is properly finished before returning.
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #10654 from BryanCutler/fileAppender-join-thread-SPARK-12701.
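  A minimal sketch of the idea (not the actual FileAppender source): have `awaitTermination` join the writing thread so callers only proceed once the thread has genuinely finished, rather than returning on a completion flag:
  ```scala
  class SimpleAppender {
    @volatile private var stopped = false

    private val writingThread = new Thread("log-appender") {
      override def run(): Unit = {
        while (!stopped) { /* read from the input stream and append to the file */ }
      }
    }
    writingThread.start()

    def stop(): Unit = { stopped = true }

    def awaitTermination(): Unit = {
      writingThread.join() // blocks until run() has returned
    }
  }
  ```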
* [SPARK-12687][SQL] Support FROM clause surrounded by `()` | Liang-Chi Hsieh | 2016-01-08 | 3 | -2/+25
  JIRA: https://issues.apache.org/jira/browse/SPARK-12687
  Some queries such as `(select 1 as a) union (select 2 as a)` don't work. This patch fixes it.
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #10660 from viirya/fix-union.
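  A small usage sketch of the query shape fixed here, assuming a `SQLContext` named `sqlContext`:
  ```scala
  // Before this patch a parenthesized query in the FROM position failed to parse.
  sqlContext.sql("(SELECT 1 AS a) UNION (SELECT 2 AS a)").show()
  ```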
* [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition | Sean Owen | 2016-01-08 | 24 | -137/+123
  Fix most build warnings, mostly deprecated API usages. I'll annotate some of the changes below. CC rxin, who is leading the charge to remove the deprecated APIs.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #10570 from srowen/SPARK-12618.
* [SPARK-12692][BUILD] Scala style: check no white space before comma and colon | Kousuke Saruta | 2016-01-08 | 1 | -0/+6
  We should not put a white space before `,` and `:`, so let's check it. Because there are lots of style violations, first I'd like to add a checker and enable it at the `warning` level. Then I'd like to fix the style step by step.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
  Closes #10643 from sarutak/SPARK-12692.
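  An illustration of what the new checker flags (the rule is initially enabled at the `warning` level):
  ```scala
  def bad(x : Int , y : Int): Int = x + y  // space before ':' and ',' is flagged
  def good(x: Int, y: Int): Int = x + y    // compliant style
  ```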
* Fix indentation for the previous patch | Reynold Xin | 2016-01-07 | 1 | -10/+8
* [SPARK-12317][SQL] Support units (m,k,g) in SQLConf | Kevin Yu | 2016-01-07 | 2 | -1/+60
  This PR continues the previously closed PR #10314. With this PR, SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE accepts memory-size string conventions as input. For example, the user can now specify 10g for SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE in SQLConf.
  marmbrus srowen: Can you help review this code change? Thanks.
  Author: Kevin Yu <qyu@us.ibm.com>
  Closes #10629 from kevinyu98/spark-12317.
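  A hedged usage sketch: with this change the target size can be given with a unit suffix rather than a raw byte count. The conf key shown is the one this SQLConf entry is documented to map to; treat it as illustrative:
  ```scala
  // Assumes an existing SQLContext named sqlContext.
  sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "10g")
  // Previously the value had to be spelled out in bytes, e.g. "10737418240".
  ```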
* [SPARK-12591][STREAMING] Register OpenHashMapBasedStateMap for Kryo | Shixiong Zhu | 2016-01-07 | 5 | -41/+174
  The default serializer in Kryo is FieldSerializer, which ignores transient fields and never calls `writeObject` or `readObject`. So we should register OpenHashMapBasedStateMap using `DefaultSerializer` to make it work with Kryo.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #10609 from zsxwing/SPARK-12591.
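  A sketch of the general idea only; the stand-in class and the explicit registration below are illustrative, while the commit itself registers `OpenHashMapBasedStateMap` via `DefaultSerializer` as noted above:
  ```scala
  import com.esotericsoftware.kryo.Kryo
  import com.esotericsoftware.kryo.serializers.JavaSerializer

  // Stand-in for a class whose real state lives in a transient field rebuilt by readObject.
  class StateMapLike extends Serializable {
    @transient var cache: Map[String, Int] = Map.empty
  }

  // Kryo's default FieldSerializer would skip the transient field and never call
  // readObject, so fall back to Java serialization for this class.
  val kryo = new Kryo()
  kryo.register(classOf[StateMapLike], new JavaSerializer())
  ```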
* [SPARK-12507][STREAMING][DOCUMENT] Expose closeFileAfterWrite and allowBatching configurations for Streaming | Shixiong Zhu | 2016-01-07 | 2 | -7/+23
  /cc tdas brkyvz
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #10453 from zsxwing/streaming-conf.
* [SPARK-12604][CORE] Addendum - use casting vs mapValues for countBy{Key,Value} | Sean Owen | 2016-01-07 | 2 | -2/+2
  Per rxin, let's use the casting for countByKey and countByValue as well. Let's see if this passes.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #10641 from srowen/SPARK-12604.2.
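  A hedged sketch of the "casting" idea: Scala `Long` values held in a map are already boxed as `java.lang.Long` at runtime, so the map can be cast once instead of being rebuilt with `mapValues`:
  ```scala
  val counts: scala.collection.Map[String, Long] = Map("a" -> 2L, "b" -> 1L)
  // Safe at runtime because the boxed representation of the values is identical.
  val javaCounts = counts.asInstanceOf[scala.collection.Map[String, java.lang.Long]]
  ```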
* [SPARK-12510][STREAMING] Refactor ActorReceiver to support Java | Shixiong Zhu | 2016-01-07 | 6 | -20/+202
  This PR includes the following changes:
  1. Rename `ActorReceiver` to `ActorReceiverSupervisor`
  2. Remove `ActorHelper`
  3. Add a new `ActorReceiver` for Scala and `JavaActorReceiver` for Java
  4. Add `JavaActorWordCount` example
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #10457 from zsxwing/java-actor-stream.
* [SPARK-12580][SQL] Remove string concatenations from usage and extended in @ExpressionDescription | Kazuaki Ishizaki | 2016-01-07 | 2 | -25/+25
  Use multi-line string literals for ExpressionDescription, with `// scalastyle:off line.size.limit` and `// scalastyle:on line.size.limit`. The policy, as described at https://github.com/apache/spark/pull/10488, is: let's use multi-line string literals. If we have to have a line with more than 100 characters, let's use `// scalastyle:off line.size.limit` and `// scalastyle:on line.size.limit` to just bypass the line-length requirement.
  Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
  Closes #10524 from kiszk/SPARK-12580.
* [SPARK-12598][CORE] Bug in setMinPartitions | Darek Blasiak | 2016-01-07 | 1 | -3/+2
  There is a bug in the calculation of `maxSplitSize`: `totalLen` should be divided by `minPartitions`, not by `files.size`.
  Author: Darek Blasiak <darek.blasiak@640labs.com>
  Closes #10546 from datafarmer/setminpartitionsbug.
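  A minimal sketch of the corrected computation (variable and method names approximate, not the exact Spark code):
  ```scala
  // The goal split size divides the combined input length by the requested minimum
  // number of partitions, not by the number of input files.
  def maxSplitSize(totalLen: Long, minPartitions: Int): Long =
    math.ceil(totalLen * 1.0 / math.max(minPartitions, 1)).toLong
  ```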
* [STREAMING][MINOR] More contextual information in logs + minor code improvements | Jacek Laskowski | 2016-01-07 | 14 | -74/+69
  Please review and merge at your convenience. Thanks!
  Author: Jacek Laskowski <jacek@japila.pl>
  Closes #10595 from jaceklaskowski/streaming-minor-fixes.
* [MINOR] Fix for BUILD FAILURE for Scala 2.11 | Jacek Laskowski | 2016-01-07 | 1 | -18/+1
  It was introduced in 917d3fc069fb9ea1c1487119c9c12b373f4f9b77.
  /cc cloud-fan rxin
  Author: Jacek Laskowski <jacek@japila.pl>
  Closes #10636 from jaceklaskowski/fix-for-build-failure-2.11.
* [SPARK-12662][SQL] Fix DataFrame.randomSplit to avoid creating overlapping splits | Sameer Agarwal | 2016-01-07 | 2 | -1/+28
  https://issues.apache.org/jira/browse/SPARK-12662
  cc yhuai
  Author: Sameer Agarwal <sameer@databricks.com>
  Closes #10626 from sameeragarwal/randomsplit.
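  A usage sketch, assuming an existing DataFrame `df` with distinct rows: after the fix the returned splits are disjoint even though the input has no deterministic row order.
  ```scala
  val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
  assert(train.intersect(test).count() == 0) // no row lands in both splits
  ```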
* [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None | zero323 | 2016-01-07 | 2 | -1/+13
  If the initial model passed to GMM is not empty, it causes net.razorvine.pickle.PickleException. It can be fixed by converting initialModel.weights to a list.
  Author: zero323 <matthew.szymkiewicz@gmail.com>
  Closes #10644 from zero323/SPARK-12006.
* [STREAMING][DOCS][EXAMPLES] Minor fixes | Jacek Laskowski | 2016-01-07 | 2 | -10/+8
  Author: Jacek Laskowski <jacek@japila.pl>
  Closes #10603 from jaceklaskowski/streaming-actor-custom-receiver.
* [SPARK-12542][SQL] Support except/intersect in HiveQL | Davies Liu | 2016-01-06 | 5 | -5/+65
  Parse SQL queries with except/intersect in the FROM clause for HiveQL.
  Author: Davies Liu <davies@databricks.com>
  Closes #10622 from davies/intersect.
* [SPARK-12295][SQL] External spilling for window functions | Davies Liu | 2016-01-06 | 6 | -94/+276
  This PR manages the memory used by window functions (buffered rows) and also enables external spilling. After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1 GB of memory.
  Author: Davies Liu <davies@databricks.com>
  Closes #10605 from davies/unsafe_window.
* [DOC] Fix 'spark.memory.offHeap.enabled' default value to false | zzcclp | 2016-01-06 | 1 | -1/+1
  Modify the documented default value of 'spark.memory.offHeap.enabled' to false.
  Author: zzcclp <xm_zzc@sina.com>
  Closes #10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.
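  A usage sketch reflecting the corrected default: off-heap memory stays disabled unless it is explicitly enabled and sized (key names as documented; the size value here is illustrative).
  ```scala
  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.memory.offHeap.enabled", "true") // must be opted into explicitly
    .set("spark.memory.offHeap.size", (2L * 1024 * 1024 * 1024).toString) // bytes; must be positive when enabled
  ```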
* Revert "[SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None" | Yin Huai | 2016-01-06 | 2 | -13/+1
  This reverts commit fcd013cf70e7890aa25a8fe3cb6c8b36bf0e1f04.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #10632 from yhuai/pythonStyle.
* [SPARK-12678][CORE] MapPartitionsRDD clearDependencies | Guillaume Poulin | 2016-01-06 | 1 | -1/+6
  MapPartitionsRDD was keeping a reference to `prev` after a call to `clearDependencies`, which could lead to a memory leak.
  Author: Guillaume Poulin <poulin.guillaume@gmail.com>
  Closes #10623 from gpoulin/map_partition_deps.
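  A simplified sketch of the pattern behind the fix (not the real RDD API): a wrapper that keeps its own handle on the parent must drop it when dependencies are cleared, otherwise the whole parent lineage stays reachable after checkpointing and can leak memory.
  ```scala
  class WrapperRDD[T](private var prev: Seq[T]) {
    def compute(): Iterator[T] = prev.iterator

    def clearDependencies(): Unit = {
      prev = null // release the parent reference so it can be garbage collected
    }
  }
  ```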
* [SPARK-12673][UI] Add missing URI prepending for job description | jerryshao | 2016-01-06 | 1 | -3/+3
  Otherwise the URL fails to proxy to the right one in YARN mode. Here is the screenshot:
  ![screen shot 2016-01-06 at 5 28 26 pm](https://cloud.githubusercontent.com/assets/850797/12139632/bbe78ecc-b49c-11e5-8932-94e8b3622a09.png)
  Author: jerryshao <sshao@hortonworks.com>
  Closes #10618 from jerryshao/SPARK-12673.
* [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0 | Josh Rosen | 2016-01-06 | 12 | -589/+48
  This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.
  Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard-to-diagnose bugs.
  For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
* [SPARK-12663][MLLIB] More informative error message in MLUtils.loadLibSVMFile | Robert Dodier | 2016-01-06 | 1 | -1/+2
  This PR contains one commit, which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663). For the record, I got a positive response from two people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html)
  Author: Robert Dodier <robert_dodier@users.sourceforge.net>
  Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.
* [SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks | Nong Li | 2016-01-06 | 2 | -0/+278
  We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this, and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do this.
  Author: Nong Li <nong@databricks.com>
  Author: Nong <nongli@gmail.com>
  Closes #10589 from nongli/spark-12640.
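  A hedged sketch of how such a utility is typically driven; the class and method names here are assumed for illustration and are not taken from the patch.
  ```scala
  val benchmark = new Benchmark("Parquet scan", valuesPerIteration = 1024 * 1024)
  benchmark.addCase("SQL Parquet reader") { _ =>
    // run one iteration of the scan being measured
  }
  benchmark.addCase("RDD-based reader") { _ =>
    // an alternative implementation for comparison
  }
  benchmark.run() // prints per-case timings and relative speed
  ```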
* [SPARK-12604][CORE] Java count(ApproxDistinct)ByKey methods return Scala Long not Java | Sean Owen | 2016-01-06 | 5 | -42/+49
  Change the Java countByKey and countApproxDistinctByKey return types to use Java Long, not Scala; update similar methods for consistency to use java.lang.Long.valueOf, with no API change.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #10554 from srowen/SPARK-12604.
* [SPARK-12539][SQL] Support writing bucketed table | Wenchen Fan | 2016-01-06 | 18 | -117/+626
  This PR adds bucket write support to Spark SQL. Users can specify bucketing columns, numBuckets and sorting columns, with or without partition columns. For example:
  ```
  df.write.partitionBy("year").bucketBy(8, "country").sortBy("amount").saveAsTable("sales")
  ```
  When bucketing is used, we will calculate a bucket id for each record and group the records by bucket id. For each group, we will create a file with the bucket id in its name, and write data into it. For each bucket file, if sorting columns are specified, the data will be sorted before write. Note that there may be multiple files for one bucket, as the data is distributed.
  Currently we store the bucket metadata in the Hive metastore in a non-Hive-compatible way. We use a different bucketing hash function compared to Hive, so we can't be compatible anyway.
  Limitations:
  * Can't write bucketed data without a Hive metastore.
  * Can't insert bucketed data into existing Hive tables.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10498 from cloud-fan/bucket-write.