aboutsummaryrefslogtreecommitdiff
path: root/project
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-529][SQL] Modify SQLConf to use new config API from core.Marcelo Vanzin2016-04-051-2/+16
| | | | | | | | | | | | Because SQL keeps track of all known configs, some customization was needed in SQLConf to allow that, since the core API does not have that feature. Tested via existing (and slightly updated) unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11570 from vanzin/SPARK-529-sql.
* [SPARK-14366] Remove sbt-idea pluginLuciano Resende2016-04-041-2/+0
| | | | | | | | | | ## What changes were proposed in this pull request? Remove sbt-idea plugin as importing sbt project provides much better support. Author: Luciano Resende <lresende@apache.org> Closes #12151 from lresende/SPARK-14366.
* [SPARK-13579][BUILD] Stop building the main Spark assembly.Marcelo Vanzin2016-04-041-23/+22
| | | | | | | | | | | | | | | | | | | | This change modifies the "assembly/" module to just copy needed dependencies to its build directory, and modifies the packaging script to pick those up (and remove duplicate jars packages in the examples module). I also made some minor adjustments to dependencies to remove some test jars from the final packaging, and remove jars that conflict with each other when packaged separately (e.g. servlet api). Also note that this change restores guava in applications' classpaths, even though it's still shaded inside Spark. This is now needed for the Hadoop libraries that are packaged with Spark, which now are not processed by the shade plugin. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11796 from vanzin/SPARK-13579.
* [SPARK-14358] Change SparkListener from a trait to an abstract classReynold Xin2016-04-041-1/+10
| | | | | | | | | | | | ## What changes were proposed in this pull request? Scala traits are difficult to maintain binary compatibility on, and as a result we had to introduce JavaSparkListener. In Spark 2.0 we can change SparkListener from a trait to an abstract class and then remove JavaSparkListener. ## How was this patch tested? Updated related unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12142 from rxin/SPARK-14358.
* [SPARK-13674] [SQL] Add wholestage codegen support to SampleLiang-Chi Hsieh2016-04-011-0/+4
| | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-13674 ## What changes were proposed in this pull request? Sample operator doesn't support wholestage codegen now. This pr is to add support to it. ## How was this patch tested? A test is added into `BenchmarkWholeStageCodegen`. Besides, all tests should be passed. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11517 from viirya/add-wholestage-sample.
* [SPARK-11262][ML] Unit test for gradient, loss layers, memory management for ↵Alexander Ulanov2016-03-311-0/+5
| | | | | | | | | | | | | | | | | | | | | multilayer perceptron 1.Implement LossFunction trait and implement squared error and cross entropy loss with it 2.Implement unit test for gradient and loss 3.Implement InPlace trait and in-place layer evaluation 4.Refactor interface for ActivationFunction 5.Update of Layer and LayerModel interfaces 6.Fix random weights assignment 7.Implement memory allocation by MLP model instead of individual layers These features decreased the memory usage and increased flexibility of internal API. Author: Alexander Ulanov <nashb@yandex.ru> Author: avulanov <avulanov@gmail.com> Closes #9229 from avulanov/mlp-refactoring.
* [SPARK-14211][SQL] Remove ANTLR3 based parserHerman van Hovell2016-03-312-55/+2
| | | | | | | | | | | | | | | | ### What changes were proposed in this pull request? This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`. ### How was this patch tested? Existing unit tests. cc rxin andrewor14 yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12071 from hvanhovell/SPARK-14211.
* [SPARK-14205][SQL] remove trait QueryableWenchen Fan2016-03-281-0/+3
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? After DataFrame and Dataset are merged, the trait `Queryable` becomes unnecessary as it has only one implementation. We should remove it. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12001 from cloud-fan/df-ds.
* [SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4Herman van Hovell2016-03-282-2/+12
| | | | | | | | | | | | | | | | | | | | | | ### What changes were proposed in this pull request? The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4. This parser is based on the [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality is currently missing, the plan is to add this in follow-up PRs. This PR is a work in progress, and work needs to be done in the following area's: - [x] Error handling should be improved. - [x] Documentation should be improved. - [x] Multi-Insert needs to be tested. - [ ] Naming and package locations. ### How was this patch tested? Catalyst and SQL unit tests. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #11557 from hvanhovell/ngParser.
* [SPARK-14089][CORE][MLLIB] Remove methods that has been deprecated since ↵Liwei Lin2016-03-261-0/+13
| | | | | | | | | | | | | | | | | | 1.1, 1.2, 1.3, 1.4, and 1.5 ## What changes were proposed in this pull request? Removed methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5. ## How was this patch tested? - manully checked that no codes in Spark call these methods any more - existing test suits Author: Liwei Lin <lwlin7@gmail.com> Author: proflin <proflin.me@gmail.com> Closes #11910 from lw-lin/remove-deprecates.
* [SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq, ↵Shixiong Zhu2016-03-261-6/+1
| | | | | | | | | | | | | | | | | | streaming-mqtt and streaming-twitter ## What changes were proposed in this pull request? This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages Also remove mqtt_wordcount.py that I forgot to remove previously. ## How was this patch tested? Jenkins PR Build. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11824 from zsxwing/remove-doc.
* [SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to SparkShixiong Zhu2016-03-251-8/+14
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR moves flume back to Spark as per the discussion in the dev mail-list. ## How was this patch tested? Existing Jenkins tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11895 from zsxwing/move-flume-back.
* [SPARK-14014][SQL] Integrate session catalog (attempt #2)Andrew Or2016-03-241-0/+3
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests. ## How was this patch tested? See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`. Author: Andrew Or <andrew@databricks.com> Closes #11938 from andrewor14/session-catalog-again.
* Revert "[SPARK-14014][SQL] Replace existing catalog with SessionCatalog"Andrew Or2016-03-231-3/+0
| | | | This reverts commit 5dfc01976bb0d72489620b4f32cc12d620bb6260.
* [SPARK-14014][SQL] Replace existing catalog with SessionCatalogAndrew Or2016-03-231-0/+3
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? `SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`. As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely: - SPARK-14013: Properly implement temporary functions in `SessionCatalog` - SPARK-13879: Decide which DDL/DML commands to support natively in Spark - SPARK-?????: Implement the ones we do want to support through `SessionCatalog`. - SPARK-?????: Merge SQL/HiveContext ## How was this patch tested? This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`. Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #11836 from andrewor14/use-session-catalog.
* [SPARK-14088][SQL] Some Dataset API touch-upReynold Xin2016-03-221-0/+1
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? 1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL. 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups. 3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users. 4. Remove "subtract" function since it is just an alias for "except". ## How was this patch tested? All changes should be covered by existing tests. Also added couple test cases to cover "name". Author: Reynold Xin <rxin@databricks.com> Closes #11908 from rxin/SPARK-14088.
* [SPARK-13990] Automatically pick serializer when caching RDDsJosh Rosen2016-03-211-0/+3
| | | | | | | | | | | | Building on the `SerializerManager` introduced in SPARK-13926/ #11755, this patch Spark modifies Spark's BlockManager to use RDD's ClassTags in order to select the best serializer to use when caching RDD blocks. When storing a local block, the BlockManager `put()` methods use implicits to record ClassTags and stores those tags in the blocks' BlockInfo records. When reading a local block, the stored ClassTag is used to pick the appropriate serializer. When a block is stored with replication, the class tag is written into the block transfer metadata and will also be stored in the remote BlockManager. There are two or three places where we don't properly pass ClassTags, including TorrentBroadcast and BlockRDD. I think this happens to work because the missing ClassTag always happens to be `ClassTag.Any`, but it might be worth looking more carefully at those places to see whether we should be more explicit. Author: Josh Rosen <joshrosen@databricks.com> Closes #11801 from JoshRosen/pick-best-serializer-for-caching.
* [SPARK-13898][SQL] Merge DatasetHolder and DataFrameHolderReynold Xin2016-03-211-6/+7
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch merges DatasetHolder and DataFrameHolder. This makes more sense because DataFrame/Dataset are now one class. In addition, fixed some minor issues with pull request #11732. ## How was this patch tested? Updated existing unit tests that test these implicits. Author: Reynold Xin <rxin@databricks.com> Closes #11737 from rxin/SPARK-13898.
* [SPARK-13897][SQL] RelationalGroupedDataset and KeyValueGroupedDatasetReynold Xin2016-03-191-0/+1
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Previously, Dataset.groupBy returns a GroupedData, and Dataset.groupByKey returns a GroupedDataset. The naming is very similar, and unfortunately does not convey the real differences between the two. Assume we are grouping by some keys (K). groupByKey is a key-value style group by, in which the schema of the returned dataset is a tuple of just two fields: key and value. groupBy, on the other hand, is a relational style group by, in which the schema of the returned dataset is flattened and contain |K| + |V| fields. This pull request also removes the experimental tag from RelationalGroupedDataset. It has been with DataFrame since 1.3, and we have enough confidence now to stabilize it. ## How was this patch tested? This is a rename to improve API understandability. Should be covered by all existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #11841 from rxin/SPARK-13897.
* [SPARK-13928] Move org.apache.spark.Logging into ↵Wenchen Fan2016-03-171-0/+8
| | | | | | | | | | | | | | | | org.apache.spark.internal.Logging ## What changes were proposed in this pull request? Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11764 from cloud-fan/logger.
* [SPARK-13948] MiMa check should catch if the visibility changes to privateJosh Rosen2016-03-163-15/+35
| | | | | | | | | | MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense: we should only be ignoring classes which were `private` in the previous Spark version, not classes which became private in the current version. This patch updates `dev/mima` to only generate excludes with respect to the previous artifacts that MiMa checks against. It also updates `MimaBuild` so that `excludeClass` only applies directly to the class being excluded and not to its companion object (since a class and its companion object can have different accessibility). Author: Josh Rosen <joshrosen@databricks.com> Closes #11774 from JoshRosen/SPARK-13948.
* [SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with ↵Josh Rosen2016-03-161-0/+5
| | | | | | | | | | | | | | simple types Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo. This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers. In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places). Author: Josh Rosen <joshrosen@databricks.com> Closes #11755 from JoshRosen/automatically-pick-best-serializer.
* [MINOR][SQL][BUILD] Remove duplicated linesDongjoon Hyun2016-03-161-2/+0
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR removes three minor duplicated lines. First one is making the following unreachable code warning. ``` JoinSuite.scala:52: unreachable code [warn] case j: BroadcastHashJoin => j ``` The other two are just consecutive repetitions in `Seq` of MiMa filters. ## How was this patch tested? Pass the existing Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11773 from dongjoon-hyun/remove_duplicated_line.
* [SPARK-11011][SQL] Narrow type of UDT serializationJakob Odersky2016-03-161-0/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type. ## How was this patch tested? Existing tests were successfully run on local machine. Author: Jakob Odersky <jakob@odersky.com> Closes #11379 from jodersky/SPARK-11011-udt-types.
* [SPARK-13927][MLLIB] add row/column iterator to local matricesXiangrui Meng2016-03-161-0/+4
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly. ## How was this patch tested? Unit tests on sparse and dense matrix. cc: dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #11757 from mengxr/SPARK-13927.
* [SPARK-13920][BUILD] MIMA checks should apply to @Experimental and ↵Dongjoon Hyun2016-03-151-2/+211
| | | | | | | | | | | | | | | | @DeveloperAPI APIs ## What changes were proposed in this pull request? We are able to change `Experimental` and `DeveloperAPI` API freely but also should monitor and manage those API carefully. This PR for [SPARK-13920](https://issues.apache.org/jira/browse/SPARK-13920) enables MiMa check and adds filters for them. ## How was this patch tested? Pass the Jenkins tests (including MiMa). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11751 from dongjoon-hyun/SPARK-13920.
* [SPARK-13576][BUILD] Don't create assembly for examples.Marcelo Vanzin2016-03-151-3/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As part of the goal to stop creating assemblies in Spark, this change modifies the mvn and sbt builds to not create an assembly for examples. Instead, dependencies are copied to the build directory (under target/scala-xx/jars), and in the final archive, into the "examples/jars" directory. To avoid having to deal too much with Windows batch files, I made examples run through the launcher library; the spark-submit launcher now has a special mode to run examples, which adds all the necessary jars to the spark-submit command line, and replaces the bash and batch scripts that were used to run examples. The scripts are now just a thin wrapper around spark-submit; another advantage is that now all spark-submit options are supported. There are a few glitches; in the mvn build, a lot of duplicated dependencies get copied, because they are promoted to "compile" scope due to extra dependencies in the examples module (such as HBase). In the sbt build, all dependencies are copied, because there doesn't seem to be an easy way to filter things. I plan to clean some of this up when the rest of the tasks are finished. When the main assembly is replaced with jars, we can remove duplicate jars from the examples directory during packaging. Tested by running SparkPi in: maven build, sbt build, dist created by make-distribution.sh. Finally: note that running the "assembly" target in sbt doesn't build the examples anymore. You need to run "package" for that. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11452 from vanzin/SPARK-13576.
* [SPARK-13880][SPARK-13881][SQL] Rename DataFrame.scala Dataset.scala, and ↵Reynold Xin2016-03-151-0/+1
| | | | | | | | | | | | | | | remove LegacyFunctions ## What changes were proposed in this pull request? 1. Rename DataFrame.scala Dataset.scala, since the class is now named Dataset. 2. Remove LegacyFunctions. It was introduced in Spark 1.6 for backward compatibility, and can be removed in Spark 2.0. ## How was this patch tested? Should be covered by existing unit/integration tests. Author: Reynold Xin <rxin@databricks.com> Closes #11704 from rxin/SPARK-13880.
* [SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, ↵Shixiong Zhu2016-03-141-28/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | streaming-zeromq, streaming-akka, streaming-twitter to Spark packages ## What changes were proposed in this pull request? Currently there are a few sub-projects, each for integrating with different external sources for Streaming. Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages - streaming-flume - streaming-akka - streaming-mqtt - streaming-zeromq - streaming-twitter They are just some ancillary packages and considering the overhead of maintenance, running tests and PR failures, it's better to maintain them out of Spark. In addition, these projects can have their different release cycles and we can release them faster. I have already copied these projects to https://github.com/spark-packages ## How was this patch tested? Jenkins tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #11672 from zsxwing/remove-external-pkg.
* [SPARK-13686][MLLIB][STREAMING] Add a constructor parameter `reqParam` to ↵Dongjoon Hyun2016-03-141-0/+3
| | | | | | | | | | | | | | | | | (Streaming)LinearRegressionWithSGD ## What changes were proposed in this pull request? `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` does not have `regParam` as their constructor arguments. They just depends on GradientDescent's default reqParam values. To be consistent with other algorithms, we had better add them. The same default value is used. ## How was this patch tested? Pass the existing unit test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11527 from dongjoon-hyun/SPARK-13686.
* [MINOR][DOCS] Fix more typos in comments/strings.Dongjoon Hyun2016-03-141-3/+3
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes 135 typos over 107 files: * 121 typos in comments * 11 typos in testcase name * 3 typos in log messages ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11689 from dongjoon-hyun/fix_more_typos.
* [SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x.Dongjoon Hyun2016-03-134-9/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? For 2.0.0, we had better make **sbt** and **sbt plugins** up-to-date. This PR checks the status of each plugins and bumps the followings. * sbt: 0.13.9 --> 0.13.11 * sbteclipse-plugin: 2.2.0 --> 4.0.0 * sbt-dependency-graph: 0.7.4 --> 0.8.2 * sbt-mima-plugin: 0.1.6 --> 0.1.9 * sbt-revolver: 0.7.2 --> 0.8.0 All other plugins are up-to-date. (Note that `sbt-avro` seems to be change from 0.3.2 to 1.0.1, but it's not published in the repository.) During upgrade, this PR also updated the following MiMa error. Note that the related excluding filter is already registered correctly. It seems due to the change of MiMa exception result. ``` // SPARK-12896 Send only accumulator updates to driver, not TaskMetrics ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulable.this"), -ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulator.this"), +ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.Accumulator.this"), ``` ## How was this patch tested? Pass the Jenkins build. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11669 from dongjoon-hyun/update_mima.
* [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrameCheng Lian2016-03-111-0/+22
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? PR #11443 temporarily disabled MiMA check, this PR re-enables it. One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API changes. ## How was this patch tested? Tested by MiMA check triggered by Jenkins. Author: Cheng Lian <lian@databricks.com> Closes #11656 from liancheng/re-enable-mima.
* [SPARK-13294][PROJECT INFRA] Remove MiMa's dependency on spark-class / Spark ↵Josh Rosen2016-03-101-9/+10
| | | | | | | | | | | | | | | assembly This patch removes the need to build a full Spark assembly before running the `dev/mima` script. - I modified the `tools` project to remove a direct dependency on Spark, so `sbt/sbt tools/fullClasspath` will now return the classpath for the `GenerateMIMAIgnore` class itself plus its own dependencies. - This required me to delete two classes full of dead code that we don't use anymore - `GenerateMIMAIgnore` now uses [ClassUtil](http://software.clapper.org/classutil/) to find all of the Spark classes rather than our homemade JAR traversal code. The problem in our own code was that it didn't handle folders of classes properly, which is necessary in order to generate excludes with an assembly-free Spark build. - `./dev/mima` no longer runs through `spark-class`, eliminating the need to reason about classpath ordering between `SPARK_CLASSPATH` and the assembly. Author: Josh Rosen <joshrosen@databricks.com> Closes #11178 from JoshRosen/remove-assembly-in-run-tests.
* [SPARK-13665][SQL] Separate the concerns of HadoopFsRelationMichael Armbrust2016-03-071-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | `HadoopFsRelation` is used for reading most files into Spark SQL. However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, svmlib) have been ported to the new API `FileFormat`. External libraries, such as spark-avro will also need to be ported to work with Spark 2.0. ### HadoopFsRelation A simple `case class` that acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed. This an internal representation that no longer needs to be exposed to developers. ```scala case class HadoopFsRelation( sqlContext: SQLContext, location: FileCatalog, partitionSchema: StructType, dataSchema: StructType, bucketSpec: Option[BucketSpec], fileFormat: FileFormat, options: Map[String, String]) extends BaseRelation ``` ### FileFormat The primary interface that will be implemented by each different format including external libraries. Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`. A format can optionally return a schema that is inferred from a set of files. ```scala trait FileFormat { def inferSchema( sqlContext: SQLContext, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] def prepareWrite( sqlContext: SQLContext, job: Job, options: Map[String, String], dataSchema: StructType): OutputWriterFactory def buildInternalScan( sqlContext: SQLContext, dataSchema: StructType, requiredColumns: Array[String], filters: Array[Filter], bucketSet: Option[BitSet], inputFiles: Array[FileStatus], broadcastedConf: Broadcast[SerializableConfiguration], options: Map[String, String]): RDD[InternalRow] } ``` The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner). Additionally, scans are still returning `RDD`s instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file. ### FileCatalog This interface is used to list the files that make up a given relation, as well as handle directory based partitioning. ```scala trait FileCatalog { def paths: Seq[Path] def partitionSpec(schema: Option[StructType]): PartitionSpec def allFiles(): Seq[FileStatus] def getStatus(path: Path): Array[FileStatus] def refresh(): Unit } ``` Currently there are two implementations: - `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`. Infers partitioning by recursive listing and caches this data for performance - `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore. ### ResolvedDataSource Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore): - `paths: Seq[String] = Nil` - `userSpecifiedSchema: Option[StructType] = None` - `partitionColumns: Array[String] = Array.empty` - `bucketSpec: Option[BucketSpec] = None` - `provider: String` - `options: Map[String, String]` This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones). All reconciliation of partitions, buckets, schema from metastores or inference is done here. ### DataSourceAnalysis / DataSourceStrategy Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including: - pruning the files from partitions that will be read based on filters. - appending partition columns* - applying additional filters when a data source can not evaluate them internally. - constructing an RDD that is bucketed correctly when required* - sanity checking schema match-up and other analysis when writing. *In the future we should do that following: - Break out file handling into its own Strategy as its sufficiently complex / isolated. - Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization. - Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2` Author: Michael Armbrust <michael@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Closes #11509 from marmbrus/fileDataSource.
* [SPARK-12073][STREAMING] backpressure rate controller consumes events ↵Jason White2016-03-041-0/+4
| | | | | | | | | | | | | | | | | | | | | | preferentially from lagg… …ing partitions I'm pretty sure this is the reason we couldn't easily recover from an unbalanced Kafka partition under heavy load when using backpressure. `maxMessagesPerPartition` calculates an appropriate limit for the message rate from all partitions, and then divides by the number of partitions to determine how many messages to retrieve per partition. The problem with this approach is that when one partition is behind by millions of records (due to random Kafka issues), but the rate estimator calculates only 100k total messages can be retrieved, each partition (out of say 32) only retrieves max 100k/32=3125 messages. This PR (still needing a test) determines a per-partition desired message count by using the current lag for each partition to preferentially weight the total message limit among the partitions. In this situation, if each partition gets 1k messages, but 1 partition starts 1M behind, then the total number of messages to retrieve is (32 * 1k + 1M) = 1032000 messages, of which the one partition needs 1001000. So, it gets (1001000 / 1032000) = 97% of the 100k messages, and the other 31 partitions share the remaining 3%. Assuming all of 100k the messages are retrieved and processed within the batch window, the rate calculator will increase the number of messages to retrieve in the next batch, until it reaches a new stable point or the backlog is finished processed. We're going to try deploying this internally at Shopify to see if this resolves our issue. tdas koeninger holdenk Author: Jason White <jason.white@shopify.com> Closes #10089 from JasonMWhite/rate_controller_offsets.
* [MINOR] Fix typos in comments and testcase name of codeDongjoon Hyun2016-03-031-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes typos in comments and testcase name of code. ## How was this patch tested? manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
* [SPARK-13526][SQL] Move SQLContext per-session states to new classAndrew Or2016-02-271-1/+6
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This creates a `SessionState`, which groups a few fields that existed in `SQLContext`. Because `HiveContext` extends `SQLContext` we also need to make changes there. This is mainly a cleanup task that will soon pave the way for merging the two contexts. ## How was this patch tested? Existing unit tests; this patch introduces no change in behavior. Author: Andrew Or <andrew@databricks.com> Closes #11405 from andrewor14/refactor-session.
* [SPARK-13465] Add a task failure listener to TaskContextReynold Xin2016-02-261-1/+3
| | | | | | | | | | | | | ## What changes were proposed in this pull request? TaskContext supports task completion callback, which gets called regardless of task failures. However, there is no way for the listener to know if there is an error. This patch adds a new listener that gets called when a task fails. ## How was the this patch tested? New unit test case and integration test case covering the code path Author: Reynold Xin <rxin@databricks.com> Closes #11340 from rxin/SPARK-13465.
* [SPARK-13486][SQL] Move SQLConf into an internal packageReynold Xin2016-02-251-0/+6
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch moves SQLConf into org.apache.spark.sql.internal package to make it very explicit that it is internal. Soon I will also submit more API work that creates implementations of interfaces in this internal package. ## How was this patch tested? If it compiles, then the refactoring should work. Author: Reynold Xin <rxin@databricks.com> Closes #11363 from rxin/SPARK-13486.
* [SPARK-7729][UI] Executor which has been killed should also be displayed on ↵Lianhui Wang2016-02-231-0/+3
| | | | | | | | | | | | | | | Executor Tab andrewor14 squito Dead Executors should also be displayed on Executor Tab. as following: ![image](https://cloud.githubusercontent.com/assets/545478/11492707/ae55d7f6-982b-11e5-919a-b62cd84684b2.png) Author: Lianhui Wang <lianhuiwang09@gmail.com> This patch had conflicts when merged, resolved by Committer: Andrew Or <andrew@databricks.com> Closes #10058 from lianhuiwang/SPARK-7729.
* [SPARK-13220][CORE] deprecate yarn-client and yarn-cluster modejerryshao2016-02-231-2/+6
| | | | | | Author: jerryshao <sshao@hortonworks.com> Closes #11229 from jerryshao/SPARK-13220.
* [SPARK-13413] Remove SparkContext.metricsSystemReynold Xin2016-02-221-1/+5
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch removes SparkContext.metricsSystem. SparkContext.metricsSystem returns MetricsSystem, which is a private class. I think it was added by accident. In addition, I also removed an unused private[spark] method schedulerBackend setter. ## How was the this patch tested? N/A. Author: Reynold Xin <rxin@databricks.com> This patch had conflicts when merged, resolved by Committer: Josh Rosen <joshrosen@databricks.com> Closes #11282 from rxin/SPARK-13413.
* [SPARK-13426][CORE] Remove the support of SIMRjerryshao2016-02-221-0/+3
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR removes the support of SIMR, since SIMR is not actively used and maintained for a long time, also is not supported from `SparkSubmit`, so here propose to remove it. ## How was the this patch tested? This patch is tested locally by running unit tests. Author: jerryshao <sshao@hortonworks.com> Closes #11296 from jerryshao/SPARK-13426.
* [SPARK-13248][STREAMING] Remove deprecated Streaming APIs.Luciano Resende2016-02-211-0/+13
| | | | | | | | Remove deprecated Streaming APIs and adjust sample applications. Author: Luciano Resende <lresende@apache.org> Closes #11139 from lresende/streaming-deprecated-apis.
* [SPARK-13109][BUILD] Fix SBT publishLocal issuejerryshao2016-02-171-1/+2
| | | | | | | | | | Add local ivy repo to the SBT build file to fix this. Scaladoc compile error is fixed. Author: jerryshao <sshao@hortonworks.com> Closes #11001 from jerryshao/SPARK-13109.
* [SPARK-12995][GRAPHX] Remove deprecate APIs from PregelTakeshi YAMAMURO2016-02-151-0/+6
| | | | | | Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10918 from maropu/RemoveDeprecateInPregel.
* [SPARK-13278][CORE] Launcher fails to start with JDK 9 EAClaes Redestad2016-02-141-2/+4
| | | | | | | | See http://openjdk.java.net/jeps/223 for more information about the JDK 9 version string scheme. Author: Claes Redestad <claes.redestad@gmail.com> Closes #11160 from cl4es/master.
* [SPARK-13296][SQL] Move UserDefinedFunction into sql.expressions.Reynold Xin2016-02-131-1/+7
| | | | | | | | | | | | | | | | This pull request has the following changes: 1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs. 2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package. 3. Move everything in execution/python.scala into the newly created execution.python package. Most of the diffs are just straight copy-paste. Author: Reynold Xin <rxin@databricks.com> Closes #11181 from rxin/SPARK-13296.
* [SPARK-7889][WEBUI] HistoryServer updates UI for incomplete appsSteve Loughran2016-02-111-0/+3
| | | | | | | | | | | When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available. It does this by checking if a version of the app has been loaded with a larger *filesize*. If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI. https://issues.apache.org/jira/browse/SPARK-7889 Author: Steve Loughran <stevel@hortonworks.com> Author: Imran Rashid <irashid@cloudera.com> Closes #11118 from squito/SPARK-7889-alternate.