* [SPARK-14994][SQL] Remove execution hive from HiveSessionState
  Reynold Xin, 2016-04-29 (20 files changed, -309/+327)

  ## What changes were proposed in this pull request?
  This patch removes executionHive from HiveSessionState and HiveSharedState.

  ## How was this patch tested?
  Updated test cases.

  Author: Reynold Xin <rxin@databricks.com>
  Author: Yin Huai <yhuai@databricks.com>
  Closes #12770 from rxin/SPARK-14994.
* [SPARK-14996][SQL] Add TPCDS Benchmark Queries for SparkSQL
  Sameer Agarwal, 2016-04-29 (1 file changed, -0/+1225)

  ## What changes were proposed in this pull request?
  This PR adds support for easily running and benchmarking a set of common TPCDS queries locally in SparkSQL.

  ## How was this patch tested?
  N/A

  Author: Sameer Agarwal <sameer@databricks.com>
  Closes #12771 from sameeragarwal/tpcds-2.
* [SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join
  gatorsmile, 2016-04-29 (12 files changed, -111/+132)

  #### What changes were proposed in this pull request?
  Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
  ```SQL
  SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
  ==>
  SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
  ```
  Note:
  1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
  2. This rule has to be done after de-duplicating the attributes; otherwise, the generated join conditions will be incorrect.

  This PR also corrects the existing behavior in Spark. Before this PR, the behavior was as follows:
  ```scala
  test("except") {
    val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
    val df_right = Seq(1, 3).toDF("id")

    checkAnswer(
      df_left.except(df_right),
      Row(2) :: Row(2) :: Row(4) :: Nil
    )
  }
  ```
  After this PR, the result is corrected. We strictly follow the SQL compliance of `Except Distinct`.

  #### How was this patch tested?
  Modified and added a few test cases to verify the optimization rule and the results of operators.

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #12736 from gatorsmile/exceptByAntiJoin.
* [HOTFIX] Disable flaky test StatisticsSuite.analyze MetastoreRelations
  Reynold Xin, 2016-04-29 (1 file changed, -1/+2)
* [SPARK-14886][MLLIB] RankingMetrics.ndcgAt throws java.lang.ArrayIndexOutOfBoundsException
  Sean Owen, 2016-04-29 (2 files changed, -6/+22)

  ## What changes were proposed in this pull request?
  Handle the case where the number of predictions is less than the label set size or k in the nDCG computation.

  ## How was this patch tested?
  New unit test; existing tests

  Author: Sean Owen <sowen@cloudera.com>
  Closes #12756 from srowen/SPARK-14886.
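A minimal standalone sketch of the bounds guard described in the entry above (illustrative only, not the MLlib source): the nDCG loop must not index past the end of the prediction array when it is shorter than k or than the label set.

```scala
// Hypothetical helper; clamps the loop bound and guards each array access.
def ndcgAt(pred: Array[Int], labels: Set[Int], k: Int): Double = {
  require(k > 0, "ranking position k must be positive")
  val labSetSize = labels.size
  if (labSetSize == 0) return 0.0
  val n = math.min(math.max(pred.length, labSetSize), k)   // never exceeds k
  var dcg = 0.0
  var idealDcg = 0.0
  var i = 0
  while (i < n) {
    val gain = 1.0 / math.log(i + 2)                        // log base cancels in the ratio
    if (i < pred.length && labels.contains(pred(i))) dcg += gain
    if (i < labSetSize) idealDcg += gain
    i += 1
  }
  dcg / idealDcg
}
```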
* [HOTFIX] Disable flaky LauncherServerSuite.testCommunication
  Reynold Xin, 2016-04-29 (1 file changed, -1/+2)
* [MINOR][DOC] Minor typo fixes
  Zheng RuiFeng, 2016-04-28 (1 file changed, -15/+15)

  ## What changes were proposed in this pull request?
  Minor typo fixes

  ## How was this patch tested?
  local build

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12755 from zhengruifeng/fix_doc_dataset.
* [SPARK-14829][MLLIB] Deprecate GLM APIs using SGD
  Zheng RuiFeng, 2016-04-28 (6 files changed, -0/+37)

  ## What changes were proposed in this pull request?
  According to [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829), deprecate the SGD-based APIs of LogisticRegression and LinearRegression.

  ## How was this patch tested?
  manual tests

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12596 from zhengruifeng/deprecate_sgd.
* [SPARK-7264][ML] Parallel lapply for sparkR
  Timothy Hunter, 2016-04-28 (3 files changed, -0/+49)

  ## What changes were proposed in this pull request?
  This PR adds a new function in SparkR called `sparkLapply(list, function)`. This function implements a distributed version of `lapply` using Spark as a backend.

  TODO:
  - [x] check documentation
  - [ ] check tests

  Trivial example in SparkR:
  ```R
  sparkLapply(1:5, function(x) { 2 * x })
  ```
  Output:
  ```
  [[1]]
  [1] 2

  [[2]]
  [1] 4

  [[3]]
  [1] 6

  [[4]]
  [1] 8

  [[5]]
  [1] 10
  ```
  Here is a slightly more complex example to perform distributed training of multiple models. Under the hood, Spark broadcasts the dataset.
  ```R
  library("MASS")
  data(menarche)
  families <- c("gaussian", "poisson")
  train <- function(family){glm(Menarche ~ Age , family=family, data=menarche)}
  results <- sparkLapply(families, train)
  ```

  ## How was this patch tested?
  This PR was tested in SparkR. I am unfamiliar with R and SparkR, so any feedback on style, testing, etc. will be much appreciated.

  cc falaki davies

  Author: Timothy Hunter <timhunter@databricks.com>
  Closes #12426 from thunterdb/7264.
* [SPARK-14991][SQL] Remove HiveNativeCommand
  Reynold Xin, 2016-04-28 (15 files changed, -283/+51)

  ## What changes were proposed in this pull request?
  This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result files using Hive.

  ## How was this patch tested?
  Updated tests to reflect this.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12769 from rxin/SPARK-14991.
* [HOTFIX][CORE] fix a concurrency issue in NewAccumulator
  Wenchen Fan, 2016-04-28 (4 files changed, -8/+14)

  ## What changes were proposed in this pull request?
  `AccumulatorContext` is not thread-safe; that's why all of its methods are synchronized. However, there is one exception: `AccumulatorContext.originals`. `NewAccumulator` uses it to check if it's registered, which is wrong as that access is not synchronized. This PR marks `AccumulatorContext.originals` as `private`, and now all access to `AccumulatorContext` is synchronized.

  ## How was this patch tested?
  I verified it locally. To be safe, we can let jenkins test it many times to make sure this problem is gone.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #12773 from cloud-fan/debug.
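The pattern the hotfix enforces, in a self-contained sketch (illustrative names, not Spark's internals): keep the backing map private so every read and write goes through a synchronized method, rather than letting callers check the map without holding the lock.

```scala
object RegistrySketch {
  // No longer exposed to callers, so unsynchronized reads are impossible.
  private val originals = new java.util.HashMap[Long, AnyRef]

  def register(id: Long, value: AnyRef): Unit = synchronized {
    originals.put(id, value)
  }

  // The kind of "am I registered?" check that previously bypassed the lock.
  def isRegistered(id: Long): Boolean = synchronized {
    originals.containsKey(id)
  }

  def remove(id: Long): Unit = synchronized {
    originals.remove(id)
  }
}
```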
* Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in ↵Yin Huai2016-04-2836-161/+51
| | | | | | spark-mllib-local" This reverts commit dae538a4d7c36191c1feb02ba87ffc624ab960dc.
* [SPARK-14836][YARN] Zip all the jars before uploading to distributed cache
  jerryshao, 2016-04-28 (2 files changed, -9/+23)

  ## What changes were proposed in this pull request?
  (Copied from JIRA.) Currently, if neither `spark.yarn.jars` nor `spark.yarn.archive` is set (the default), Spark on YARN uploads all the jars in the folder into the distributed cache one by one, which is quite time consuming and very verbose. Instead of uploading the jars separately, this change zips all the jars first and then puts the single archive into the distributed cache. This significantly improves startup time.

  ## How was this patch tested?
  Unit test and local integrated test are done. Verified with SparkPi in both spark cluster and client mode.

  Author: jerryshao <sshao@hortonworks.com>
  Closes #12597 from jerryshao/SPARK-14836.
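A rough sketch, using plain JDK APIs, of the "zip the jars directory once, then upload the single archive" idea described above (illustrative only; the actual change lives in Spark's YARN client code):

```scala
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}

// Bundle every .jar in a directory into one zip archive.
def zipJars(jarsDir: File, outFile: File): Unit = {
  val jars = Option(jarsDir.listFiles()).getOrElse(Array.empty[File])
  val zipOut = new ZipOutputStream(new FileOutputStream(outFile))
  try {
    for (jar <- jars if jar.getName.endsWith(".jar")) {
      zipOut.putNextEntry(new ZipEntry(jar.getName))
      val in = new FileInputStream(jar)
      try {
        val buf = new Array[Byte](64 * 1024)
        var n = in.read(buf)
        while (n != -1) { zipOut.write(buf, 0, n); n = in.read(buf) }
      } finally in.close()
      zipOut.closeEntry()
    }
  } finally zipOut.close()
}
```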
* [SPARK-14862][ML] Updated Classifiers to not require labelCol metadata
  Joseph K. Bradley, 2016-04-28 (8 files changed, -31/+245)

  ## What changes were proposed in this pull request?
  Updated Classifier, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier to not require input column metadata.
  * They first check for metadata.
  * If numClasses is not specified in metadata, they identify the largest label value (up to a limit). This functionality is implemented in a new Classifier.getNumClasses method.

  Also:
  * Updated Classifier.extractLabeledPoints to (a) check label values and (b) include a second version which takes a numClasses value for validity checking.

  ## How was this patch tested?
  * Unit tests in ClassifierSuite for helper methods
  * Unit tests for DecisionTreeClassifier, RandomForestClassifier, GBTClassifier with toy datasets lacking label metadata

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12663 from jkbradley/trees-no-metadata.
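A sketch of the metadata fallback described above (illustrative; the cap and names are assumptions, not the exact ML implementation): when the label column carries no numClasses metadata, infer it from the largest label value, up to a limit.

```scala
def inferNumClasses(labels: Seq[Double], maxNumClasses: Int = 100): Int = {
  require(labels.nonEmpty, "cannot infer the number of classes from an empty dataset")
  val maxLabel = labels.max
  require(maxLabel == maxLabel.toInt && maxLabel >= 0,
    s"labels must be non-negative integers, got $maxLabel")
  val numClasses = maxLabel.toInt + 1
  require(numClasses <= maxNumClasses,
    s"inferred $numClasses classes, above the cap of $maxNumClasses; specify label metadata instead")
  numClasses
}
```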
* [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local
  Pravin Gadakh, 2016-04-28 (36 files changed, -51/+161)

  ## What changes were proposed in this pull request?
  This PR adds the `since` tag to the matrix and vector classes in spark-mllib-local.

  ## How was this patch tested?
  Scala-style checks passed.

  Author: Pravin Gadakh <prgadakh@in.ibm.com>
  Closes #12416 from pravingadakh/SPARK-14613.
* [SPARK-14555] Second cut of Python API for Structured Streaming
  Burak Yavuz, 2016-04-28 (5 files changed, -46/+217)

  ## What changes were proposed in this pull request?
  This PR adds Python APIs for:
  - `ContinuousQueryManager`
  - `ContinuousQueryException`

  The `ContinuousQueryException` is a very basic wrapper; it doesn't provide the functionality that the Scala side provides, but it follows the same pattern as `AnalysisException`.

  For `ContinuousQueryManager`, all APIs are provided except for registering listeners.

  This PR also attempts to fix test flakiness by stopping all active streams just before tests.

  ## How was this patch tested?
  Python Doc tests and unit tests

  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #12673 from brkyvz/pyspark-cqm.
* [SPARK-12810][PYSPARK] PySpark CrossValidatorModel should support avgMetrics
  Kai Jiang, 2016-04-28 (2 files changed, -8/+37)

  ## What changes were proposed in this pull request?
  support avgMetrics in CrossValidatorModel with Python

  ## How was this patch tested?
  Doctest and `test_save_load` in `pyspark/ml/test.py`
  [JIRA](https://issues.apache.org/jira/browse/SPARK-12810)

  Author: Kai Jiang <jiangkai@gmail.com>
  Closes #12464 from vectorijk/spark-12810.
* [SPARK-14970][SQL] Prevent DataSource from enumerating all files in a directory if there is a user-specified schema
  Tathagata Das, 2016-04-28 (1 file changed, -10/+9)

  ## What changes were proposed in this pull request?
  The FileCatalog object gets created even if the user specifies a schema, which means the files in the directory are enumerated even though it's not necessary. For large directories this is very slow. Users specify a schema precisely to avoid this cost in such large-directory scenarios, so the current behavior defeats the purpose quite a bit.

  ## How was this patch tested?
  Hard to test this with a unit test.

  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #12748 from tdas/SPARK-14970.
* [SPARK-14916][MLLIB] A more friendly toString for FreqItemset in mllib.fpm
  Yuhao Yang, 2016-04-28 (1 file changed, -0/+4)

  ## What changes were proposed in this pull request?
  jira: https://issues.apache.org/jira/browse/SPARK-14916

  FreqItemset, as the result of FPGrowth, should have a more friendly toString() to help users and developers.

  Sample:
  {a, b}: 5
  {x, y, z}: 4

  ## How was this patch tested?
  existing unit tests.

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #12698 from hhbyyh/freqtos.
* [SPARK-14852][ML] refactored GLM summary into training, non-training summaries
  Joseph K. Bradley, 2016-04-28 (2 files changed, -55/+115)

  ## What changes were proposed in this pull request?
  This splits GeneralizedLinearRegressionSummary into 2 summary types:
  * GeneralizedLinearRegressionSummary, which does not store info from fitting (diagInvAtWA)
  * GeneralizedLinearRegressionTrainingSummary, which is a subclass of GeneralizedLinearRegressionSummary and stores info from fitting

  This also adds a method evaluate() which can produce a GeneralizedLinearRegressionSummary on a new dataset. The summary no longer provides the model itself as a public val.

  Also:
  * Fixes bug where GeneralizedLinearRegressionTrainingSummary was created with model, not summaryModel.
  * Adds hasSummary method.
  * Renames findSummaryModelAndPredictionCol -> getSummaryModel and simplifies that method.
  * In summary, extract values from model immediately in case user later changes those (e.g., predictionCol).
  * Pardon the style fixes; that is IntelliJ being obnoxious.

  ## How was this patch tested?
  Existing unit tests + updated test for evaluate and hasSummary

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12624 from jkbradley/model-summary-api.
* [SPARK-14965][SQL] Indicate an exception is thrown for a missing struct field
  Gregory Hart, 2016-04-28 (1 file changed, -4/+9)

  ## What changes were proposed in this pull request?
  Fix to ScalaDoc for StructType.

  ## How was this patch tested?
  Built locally.

  Author: Gregory Hart <greg.hart@thinkbiganalytics.com>
  Closes #12758 from freastro/hotfix/SPARK-14965.
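The behavior the clarified ScalaDoc documents, shown as a quick usage sketch (assumes a Spark SQL dependency on the classpath):

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(StructField("a", IntegerType)))
schema("a")          // returns StructField("a", IntegerType, true)
// schema("missing") // throws IllegalArgumentException because the field does not exist
```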
* [SPARK-14945][PYTHON] SparkSession Python API
  Andrew Or, 2016-04-28 (6 files changed, -240/+585)

  ## What changes were proposed in this pull request?
  ```
  Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/  '_/
     /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
        /_/

  Using Python version 2.7.5 (default, Mar 9 2014 22:15:05)
  SparkSession available as 'spark'.
  >>> spark
  <pyspark.sql.session.SparkSession object at 0x101f3bfd0>
  >>> spark.sql("SHOW TABLES").show()
  ...
  +---------+-----------+
  |tableName|isTemporary|
  +---------+-----------+
  |      src|      false|
  +---------+-----------+

  >>> spark.range(1, 10, 2).show()
  +---+
  | id|
  +---+
  |  1|
  |  3|
  |  5|
  |  7|
  |  9|
  +---+
  ```
  **Note**: This API is NOT complete in its current state. In particular, for now I left out the `conf` and `catalog` APIs, which were added later in Scala. These will be added later before 2.0.

  ## How was this patch tested?
  Python tests.

  Author: Andrew Or <andrew@databricks.com>
  Closes #12746 from andrewor14/python-spark-session.
* [SPARK-14935][CORE] DistributedSuite "local-cluster format" shouldn't actually launch clusters
  Xin Ren, 2016-04-28 (1 file changed, -12/+15)

  https://issues.apache.org/jira/browse/SPARK-14935

  In DistributedSuite, the "local-cluster format" test actually launches a bunch of clusters, but this doesn't seem necessary for what should just be a unit test of a regex. We should clean up the code so that this is testable without actually launching a cluster, which should buy us about 20 seconds per build.

  Passed unit test on my local machine.

  Author: Xin Ren <iamshrek@126.com>
  Closes #12744 from keypointt/SPARK-14935.
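A self-contained way to exercise the master-string pattern without launching anything; the regex below mirrors the documented `local-cluster[numWorkers, coresPerWorker, memoryPerWorker]` form and is illustrative rather than the exact constant defined in SparkContext:

```scala
val LOCAL_CLUSTER = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*\]""".r

"local-cluster[2, 1, 1024]" match {
  case LOCAL_CLUSTER(numWorkers, coresPerWorker, memoryPerWorker) =>
    assert(numWorkers == "2" && coresPerWorker == "1" && memoryPerWorker == "1024")
  case other =>
    assert(false, s"master string did not match the local-cluster pattern: $other")
}
```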
* [SPARK-14882][DOCS] Clarify that Spark can be cross-built for other Scala versions
  Sean Owen, 2016-04-28 (1 file changed, -1/+2)

  ## What changes were proposed in this pull request?
  Add simple clarification that Spark can be cross-built for other Scala versions.

  ## How was this patch tested?
  Automated doc build

  Author: Sean Owen <sowen@cloudera.com>
  Closes #12757 from srowen/SPARK-14882.
* [SPARK-6735][YARN] Add window based executor failure tracking mechanism for long running service
  jerryshao, 2016-04-28 (5 files changed, -8/+93)

  This work is based on twinkle-sachdeva's proposal. In parallel to the existing mechanism for AM failures, this adds a similar window-based mechanism for executor failure tracking, which is useful for long-running Spark services to mitigate executor failure problems.

  Please help to review, tgravescs sryza and vanzin

  Author: jerryshao <sshao@hortonworks.com>
  Closes #10241 from jerryshao/SPARK-6735.
* [SPARK-12235][SPARKR] Enhance mutate() to support replacing existing columns.
  Sun Rui, 2016-04-28 (2 files changed, -9/+69)

  Make the behavior of mutate more consistent with that in dplyr, besides supporting the replacement of existing columns:
  1. Throw an error message when there are duplicated column names in the DataFrame being mutated.
  2. When there are duplicated column names among the columns specified by arguments, the last column of the same name takes effect.

  Author: Sun Rui <rui.sun@intel.com>
  Closes #10220 from sun-rui/SPARK-12235.
* [SPARK-14576][WEB UI] Spark console should display Web UI url
  Ergin Seyfe, 2016-04-28 (4 files changed, -6/+12)

  ## What changes were proposed in this pull request?
  This is a proposal to print the Spark Driver UI link when spark-shell is launched.

  ## How was this patch tested?
  Launched spark-shell in local mode and cluster mode. The spark-shell console output included the following line: "Spark context Web UI available at <Spark web url>"

  Author: Ergin Seyfe <eseyfe@fb.com>
  Closes #12341 from seyfe/spark_console_display_webui_link.
* [SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType annotation
  Liang-Chi Hsieh, 2016-04-28 (8 files changed, -4/+513)

  ## What changes were proposed in this pull request?
  Currently we use the `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add a Spark dependency to user classes. For some user classes, it is unnecessary to add such a dependency, which will increase deployment difficulty. We should provide an alternative approach to register UDTs for user classes without the `SQLUserDefinedType` annotation.

  ## How was this patch tested?
  `UserDefinedTypeSuite`

  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Closes #12259 from viirya/improve-sql-usertype.
* [SPARK-14654][CORE] New accumulator API
  Wenchen Fan, 2016-04-28 (73 files changed, -842/+1071)

  ## What changes were proposed in this pull request?
  This PR introduces a new accumulator API which is much simpler than before:
  1. The type hierarchy is simplified; now we only have an `Accumulator` class.
  2. The `initialValue` and `zeroValue` concepts are combined into just one concept: `zeroValue`.
  3. There is only one `register` method; the accumulator registration and cleanup registration are combined.
  4. The `id`, `name` and `countFailedValues` are combined into an `AccumulatorMetadata`, which is provided during registration.

  `SQLMetric` is a good example to show the simplicity of this new API.

  What we break:
  1. No `setValue` anymore. In the new API, the intermediate type can be different from the result type, so it's very hard to implement a general `setValue`.
  2. An accumulator can't be serialized before it is registered.

  Problems that need to be addressed in follow-ups:
  1. With this new API, `AccumulatorInfo` doesn't make a lot of sense: the partial output is not partial updates, so we need to expose the intermediate value.
  2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics; how about sending a heartbeat to update internal metrics after the failure event?
  3. The public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` doesn't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` uses it to retrieve sql metrics.

  ## How was this patch tested?
  existing tests

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #12612 from cloud-fan/acc.
* [SPARK-10001][CORE] Don't short-circuit actions in signal handlers
  Jakob Odersky, 2016-04-27 (1 file changed, -3/+5)

  ## What changes were proposed in this pull request?
  The current signal handlers have a subtle bug: they stop evaluating registered actions as soon as one of them returns true, because `forall` is short-circuited. This PR adds a strict mapping stage before evaluating the returned results. There are no known occurrences of the bug and this is a preemptive fix.

  ## How was this patch tested?
  As with the original introduction of signal handlers, this was tested manually (unit testing with signals is not straightforward).

  Author: Jakob Odersky <jakob@odersky.com>
  Closes #12745 from jodersky/SPARK-10001-hotfix.
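The pitfall and the strict-mapping fix in miniature (plain Scala, not the Spark signal-handling code; the real predicate is negated, but the short-circuit behaves the same way):

```scala
val handlers: Seq[() => Boolean] = Seq(
  () => { println("handler 1"); true },
  () => { println("handler 2"); false },
  () => { println("handler 3"); true }   // never runs under the short-circuiting version
)

// forall stops as soon as the predicate yields false, so handler 3 is skipped:
val lazyResult = handlers.forall(h => h())

// Mapping eagerly first runs every handler, then combines the results:
val strictResult = handlers.map(h => h()).forall(identity)
```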
* [SPARK-14961] Build HashedRelation larger than 1G
  Davies Liu, 2016-04-27 (2 files changed, -57/+107)

  ## What changes were proposed in this pull request?
  Currently, LongToUnsafeRowMap uses a byte array as the underlying page, which can't be larger than 1G. This PR improves LongToUnsafeRowMap to scale up to 8G bytes by using an array of Long instead of an array of bytes.

  ## How was this patch tested?
  Manually ran a test to confirm that both UnsafeHashedRelation and LongHashedRelation could build a map larger than 2G.

  Author: Davies Liu <davies@databricks.com>
  Closes #12740 from davies/larger_broadcast.
* [SPARK-12143][SQL] Binary type support for Hive thrift server
  hyukjinkwon, 2016-04-27 (2 files changed, -2/+23)

  ## What changes were proposed in this pull request?
  https://issues.apache.org/jira/browse/SPARK-12143

  This PR adds support for conversion between `SparkRow` in Spark and `RowSet` in Hive for `BinaryType` as `Array[Byte]` (JDBC).

  ## How was this patch tested?
  Unit tests in `HiveThriftBinaryServerSuite` (regression test)

  Closes #10139

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #12733 from HyukjinKwon/SPARK-12143.
* [MINOR][MAINTENANCE] Sort the entries in .gitignore.
  Arun Allamsetty, 2016-04-27 (1 file changed, -47/+47)

  ## What changes were proposed in this pull request?
  The contents of `.gitignore` have been sorted to make it more readable. The actual contents of the file have not been changed.

  ## How was this patch tested?
  Does not require any tests.

  Author: Arun Allamsetty <arun@instructure.com>
  Closes #12742 from aa8y/gitignore.
* [SPARK-14966] SizeEstimator should ignore classes in the scala.reflect package
  Josh Rosen, 2016-04-27 (1 file changed, -0/+3)

  In local profiling, I noticed SizeEstimator spending tons of time estimating the size of objects which contain TypeTag or ClassTag fields. The problem with these tags is that they reference global Scala reflection objects, which, in turn, reference many singletons, such as TestHive. This throws off the accuracy of the size estimation and wastes tons of time traversing a huge object graph. As a result, I think that SizeEstimator should ignore any classes in the `scala.reflect` package.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #12741 from JoshRosen/ignore-scala-reflect-in-size-estimator.
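A minimal sketch of the skip rule described above (illustrative; the real check sits inside SizeEstimator's object-graph traversal):

```scala
// Skip any class from the scala.reflect package, since tags reference global reflection state.
def shouldIgnore(cls: Class[_]): Boolean = cls.getName.startsWith("scala.reflect.")

shouldIgnore(scala.reflect.classTag[Int].getClass)  // true: don't traverse this field
shouldIgnore(classOf[String])                       // false: estimate normally
```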
* [SPARK-14671][ML] Pipeline setStages should handle subclasses of PipelineStage
  Joseph K. Bradley, 2016-04-27 (2 files changed, -2/+12)

  ## What changes were proposed in this pull request?
  Pipeline.setStages failed for some code examples which worked in 1.5 but fail in 1.6. This tends to occur when using a mix of transformers from ml.feature. It is because Scala Arrays are invariant, and the addition of MLWritable to some transformers means the inferred stage arrays in those examples are not of type Array[PipelineStage]. This PR modifies the following to accept subclasses of PipelineStage:
  * Pipeline.setStages()
  * Params.w()

  ## How was this patch tested?
  Unit test which fails to compile before this fix.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12430 from jkbradley/pipeline-setstages.
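The variance pitfall is easy to reproduce with stand-in classes; the sketch below uses hypothetical TokenizerLike/HashingTFLike types rather than the real ml.feature transformers, and a simplified setStages signature:

```scala
abstract class Stage
class TokenizerLike extends Stage with Serializable
class HashingTFLike extends Stage with Serializable

// Type inference picks the most specific common type (here Stage with Serializable),
// and Scala Arrays are invariant, so this is not an Array[Stage]:
val stages = Array(new TokenizerLike, new HashingTFLike)

def setStagesStrict(value: Array[Stage]): Unit = ()        // rejects `stages`
def setStagesRelaxed(value: Array[_ <: Stage]): Unit = ()  // accepts any subtype-typed array

setStagesRelaxed(stages)    // compiles
// setStagesStrict(stages)  // would not compile: type mismatch
```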
* Revert "[SPARK-14683][DOCUMENTATION] Configure external links in ScalaDoc"Josh Rosen2016-04-271-2/+0
| | | | This reverts commit 3f49afee937a66d458e0c194e46d6a9e380e054e.
* [SPARK-13436][SPARKR] Added parameter drop to subsetting operator [
  Oscar D. Lara Yejas, 2016-04-27 (3 files changed, -42/+54)

  Added parameter drop to the subsetting operator [. This is useful to get a Column from a DataFrame, given its name. R supports it.

  In R:
  ```
  > name <- "Sepal_Length"
  > class(iris[, name])
  [1] "numeric"
  ```
  Currently, in SparkR:
  ```
  > name <- "Sepal_Length"
  > class(irisDF[, name])
  [1] "DataFrame"
  ```
  The previous code returns a DataFrame, which is inconsistent with R's behavior. SparkR should return a Column instead. Currently, the only way for the user to get a Column given a column name held in a character variable is through `eval(parse(x))`, where x is the string `"irisDF$Sepal_Length"`. That itself is pretty hacky. `SparkR:::getColumn()` is another choice, but I don't see why this method should be externalized. Instead, following R's way to do things, the proposed implementation allows this:
  ```
  > name <- "Sepal_Length"
  > class(irisDF[, name, drop=T])
  [1] "Column"
  > class(irisDF[, name, drop=F])
  [1] "DataFrame"
  ```
  This is consistent with R:
  ```
  > name <- "Sepal_Length"
  > class(iris[, name])
  [1] "numeric"
  > class(iris[, name, drop=F])
  [1] "data.frame"
  ```

  Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
  Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
  Closes #11318 from olarayej/SPARK-13436.
* [SPARK-14940][SQL] Move ExternalCatalog to own file
  Andrew Or, 2016-04-27 (12 files changed, -188/+210)

  ## What changes were proposed in this pull request?
  `interfaces.scala` was getting big. This just moves the biggest class in there to a new file for cleanliness.

  ## How was this patch tested?
  Just moving things around.

  Author: Andrew Or <andrew@databricks.com>
  Closes #12721 from andrewor14/move-external-catalog.
* [SPARK-14899][ML][PYSPARK] Remove spark.ml HashingTF hashingAlg option
  Yanbo Liang, 2016-04-27 (5 files changed, -91/+36)

  ## What changes were proposed in this pull request?
  Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks the behavior of `HashingTF`, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone.

  ## How was this patch tested?
  Unit tests.

  cc jkbradley

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12702 from yanboliang/spark-14899.
* [SPARK-14954] [SQL] Add PARTITION BY and BUCKET BY clause for data source CTAS syntax
  Cheng Lian, 2016-04-27 (3 files changed, -3/+106)

  Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API, but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:
  ```
  CREATE TABLE <table-name>
  USING <provider>
  [OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
  [PARTITIONED BY (col1, col2, ...)]
  [CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
  AS SELECT ...
  ```
  Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax.

  Author: Cheng Lian <lian@databricks.com>
  Author: Yin Huai <yhuai@databricks.com>
  Closes #12734 from liancheng/spark-14954.
* [SPARK-14867][BUILD] Remove `--force` option in `build/mvn`
  Dongjoon Hyun, 2016-04-27 (2 files changed, -22/+23)

  ## What changes were proposed in this pull request?
  Currently, `build/mvn` provides a convenient option, `--force`, in order to use the recommended version of maven without changing the PATH environment variable. However, there were two problems.

  - `dev/lint-java` does not use the newly installed maven.
    ```bash
    $ ./build/mvn --force clean
    $ ./dev/lint-java
    Using `mvn` from path: /usr/local/bin/mvn
    ```
  - It's not easy to type the `--force` option every time. If the `--force` option has been used once, we had better prefer the installed maven recommended by Spark.

  This PR makes `build/mvn` check for the existence of the maven installed by the `--force` option first. According to the comments, this PR aims at the following:
  - Detect the maven version from `pom.xml`.
  - Install maven if there is no maven or it is too old.
  - Remove the `--force` option.

  ## How was this patch tested?
  Manual.
  ```bash
  $ ./build/mvn --force clean
  $ ./dev/lint-java
  Using `mvn` from path: /Users/dongjoon/spark/build/apache-maven-3.3.9/bin/mvn
  ...
  $ rm -rf ./build/apache-maven-3.3.9/
  $ ./dev/lint-java
  Using `mvn` from path: /usr/local/bin/mvn
  ```

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12631 from dongjoon-hyun/SPARK-14867.
* [SPARK-14664][SQL] Implement DecimalAggregates optimization for Window queries
  Dongjoon Hyun, 2016-04-27 (3 files changed, -12/+161)

  ## What changes were proposed in this pull request?
  This PR aims to implement the decimal aggregation optimization for window queries by improving the existing `DecimalAggregates`. Historically, the `DecimalAggregates` optimizer was designed to transform general `sum/avg(decimal)`, but it breaks the recently added window queries like the following. These queries work well without the current `DecimalAggregates` optimizer.

  **Sum**
  ```scala
  scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").head
  java.lang.RuntimeException: Unsupported window function: MakeDecimal((sum(UnscaledValue(a#31)),mode=Complete,isDistinct=false),12,1)

  scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
  == Physical Plan ==
  WholeStageCodegen
  :  +- Project [sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23]
  :     +- INPUT
  +- Window [MakeDecimal((sum(UnscaledValue(a#21)),mode=Complete,isDistinct=false),12,1) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23]
     +- Exchange SinglePartition, None
        +- Generate explode([1.0,2.0]), false, false, [a#21]
           +- Scan OneRowRelation[]
  ```

  **Average**
  ```scala
  scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").head
  java.lang.RuntimeException: Unsupported window function: cast(((avg(UnscaledValue(a#40)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5))

  scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
  == Physical Plan ==
  WholeStageCodegen
  :  +- Project [avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44]
  :     +- INPUT
  +- Window [cast(((avg(UnscaledValue(a#42)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5)) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44]
     +- Exchange SinglePartition, None
        +- Generate explode([1.0,2.0]), false, false, [a#42]
           +- Scan OneRowRelation[]
  ```

  After this PR, those queries work fine and the new optimized physical plans look like the following.

  **Sum**
  ```scala
  scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
  == Physical Plan ==
  WholeStageCodegen
  :  +- Project [sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35]
  :     +- INPUT
  +- Window [MakeDecimal((sum(UnscaledValue(a#33)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),12,1) AS sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35]
     +- Exchange SinglePartition, None
        +- Generate explode([1.0,2.0]), false, false, [a#33]
           +- Scan OneRowRelation[]
  ```

  **Average**
  ```scala
  scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
  == Physical Plan ==
  WholeStageCodegen
  :  +- Project [avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47]
  :     +- INPUT
  +- Window [cast(((avg(UnscaledValue(a#45)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) / 10.0) as decimal(6,5)) AS avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47]
     +- Exchange SinglePartition, None
        +- Generate explode([1.0,2.0]), false, false, [a#45]
           +- Scan OneRowRelation[]
  ```

  In this PR, the *SUM over window* pattern matching is based on the code of hvanhovell; he should be credited for the work he did.

  ## How was this patch tested?
  Pass the Jenkins tests (with newly added testcases).

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12421 from dongjoon-hyun/SPARK-14664.
* [SPARK-14937][ML][DOCUMENT] spark.ml LogisticRegression sqlCtx in scala is inconsistent with java and python
  wm624@hotmail.com, 2016-04-27 (3 files changed, -7/+7)

  ## What changes were proposed in this pull request?
  In the spark.ml documentation, the LogisticRegression scala example uses sqlCtx. This is inconsistent with the java and python examples, which use sqlContext. In addition, a user can't copy & paste the example to run it in spark-shell, as sqlCtx doesn't exist in spark-shell while sqlContext does. This changes the scala example referenced by the spark.ml documentation.

  ## How was this patch tested?
  Compiled the example scala file; it passes compilation.

  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #12717 from wangmiao1981/doc.
* [SPARK-14930][SPARK-13693] Fix race condition in CheckpointWriter.stop()
  Josh Rosen, 2016-04-27 (1 file changed, -12/+5)

  CheckpointWriter.stop() is prone to a race condition: if one thread calls `stop()` right as a checkpoint write task begins to execute, that write task may become blocked when trying to access `fs`, the shared Hadoop FileSystem, since both the `fs` getter and `stop` method synchronize on the same lock. Here's a thread-dump excerpt which illustrates the problem:

  ```java
  "pool-31-thread-1" #156 prio=5 os_prio=31 tid=0x00007fea02cd2000 nid=0x5c0b waiting for monitor entry [0x000000013bc4c000]
     java.lang.Thread.State: BLOCKED (on object monitor)
      at org.apache.spark.streaming.CheckpointWriter.org$apache$spark$streaming$CheckpointWriter$$fs(Checkpoint.scala:302)
      - waiting to lock <0x00000007bf53ee78> (a org.apache.spark.streaming.CheckpointWriter)
      at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:224)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

  "pool-1-thread-1-ScalaTest-running-MapWithStateSuite" #11 prio=5 os_prio=31 tid=0x00007fe9ff879800 nid=0x5703 waiting on condition [0x000000012e54c000]
     java.lang.Thread.State: TIMED_WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)
      - parking to wait for <0x00000007bf564568> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
      at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1465)
      at org.apache.spark.streaming.CheckpointWriter.stop(Checkpoint.scala:291)
      - locked <0x00000007bf53ee78> (a org.apache.spark.streaming.CheckpointWriter)
      at org.apache.spark.streaming.scheduler.JobGenerator.stop(JobGenerator.scala:159)
      - locked <0x00000007bf53ea90> (a org.apache.spark.streaming.scheduler.JobGenerator)
      at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:115)
      - locked <0x00000007bf53d3f0> (a org.apache.spark.streaming.scheduler.JobScheduler)
      at org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:680)
      at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1219)
      at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:679)
      - locked <0x00000007bf516a70> (a org.apache.spark.streaming.StreamingContext)
      at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:644)
      - locked <0x00000007bf516a70> (a org.apache.spark.streaming.StreamingContext)
      [...]
  ```

  We can fix this problem by having `stop` and `fs` be synchronized on different locks: the synchronization on `stop` only needs to guard against multiple threads calling `stop` at the same time, whereas the synchronization on `fs` is only necessary for cross-thread visibility. There's only ever a single active checkpoint writer thread at a time, so we don't need to guard against concurrent access to `fs`. Thus, `fs` can simply become a `volatile` var, similar to `lastCheckpointTime`.

  This change should fix [SPARK-13693](https://issues.apache.org/jira/browse/SPARK-13693), a flaky `MapWithStateSuite` test suite which has recently been failing several times per day.

  It also results in a huge test speedup: prior to this patch, `MapWithStateSuite` took about 80 seconds to run, whereas it now runs in less than 10 seconds. For the `streaming` project's tests as a whole, they now run in ~220 seconds vs. ~354 before.

  /cc zsxwing and tdas for review.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #12712 from JoshRosen/fix-checkpoint-writer-race.
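A condensed sketch of the locking change described above (illustrative field names and types; the real code manages a Hadoop FileSystem): stop() keeps a lock that only guards against concurrent stop() calls, while the shared handle becomes a volatile var for cross-thread visibility, so the lone writer thread no longer blocks on the lock held by stop().

```scala
class CheckpointWriterSketch {
  @volatile private var fs: AnyRef = null   // was: a getter synchronized on the same lock as stop()

  private def getFs(): AnyRef = {
    if (fs == null) fs = new Object         // only the single writer thread calls this
    fs
  }

  def stop(): Unit = synchronized {
    // now guards only against two threads calling stop() at the same time
  }
}
```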
* [SPARK-14729][SCHEDULER] Refactored YARN scheduler creation code to use newly added ExternalClusterManager
  Hemant Bhanawat, 2016-04-27 (4 files changed, -67/+58)

  ## What changes were proposed in this pull request?
  With the addition of the ExternalClusterManager (ECM) interface in PR #11723, any cluster manager can now be integrated with Spark. It was suggested in the ExternalClusterManager PR that one of the existing cluster managers should start using the new interface to ensure that the API is correct. Ideally, all the existing cluster managers should eventually use the ECM interface, but as a first step YARN will now use it. This PR refactors the YARN code from the SparkContext.createTaskScheduler function into a YarnClusterManager that implements the ECM interface.

  ## How was this patch tested?
  Since this is refactoring, no new tests have been added. Existing tests have been run. Basic manual testing with YARN was done too.

  Author: Hemant Bhanawat <hemant@snappydata.io>
  Closes #12641 from hbhanawat/yarnClusterMgr.
* [SPARK-9656][MLLIB][PYTHON] Add missing methods to PySpark's Distributed Linear Algebra Classes
  Mike Dusenberry, 2016-04-27 (3 files changed, -4/+301)

  This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

  * `RowMatrix` <sup>**[1]**</sup>
    1. `computeGramianMatrix`
    2. `computeCovariance`
    3. `computeColumnSummaryStatistics`
    4. `columnSimilarities`
    5. `tallSkinnyQR` <sup>**[2]**</sup>
  * `IndexedRowMatrix` <sup>**[3]**</sup>
    1. `computeGramianMatrix`
  * `CoordinateMatrix`
    1. `transpose`
  * `BlockMatrix`
    1. `validate`
    2. `cache`
    3. `persist`
    4. `transpose`

  **[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227.

  **[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue, likely because their related helper functions in `PythonMLlibAPI` create the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion.

  **[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227.

  Author: Mike Dusenberry <mwdusenb@us.ibm.com>
  Closes #9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
* [SPARK-14874][SQL][STREAMING] Remove the obsolete Batch representation
  Liwei Lin, 2016-04-27 (5 files changed, -30/+4)

  ## What changes were proposed in this pull request?
  The `Batch` class, which had been used to indicate progress in a stream, was abandoned by [[SPARK-13985][SQL] Deterministic batches with ids](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b) and then became useless.

  This patch:
  - removes the `Batch` class
  - ~~does some related renaming~~ (update: this has been reverted)
  - fixes some related comments

  ## How was this patch tested?
  N/A

  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #12638 from lw-lin/remove-batch.
* [SPARK-14950][SQL] Fix BroadcastHashJoin's unique key Anti-Joins
  Herman van Hovell, 2016-04-27 (2 files changed, -14/+67)

  ### What changes were proposed in this pull request?
  Anti-Joins using BroadcastHashJoin's unique key code path are broken; they currently return Semi Join results. This PR fixes this bug.

  ### How was this patch tested?
  Added test cases to `ExistenceJoinSuite`.

  cc davies gatorsmile

  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #12730 from hvanhovell/SPARK-14950.
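A plain-collections sanity check of the semantics at stake (a left semi join keeps rows with a match, a left anti join keeps rows without one); this only illustrates the expected results, not the BroadcastHashJoin code path that was fixed:

```scala
val left  = Seq(1, 2, 2, 3, 3, 4)
val right = Set(1, 3)

val semiLike = left.filter(right.contains)     // Seq(1, 3, 3)  -- what the buggy path returned
val antiLike = left.filterNot(right.contains)  // Seq(2, 2, 4)  -- what an anti join should return
```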
* [SPARK-14949][SQL] Remove HiveConf dependency from InsertIntoHiveTable
  Reynold Xin, 2016-04-27 (1 file changed, -16/+14)

  ## What changes were proposed in this pull request?
  This patch removes the use of HiveConf from InsertIntoHiveTable. I think this is the last major use of HiveConf, and after this we can try to remove the execution HiveConf.

  ## How was this patch tested?
  Internal refactoring; should be covered by existing tests.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12728 from rxin/SPARK-14949.
* Unintentional white spaces in kryo classes configuration parameters
  Victor Chima, 2016-04-27 (2 files changed, -3/+4)

  ## What changes were proposed in this pull request?
  Pruned whitespace from the user-provided, comma-separated list of classes for **spark.kryo.classesToRegister** and **spark.kryo.registrator**.

  ## How was this patch tested?
  Manual tests

  Author: Victor Chima <blazy2k9@gmail.com>
  Closes #12701 from blazy2k9/master.
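A minimal sketch of the trimming applied to the comma-separated class list (the value below is a made-up example, and the parsing shown is generic string handling, not the exact Spark code):

```scala
val classesToRegister = "com.example.Foo, com.example.Bar , com.example.Baz"
val cleaned = classesToRegister.split(',').map(_.trim).filter(_.nonEmpty)
// cleaned: Array("com.example.Foo", "com.example.Bar", "com.example.Baz")
```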