Commit log, most recent first. Each entry shows: commit message (author, date, files changed, lines -/+).
* [SPARK-13783][ML] Model export/import for spark.ml: GBTs (Yanbo Liang, 2016-04-13, 8 files, -137/+262)

## What changes were proposed in this pull request?
* Added save/load for `GBTClassifier`/`GBTClassificationModel`/`GBTRegressor`/`GBTRegressionModel`.
* Meanwhile, I modified `EnsembleModelReadWrite.saveImpl/loadImpl` to support save/load of `treeWeights`.

## How was this patch tested?
Adds standard unit tests for GBT save/load.

cc jkbradley GayathriMurali

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12230 from yanboliang/spark-13783.
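For reference, a minimal usage sketch of the persistence added here (hedged: `training` is an assumed DataFrame with "label" and "features" columns, and the path is illustrative):

```scala
import org.apache.spark.ml.classification.{GBTClassifier, GBTClassificationModel}

// Fit a small GBT model, persist it, and load it back.
val gbt = new GBTClassifier().setMaxIter(10)
val model = gbt.fit(training)

// Persists params, trees, and (with this change) treeWeights.
model.write.overwrite().save("/tmp/spark-gbt-model")
val restored = GBTClassificationModel.load("/tmp/spark-gbt-model")
```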
* [SPARK-14388][SQL] Implement CREATE TABLE (Andrew Or, 2016-04-13, 18 files, -262/+571)

## What changes were proposed in this pull request?
This patch implements the `CREATE TABLE` command using the `SessionCatalog`. Previously we handled only `CTAS` and `CREATE TABLE ... USING`. This requires us to refactor `CatalogTable` to accept various fields (e.g. bucket and skew columns) and pass them to Hive.

WIP: Note that I haven't verified whether this actually works yet! But I believe it does.

## How was this patch tested?
Tests will come in a future commit.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12271 from andrewor14/create-table-ddl.
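A hedged illustration of the kind of Hive-style DDL this command covers (assumptions: `sqlContext` is a HiveContext, the table name and columns are made up, and the exact clause support at this point in development is whatever the parser accepts):

```scala
sqlContext.sql("""
  CREATE TABLE page_views (user_id BIGINT, url STRING)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (user_id) INTO 8 BUCKETS
  STORED AS PARQUET
""")
```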
* [SPARK-14568][ML] Instrumentation framework for logistic regression (Timothy Hunter, 2016-04-13, 2 files, -1/+127)

## What changes were proposed in this pull request?
This adds extra logging information about a `LogisticRegression` estimator when being fit on a dataset. With this PR, you see the following extra lines when running the example in the documentation:

```
16/04/13 07:19:00 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): training: numPartitions=1 storageLevel=StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1)
16/04/13 07:19:00 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): {"regParam":0.3,"elasticNetParam":0.8,"maxIter":10}
...
16/04/12 11:48:07 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_a89eb23cb386-358781145):numClasses=2
16/04/12 11:48:07 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_a89eb23cb386-358781145):numFeatures=692
...
16/04/13 07:19:01 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): training finished
```

## How was this patch tested?
This PR was manually tested.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #12331 from thunterdb/1604-instrumentation.
* Revert "[SPARK-14154][MLLIB] Simplify the implementation for ↵Xiangrui Meng2016-04-131-4/+73
| | | | | | Kolmogorov–Smirnov test" This reverts commit d2a819a6363190b946986ebf6f8001d520098c3b.
* [SPARK-14537][CORE] Make TaskSchedulerImpl waiting fail if context is shut down (Charles Allen, 2016-04-13, 1 file, -0/+5)

This patch makes the postStartHook throw an IllegalStateException if the SparkContext is shut down while it is waiting for the backend to be ready.

Author: Charles Allen <charles@allen-net.com>

Closes #12301 from drcrallen/SPARK-14537.
* [SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api (Liwei Lin, 2016-04-12, 2 files, -2/+2)

## What changes were proposed in this pull request?
- updated `OFF_HEAP` semantics for `StorageLevels.java`
- updated `OFF_HEAP` semantics for `storagelevel.py`

## How was this patch tested?
no need to test

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12126 from lw-lin/storagelevel.py.
* [SPARK-14554][SQL][FOLLOW-UP] use checkDataset to check the result (Wenchen Fan, 2016-04-13, 1 file, -1/+1)

## What changes were proposed in this pull request?
address this comment: https://github.com/apache/spark/pull/12322#discussion_r59417359

## How was this patch tested?
N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12346 from cloud-fan/tmp.
* [MINOR][SQL] Remove some unused imports in datasources. (hyukjinkwon, 2016-04-13, 10 files, -37/+10)

## What changes were proposed in this pull request?
It looks like several recent commits for datasources (maybe while removing the old `HadoopFsRelation` interface) missed removing some unused imports. This PR removes some unused imports in datasources.

## How was this patch tested?
`sbt scalastyle` and some unit tests for them.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12326 from HyukjinKwon/minor-imports.
* [SPARK-14579][SQL] Fix a race condition in StreamExecution.processAllAvailable (Shixiong Zhu, 2016-04-12, 1 file, -13/+27)

## What changes were proposed in this pull request?
There is a race condition in `StreamExecution.processAllAvailable`. Here is an execution order to reproduce it.

| Time | Thread 1 | MicroBatchThread |
|:----:|:---------|:-----------------|
| 1 | | `dataAvailable in constructNextBatch` returns false |
| 2 | addData(newData) | |
| 3 | `noNewData = false` in processAllAvailable | |
| 4 | | noNewData = true |
| 5 | `noNewData` is true so just return | |

The root cause is that checking `dataAvailable` and changing `noNewData` to true is not atomic. This PR puts these two actions into `synchronized` to make sure they are atomic.

In addition, this PR also has the following changes:

- Make `committedOffsets` and `availableOffsets` volatile to make sure they can be seen in other threads.
- Copy the reference of `availableOffsets` to a local variable so that `sourceStatuses` can use a snapshot of `availableOffsets`.

## How was this patch tested?
Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12339 from zsxwing/race-condition.
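A small, self-contained sketch of the locking idea (hedged: this is not the actual `StreamExecution` code, just the pattern): the availability check and the `noNewData` flip happen under one lock, so an `addData` call can no longer slip in between them.

```scala
class BatchSignal {
  private val lock = new Object
  private var pendingBatches = 0      // stands in for "dataAvailable"
  private var noNewData = false

  def addData(): Unit = lock.synchronized {
    pendingBatches += 1
    noNewData = false
  }

  def runOneBatch(): Unit = lock.synchronized {
    if (pendingBatches == 0) {
      noNewData = true                // atomic with the check above
      lock.notifyAll()
    } else {
      pendingBatches -= 1
    }
  }

  def processAllAvailable(): Unit = lock.synchronized {
    noNewData = false
    while (!noNewData) lock.wait(100)
  }
}
```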
* [SPARK-14578] [SQL] Fix codegen for CreateExternalRow with nested wide schema (Davies Liu, 2016-04-12, 2 files, -3/+20)

## What changes were proposed in this pull request?
With a wide schema, the expressions for the fields are split into multiple functions, but the variable for `loopVar` can't be accessed in the split functions. This PR changes it into a class member.

## How was this patch tested?
Added regression test.

Author: Davies Liu <davies@databricks.com>

Closes #12338 from davies/nested_row.
* [SPARK-14363] Fix executor OOM due to memory leak in the Sorter (Sital Kedia, 2016-04-12, 4 files, -4/+23)

## What changes were proposed in this pull request?
Fix a memory leak in the Sorter. When the UnsafeExternalSorter spills the data to disk, it does not free up the underlying pointer array. As a result, we see a lot of executor OOMs and also memory underutilization. This is a regression partially introduced in PR https://github.com/apache/spark/pull/9241

## How was this patch tested?
Tested by running a job and observed around 30% speedup after this change.

Author: Sital Kedia <skedia@fb.com>

Closes #12285 from sitalkedia/executor_oom.
* [SPARK-14547] Avoid DNS resolution for reusing connections (Reynold Xin, 2016-04-12, 1 file, -11/+20)

## What changes were proposed in this pull request?
This patch changes the connection creation logic in the network client module to avoid DNS resolution when reusing connections.

## How was this patch tested?
Testing in production. This is too difficult to test in isolation (for high-fidelity unit tests, we'd need to change the DNS resolution behavior in the JVM).

Author: Reynold Xin <rxin@databricks.com>

Closes #12315 from rxin/SPARK-14547.
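A hedged sketch of the technique (not the actual network-client code; host name and port are made up): key the connection pool on an unresolved address so cache lookups skip DNS, and resolve only when a brand-new connection has to be opened.

```scala
import java.net.InetSocketAddress

// Pool key: no DNS lookup happens here.
val key = InetSocketAddress.createUnresolved("shuffle-host.example.com", 7337)

// Only when no pooled connection exists do we pay for resolution.
def connectFresh(unresolved: InetSocketAddress): InetSocketAddress =
  new InetSocketAddress(unresolved.getHostName, unresolved.getPort) // triggers DNS resolution
```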
* [SPARK-14544] [SQL] improve performance of SQL UI tab (Davies Liu, 2016-04-12, 3 files, -35/+17)

## What changes were proposed in this pull request?
This PR improves the performance of the SQL UI by:

1) Remove the details column on the all-executions page (the first page in the SQL tab). We can check the details by entering the execution page.
2) break-all is super slow in Chrome recently, so switch to break-word.
3) Use "display: none" to hide a block.
4) Use one js closure for all the executions, not one for each.
5) Remove the height limitation of details; we don't need to scroll it in the tiny window.

## How was this patch tested?
Existing tests.

![ui](https://cloud.githubusercontent.com/assets/40902/14445712/68d7b258-0004-11e6-9b48-5d329b05d165.png)

Author: Davies Liu <davies@databricks.com>

Closes #12311 from davies/ui_perf.
* [SPARK-14513][CORE] Fix threads left behind after stopping SparkContext (Terence Yim, 2016-04-12, 3 files, -2/+21)

## What changes were proposed in this pull request?
Shutting down the `QueuedThreadPool` used by the Jetty `Server` to avoid thread leakage after the SparkContext is stopped.

Note: If this fix is going to be applied to `branch-1.6`, one more patch on the `NettyRpcEnv` class is needed so that `NettyRpcEnv._fileServer.shutdown` is called in the `NettyRpcEnv.cleanup` method. This is due to the removal of the `_fileServer` field in the `NettyRpcEnv` class in the master branch. Please advise if a second PR is necessary to bring this fix back to `branch-1.6`.

## How was this patch tested?
Ran ./dev/run-tests locally.

Author: Terence Yim <terence@cask.co>

Closes #12318 from chtyim/fixes/SPARK-14513-thread-leak.
* [SPARK-14414][SQL] improve the error message class hierarchy (bomeng, 2016-04-12, 4 files, -27/+9)

## What changes were proposed in this pull request?
Before this change we were using `AnalysisException`, `ParseException`, `NoSuchFunctionException` etc. when a parsing error is encountered. I am trying to make the usage consistent, with **minimum** impact on the current implementation, by changing the class hierarchy:

1. `NoSuchItemException` is removed, since it is an abstract class that simply takes a message string.
2. `NoSuchDatabaseException`, `NoSuchTableException`, `NoSuchPartitionException` and `NoSuchFunctionException` now extend `AnalysisException`, as does `ParseException`; they are all under the `AnalysisException` umbrella, but you can still use them in a granular way.

## How was this patch tested?
The existing test cases should cover this patch.

Author: bomeng <bmeng@us.ibm.com>

Closes #12314 from bomeng/SPARK-14414.
* [SPARK-14562] [SQL] improve constraints propagation in Union (Davies Liu, 2016-04-12, 2 files, -1/+29)

## What changes were proposed in this pull request?
Currently, Union only takes the intersection of the constraints from its children; all others are dropped. We should try to merge them together.

This PR tries to merge the constraints that have the same reference but come from different children; for example, `a > 10` and `a < 100` could be merged as `a > 10 || a < 100`.

## How was this patch tested?
Added more cases to an existing test.

Author: Davies Liu <davies@databricks.com>

Closes #12328 from davies/union_const.
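A toy, self-contained sketch of the merge rule described above (hedged: not the actual Catalyst code, just the idea that per-child constraints over the same reference are OR-ed instead of dropped):

```scala
case class Constraint(column: String, predicate: String)

def mergeUnionConstraints(left: Set[Constraint], right: Set[Constraint]): Set[Constraint] = {
  val common = left.intersect(right)              // what Union kept before this change
  val merged = for {
    l <- left -- common
    r <- right -- common
    if l.column == r.column                       // same reference, different children
  } yield Constraint(l.column, s"(${l.predicate}) || (${r.predicate})")
  common ++ merged
}

// mergeUnionConstraints(Set(Constraint("a", "a > 10")), Set(Constraint("a", "a < 100")))
// => Set(Constraint("a", "(a > 10) || (a < 100)"))
```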
* [SPARK-14556][SQL] Code clean-ups for package o.a.s.sql.execution.streaming.state (Liwei Lin, 2016-04-12, 5 files, -32/+31)

## What changes were proposed in this pull request?
- `StateStoreConf.**max**DeltasForSnapshot` was renamed to `StateStoreConf.**min**DeltasForSnapshot`
- some state switch checks were added
- improved consistency between method names and string literals
- other comments & typo fix

## How was this patch tested?
N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12323 from lw-lin/streaming-state-clean-up.
* [SPARK-14147][ML][SPARKR] SparkR predict should not output feature column (Yanbo Liang, 2016-04-12, 2 files, -2/+4)

## What changes were proposed in this pull request?
SparkR does not support the vector type, which is the default type of the feature column in ML. R's predict also does not output an intermediate feature column. So SparkR `predict` should not output the feature column. In this PR, I only fix this issue for `naiveBayes` and `survreg`. `kmeans` has the right code route already and `glm` will be fixed in the SparkRWrapper refactor (#12294).

## How was this patch tested?
No new tests. cc mengxr shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11958 from yanboliang/spark-14147.
* [SPARK-14563][ML] use a random table name instead of __THIS__ in SQLTransformer (Xiangrui Meng, 2016-04-12, 2 files, -4/+16)

## What changes were proposed in this pull request?
Use a random table name instead of `__THIS__` in SQLTransformer, and add a test for `transformSchema`. The problems of using `__THIS__` are:

* It doesn't work under HiveContext (in Spark 1.6)
* Race conditions

## How was this patch tested?
* Manual test with HiveContext.
* Added a unit test for `transformSchema` to improve coverage.

cc: yhuai

Author: Xiangrui Meng <meng@databricks.com>

Closes #12330 from mengxr/SPARK-14563.
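For context, typical `SQLTransformer` usage (a sketch; `df` is an assumed input DataFrame with numeric columns `v1` and `v2`). The `__THIS__` placeholder in the statement still refers to the input dataset; this change only affects the name of the temporary table it is registered under internally.

```scala
import org.apache.spark.ml.feature.SQLTransformer

val sqlTrans = new SQLTransformer()
  .setStatement("SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

val transformed = sqlTrans.transform(df)
```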
* [SPARK-13597][PYSPARK][ML] Python API for GeneralizedLinearRegression (Kai Jiang, 2016-04-12, 1 file, -0/+145)

## What changes were proposed in this pull request?
Python API for GeneralizedLinearRegression
JIRA: https://issues.apache.org/jira/browse/SPARK-13597

## How was this patch tested?
The patch is tested with Python doctest.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #11468 from vectorijk/spark-13597.
* [SPARK-13322][ML] AFTSurvivalRegression supports feature standardization (Yanbo Liang, 2016-04-12, 2 files, -34/+93)

## What changes were proposed in this pull request?
AFTSurvivalRegression should support feature standardization; it will improve the convergence rate. I tested the convergence rate on the [Ovarian](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/ovarian.html) data, which is a standard dataset that comes with the survival library in R:

* without standardization (before this PR) -> 74 iterations.
* with standardization (after this PR) -> 38 iterations.

But after this fix, with or without `standardization` will converge to the same solution. It means that `standardization = false` will run the same code route as `standardization = true`, because if the features are not standardized at all, it can result in convergence issues when the features have very different scales. This behavior is the same as ML [`LinearRegression` and `LogisticRegression`](https://issues.apache.org/jira/browse/SPARK-8522). See more discussion about this topic at #11247.

cc mengxr

## How was this patch tested?
unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11365 from yanboliang/spark-13322.
* [SPARK-12566][SPARK-14324][ML] GLM model family, link function support in SparkR:::glm (Yanbo Liang, 2016-04-12, 4 files, -259/+169)

* SparkR glm supports families and link functions which match R's signature for family.
* SparkR glm API refactor. The comparative standard of the new API is R glm, so I only expose the arguments that R glm supports: `formula, family, data, epsilon and maxit`.
* This PR focuses on glm() and predict(); summary statistics will be done in a separate PR after this gets in.
* This PR depends on #12287, which makes GLMs support link prediction on the Scala side. After that is merged, I will add more tests for predict() to this PR.

Unit tests.

cc mengxr jkbradley hhbyyh

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12294 from yanboliang/spark-12566.
* [SPARK-14474][SQL] Move FileSource offset log into checkpointLocation (Shixiong Zhu, 2016-04-12, 8 files, -33/+141)

## What changes were proposed in this pull request?
Now that we have a single location for storing checkpointed state, this PR propagates the checkpoint location into FileStreamSource so that we don't have one random log off on its own.

## How was this patch tested?
test("metadataPath should be in checkpointLocation")

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12247 from zsxwing/file-source-log-location.
* [SPARK-3724][ML] RandomForest: More options for feature subset size. (Yong Tang, 2016-04-12, 6 files, -3/+95)

## What changes were proposed in this pull request?
This PR tries to support more options for the feature subset size in the RandomForest implementation. Previously, RandomForest only supported "auto", "all", "sqrt", "log2", "onethird". This PR tries to support any given value to allow model search.

In this PR, `featureSubsetStrategy` can be passed as: a) a real number in the range `(0.0-1.0]` that represents the fraction of the number of features in each subset, or b) an integer number (`>0`) that represents the number of features in each subset.

## How was this patch tested?
Two tests, `JavaRandomForestClassifierSuite` and `JavaRandomForestRegressorSuite`, have been updated to check the additional options for params in this PR. An additional test has been added to `org.apache.spark.mllib.tree.RandomForestSuite` to cover the cases in this PR.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #11989 from yongtang/SPARK-3724.
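A hedged example of the new accepted values, shown on the ML API (assumption: the same strategy strings are handled as described above; only the numeric strings are new):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

val byFraction = new RandomForestClassifier().setFeatureSubsetStrategy("0.5")  // 50% of the features
val byCount    = new RandomForestClassifier().setFeatureSubsetStrategy("10")   // 10 features per split
val byName     = new RandomForestClassifier().setFeatureSubsetStrategy("sqrt") // pre-existing option
```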
* [SPARK-14488][SPARK-14493][SQL] "CREATE TEMPORARY TABLE ... USING ... AS SELECT" shouldn't create persisted table (Cheng Lian, 2016-04-12, 2 files, -6/+53)

## What changes were proposed in this pull request?
When planning the logical plan node `CreateTableUsingAsSelect`, we neglected its `temporary` field and always generated a `CreateMetastoreDataSourceAsSelect`. This PR fixes this issue by generating `CreateTempTableUsingAsSelect` when `temporary` is true.

This PR also fixes SPARK-14493, since the root cause of SPARK-14493 is that `CreateMetastoreDataSourceAsSelect` uses the default Hive warehouse location when the `PATH` data source option is absent.

## How was this patch tested?
Added a test case to create a temporary table using the target syntax and check whether it's indeed a temporary table.

Author: Cheng Lian <lian@databricks.com>

Closes #12303 from liancheng/spark-14488-fix-ctas-using.
* [SPARK-14508][BUILD] Add a new ScalaStyle Rule `OmitBracesInCase` (Dongjoon Hyun, 2016-04-12, 64 files, -293/+164)

## What changes were proposed in this pull request?
According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.

```
case: Always omit braces in case clauses.
```

This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.

## How was this patch tested?
Pass the Jenkins tests (including Scala style checking)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12280 from dongjoon-hyun/SPARK-14508.
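The style the rule enforces, on a small self-contained example:

```scala
def describe(value: Option[Int]): String = value match {
  // Rejected by OmitBracesInCase: case Some(x) => { s"got $x" }
  case Some(x) =>
    s"got $x"      // braces omitted in the case clause
  case None =>
    "empty"
}
```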
* [SPARK-14535][SQL] Remove buildInternalScan from FileFormat (Wenchen Fan, 2016-04-11, 11 files, -681/+5)

## What changes were proposed in this pull request?
Now that `HadoopFsRelation` with all kinds of file formats can be handled in `FileSourceStrategy`, we can remove the branches for `HadoopFsRelation` in `FileSourceStrategy` and the `buildInternalScan` API from `FileFormat`.

## How was this patch tested?
existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12300 from cloud-fan/remove.
* [SPARK-14554][SQL] disable whole stage codegen if there are too many input columns (Wenchen Fan, 2016-04-11, 2 files, -2/+11)

## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/12047/files#diff-94a1f59bcc9b6758c4ca874652437634R529, we may split the field expression code in `CreateExternalRow` to support wide tables. However, the whole stage codegen framework doesn't support it, because the input for expressions is not always the input row, but can be `CodeGenContext.currentVars`, which doesn't work well with `CodeGenContext.splitExpressions`.

Actually we do have a check to guard against these cases, but it's incomplete: it only checks output fields.

This PR improves the whole stage codegen support check to disable it if there are too many input fields, so that we can avoid splitting field expression code in `CreateExternalRow` for whole stage codegen.

TODO: Is it a better solution if we can make `CodeGenContext.currentVars` work well with `CodeGenContext.splitExpressions`?

## How was this patch tested?
new test in DatasetSuite.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12322 from cloud-fan/codegen.
* [SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and Drop Table (gatorsmile, 2016-04-11, 1 file, -24/+26)

#### What changes were proposed in this pull request?
In this PR, we are trying to address the comment in the original PR: https://github.com/apache/spark/commit/dfce9665c4b2b29a19e6302216dae2800da68ff9#commitcomment-17057030

In this PR, we check whether the table/view exists at the beginning, and then we do not need to capture the exceptions, including `NoSuchTableException` and `InvalidTableException`. We still capture the NonFatal exception when doing `sqlContext.cacheManager.tryUncacheQuery`.

#### How was this patch tested?
The existing test cases should cover the code changes of this PR.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12321 from gatorsmile/dropViewFollowup.
* [SPARK-14132][SPARK-14133][SQL] Alter table partition DDLs (Andrew Or, 2016-04-11, 12 files, -206/+360)

## What changes were proposed in this pull request?
This implements a few alter table partition commands using the `SessionCatalog`. In particular:

```
ALTER TABLE ... ADD PARTITION ...
ALTER TABLE ... DROP PARTITION ...
ALTER TABLE ... RENAME PARTITION ... TO ...
```

The following operations are not supported, and an `AnalysisException` with a helpful error message will be thrown if the user tries to use them:

```
ALTER TABLE ... EXCHANGE PARTITION ...
ALTER TABLE ... ARCHIVE PARTITION ...
ALTER TABLE ... UNARCHIVE PARTITION ...
ALTER TABLE ... TOUCH ...
ALTER TABLE ... COMPACT ...
ALTER TABLE ... CONCATENATE
MSCK REPAIR TABLE ...
```

## How was this patch tested?
`DDLSuite`, `DDLCommandSuite` and `HiveDDLCommandSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #12220 from andrewor14/alter-partition-ddl.
* [MINOR][ML] Fixed MLlib build warnings (Joseph K. Bradley, 2016-04-12, 2 files, -0/+4)

## What changes were proposed in this pull request?
Fixes to eliminate warnings during package and doc builds.

## How was this patch tested?
Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12263 from jkbradley/warning-cleanups.
* [SPARK-14520][SQL] Use correct return type in VectorizedParquetInputFormat (Liang-Chi Hsieh, 2016-04-11, 1 file, -2/+2)

## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14520

`VectorizedParquetInputFormat` inherits `ParquetInputFormat` and overrides `createRecordReader`. However, its overridden `createRecordReader` returns a `ParquetRecordReader`. It should return a `RecordReader`. Otherwise, a `ClassCastException` will be thrown.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12292 from viirya/fix-vectorized-input-format.
* [SPARK-14475] Propagate user-defined context from driver to executors (Eric Liang, 2016-04-11, 24 files, -36/+138)

## What changes were proposed in this pull request?
This adds a new API call, `TaskContext.getLocalProperty`, for getting properties set in the driver from executors. These local properties are automatically propagated from the driver to executors. For streaming, the context for streaming tasks will be the initial driver context when ssc.start() is called.

## How was this patch tested?
Unit tests.

cc JoshRosen

Author: Eric Liang <ekl@databricks.com>

Closes #12248 from ericl/sc-2813.
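A minimal sketch of the new API (assumptions: `sc` is an existing `SparkContext`, and the property key/value are made up): a local property set on the driver becomes visible to tasks through `TaskContext`.

```scala
import org.apache.spark.TaskContext

sc.setLocalProperty("user.context", "request-42")

val seen = sc.parallelize(1 to 4, numSlices = 2).map { _ =>
  TaskContext.get().getLocalProperty("user.context")   // the call added by this patch
}.collect()
// seen: Array(request-42, request-42, request-42, request-42)
```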
* [SPARK-10521][SQL] Utilize Docker for test DB2 JDBC Dialect support (Luciano Resende, 2016-04-11, 8 files, -4/+215)

Add integration tests based on Docker to test DB2 JDBC dialect support.

Author: Luciano Resende <lresende@apache.org>

Closes #9893 from lresende/SPARK-10521.
* [SPARK-14298][ML][MLLIB] Add unit test for EM LDA disable checkpointing (Yanbo Liang, 2016-04-11, 1 file, -0/+11)

## What changes were proposed in this pull request?
This is a follow-up for #12089; it adds a unit test for EM LDA which tests that checkpointing is disabled when `checkpointInterval = -1` is set.

## How was this patch tested?
unit test.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12286 from yanboliang/spark-14298-followup.
* [SPARK-13600][MLLIB] Use approxQuantile from DataFrame stats in QuantileDiscretizer (Oliver Pierson, 2016-04-11, 2 files, -169/+65)

## What changes were proposed in this pull request?
QuantileDiscretizer can return an unexpected number of buckets in certain cases. This PR proposes to fix this issue and also refactor QuantileDiscretizer to use approxQuantile from the DataFrame stats functions.

## How was this patch tested?
QuantileDiscretizerSuite unit tests (some existing tests will change or even be removed in this PR)

Author: Oliver Pierson <ocp@gatech.edu>

Closes #11553 from oliverpierson/SPARK-13600.
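The DataFrame-stats primitive the refactor builds on, as a usage sketch (assumptions: `df` is an existing DataFrame with a numeric column named "hour"; the last argument is the relative error bound):

```scala
val splits: Array[Double] =
  df.stat.approxQuantile("hour", Array(0.25, 0.5, 0.75), 0.001)
```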
* [SPARK-14494][SQL] Fix the race conditions in MemoryStream and MemorySink (Shixiong Zhu, 2016-04-11, 1 file, -9/+16)

## What changes were proposed in this pull request?
Make sure accesses to mutable variables in MemoryStream and MemorySink are protected by `synchronized`.

This is probably why MemorySinkSuite failed here: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.2/650/testReport/junit/org.apache.spark.sql.streaming/MemorySinkSuite/registering_as_a_table/

## How was this patch tested?
Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12261 from zsxwing/memory-race-condition.
* [SPARK-14502] [SQL] Add optimization for Binary Comparison Simplification (Dongjoon Hyun, 2016-04-11, 2 files, -0/+119)

## What changes were proposed in this pull request?
We can simplify binary comparisons with semantically-equal operands:

1. Replace '<=>' with 'true' literal.
2. Replace '=', '<=', and '>=' with 'true' literal if both operands are non-nullable.
3. Replace '<' and '>' with 'false' literal if both operands are non-nullable.

For example, the following example plan

```
scala> sql("SELECT * FROM (SELECT explode(array(1,2,3)) a) T WHERE a BETWEEN a AND a+7").explain()
...
:  +- Filter ((a#59 >= a#59) && (a#59 <= (a#59 + 7)))
...
```

will be optimized into the following.

```
:  +- Filter (a#47 <= (a#47 + 7))
```

## How was this patch tested?
Pass the Jenkins tests including new `BinaryComparisonSimplificationSuite`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12267 from dongjoon-hyun/SPARK-14502.
* [SPARK-14528] [SQL] Fix same result of Union (Davies Liu, 2016-04-11, 2 files, -5/+9)

## What changes were proposed in this pull request?
This PR fixes sameResult() for Union.

## How was this patch tested?
Added regression test.

Author: Davies Liu <davies@databricks.com>

Closes #12295 from davies/fix_sameResult.
* [SPARK-14462][ML][MLLIB] Add the mllib-local build to maven pom (DB Tsai, 2016-04-11, 7 files, -4/+167)

## What changes were proposed in this pull request?
In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR will create a new jar called mllib-local with minimal dependencies.

The previous PR was failing the build because of the `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` is used in the mllib-local tests, similar to sketch.

Thanks.

## How was this patch tested?
Unit tests

mengxr tedyu holdenk

Author: DB Tsai <dbt@netflix.com>

Closes #12298 from dbtsai/dbtsai-mllib-local-build-fix.
* [SPARK-14510][MLLIB] Add args-checking for LDA and StreamingKMeans (Zheng RuiFeng, 2016-04-11, 2 files, -3/+17)

## What changes were proposed in this pull request?
Add argument checking for LDA and StreamingKMeans.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12062 from zhengruifeng/initmodel.
* [SPARK-14500] [ML] Accept Dataset[_] instead of DataFrame in MLlib APIs (Xiangrui Meng, 2016-04-11, 75 files, -240/+296)

## What changes were proposed in this pull request?
This PR updates MLlib APIs to accept `Dataset[_]` as input where `DataFrame` was the input type. This PR doesn't change the output type. In Java, `Dataset[_]` maps to `Dataset<?>`, which includes `Dataset<Row>`. Some implementations were changed in order to return `DataFrame`. Tests and examples were updated. Note that this is a breaking change for subclasses of Transformer/Estimator.

Lol, we don't have to rename the input argument, which has been `dataset` since Spark 1.2.

TODOs:
- [x] update MiMaExcludes (seems all covered by explicit filters from SPARK-13920)
- [x] Python
- [x] add a new test to accept Dataset[LabeledPoint]
- [x] remove unused imports of Dataset

## How was this patch tested?
Existing unit tests with some modifications.

cc: rxin jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #12274 from mengxr/SPARK-14500.
* [SPARK-14372][SQL] Dataset.randomSplit() needs a Java version (Rekha Joshi, 2016-04-11, 2 files, -1/+26)

## What changes were proposed in this pull request?
1. Added method `randomSplitAsList()` in Dataset for Java, for https://issues.apache.org/jira/browse/SPARK-14372

## How was this patch tested?
TestSuite

Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: Joshi <rekhajoshm@gmail.com>

Closes #12184 from rekhajoshm/SPARK-14372.
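A usage sketch of the Java-friendly variant next to the existing Scala one (assumptions: `ds` is an existing `Dataset[Row]` and the weights/seed are made up; the new method returns a `java.util.List` so Java callers avoid Scala arrays):

```scala
import org.apache.spark.sql.{Dataset, Row}

val parts: java.util.List[Dataset[Row]] = ds.randomSplitAsList(Array(0.8, 0.2), seed = 42L)

// The existing Scala API is unchanged:
val Array(train, test) = ds.randomSplit(Array(0.8, 0.2), seed = 42L)
```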
* [MINOR][DOCS] Fix wrong data types in JSON Datasets example. (Dongjoon Hyun, 2016-04-11, 1 file, -4/+4)

## What changes were proposed in this pull request?
This PR fixes the `age` data types from `integer` to `long` in `SQL Programming Guide: JSON Datasets`.

## How was this patch tested?
Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12290 from dongjoon-hyun/minor_fix_type_in_json_example.
* [SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and Drop Table (gatorsmile, 2016-04-10, 12 files, -24/+24)

#### What changes were proposed in this pull request?
This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we can still capture the user errors if users try to drop a table using `DROP VIEW`.

#### How was this patch tested?
Modified the existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12284 from gatorsmile/followupDropTable.
* [SPARK-14419] [MINOR] coding style cleanup (Davies Liu, 2016-04-10, 2 files, -24/+13)

## What changes were proposed in this pull request?
Making them more consistent.

## How was this patch tested?
Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #12289 from davies/cleanup_style.
* [SPARK-14415][SQL] All functions should show usages by command `DESC FUNCTION` (Dongjoon Hyun, 2016-04-10, 28 files, -25/+489)

## What changes were proposed in this pull request?
Currently, many functions do not show usages, like the following.

```
scala> sql("desc function extended `sin`").collect().foreach(println)
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: To be added.]
[Extended Usage: To be added.]
```

This PR adds descriptions for functions and adds a testcase to prevent adding a function without usage.

```
scala> sql("desc function extended `sin`").collect().foreach(println);
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: sin(x) - Returns the sine of x.]
[Extended Usage:
> SELECT sin(0);
 0.0]
```

The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`.

## How was this patch tested?
Pass the Jenkins tests (including new testcases.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12185 from dongjoon-hyun/SPARK-14415.
* Update KMeansExample.scala (Örjan Lundberg, 2016-04-10, 1 file, -1/+1)

## What changes were proposed in this pull request?
The example does not work without the DataFrame import.

## How was this patch tested?
example doc only; the example does not work without the DataFrame import.

Author: Örjan Lundberg <orjan.lundberg@gmail.com>

Closes #12277 from oluies/patch-1.
* [SPARK-14497][ML] Use top instead of sortBy() to get top N frequent words as dict in CountVectorizer (fwang1, 2016-04-10, 2 files, -13/+8)

## What changes were proposed in this pull request?
Replace sortBy() with top() to calculate the top N frequent words as the dictionary.

## How was this patch tested?
existing unit tests. Terms with the same TF would be sorted in descending order; the test would fail if the terms with the same TF were hard-coded into the dictionary, like "c", "d"...

Author: fwang1 <desperado.wf@gmail.com>

Closes #12265 from lionelfeng/master.
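The RDD primitive the fix switches to, in a toy sketch (assumptions: `sc` is an existing `SparkContext` and the word counts are made up): `top()` keeps only the N largest elements rather than fully sorting all (term, count) pairs.

```scala
val wordCounts = sc.parallelize(Seq(("a", 5L), ("b", 3L), ("c", 9L), ("d", 3L)))

val vocab: Array[(String, Long)] =
  wordCounts.top(2)(Ordering.by[(String, Long), Long](_._2))
// => Array(("c", 9), ("a", 5))
```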
* [SPARK-14357][CORE] Properly handle the root cause being a commit denied exception (Jason Moore, 2016-04-09, 3 files, -1/+93)

## What changes were proposed in this pull request?
When deciding whether a CommitDeniedException caused a task to fail, consider the root cause of the Exception.

## How was this patch tested?
Added a test suite for the component that extracts the root cause of the error. Made a distribution after cherry-picking this commit to branch-1.6 and used it to run our Spark application that would quite often fail due to the CommitDeniedException.

Author: Jason Moore <jasonmoore2k@outlook.com>

Closes #12228 from jasonmoore2k/SPARK-14357.