spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	[MINOR][BUILD] Enable RAT checking on `LZ4BlockInputStream.java`.	Dongjoon Hyun	2016-04-27	2	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Because `LZ4BlockInputStream.java` is not licensed to the Apache Software Foundation (ASF), the Apache License header of that file has not been checked until now. This PR enables RAT checking on `LZ4BlockInputStream.java` by removing it from `dev/.rat-excludes`. This will prevent accidental removal of the Apache License header from that file. ## How was this patch tested? Pass the Jenkins tests (specifically, the RAT check stage). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12677 from dongjoon-hyun/minor_rat_exclusion_file.
*	[SPARK-14130][SQL] Throw exceptions for ALTER TABLE ADD/REPLACE/CHANGE ↵	Yin Huai	2016-04-27	184	-819/+194
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands ## What changes were proposed in this pull request? This PR will make Spark SQL not allow ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands. ## How was this patch tested? Existing tests. For those tests that I put in the blacklist, I am adding the useful parts back to SQLQuerySuite. Author: Yin Huai <yhuai@databricks.com> Closes #12714 from yhuai/banNativeCommand.
*	[SPARK-14944][SPARK-14943][SQL] Remove HiveConf from HiveTableScanExec, ↵	Reynold Xin	2016-04-26	6	-46/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	HiveTableReader, and ScriptTransformation ## What changes were proposed in this pull request? This patch removes HiveConf from HiveTableScanExec and HiveTableReader and instead just uses our own configuration system. I'm splitting the large change of removing HiveConf into multiple independent pull requests because it is very difficult to debug test failures when they are all combined in one giant one. ## How was this patch tested? Should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12727 from rxin/SPARK-14944.
*	[SPARK-14911] [CORE] Fix a potential data race in TaskMemoryManager	Liwei Lin	2016-04-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? [[SPARK-13210][SQL] catch OOM when allocate memory and expand array](https://github.com/apache/spark/commit/37bc203c8dd5022cb11d53b697c28a737ee85bcc) introduced an `acquiredButNotUsed` field, but it might not be correctly synchronized: - the write `acquiredButNotUsed += acquired` is guarded by `this` lock (see [here](https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271)); - the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed, taskAttemptId, tungstenMemoryMode)` (see [here](https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400)) might not be correctly synchronized, and thus might not see `acquiredButNotUsed`'s most recent value. This patch makes `acquiredButNotUsed` volatile to fix this. ## How was this patch tested? This should be covered by existing suites. Author: Liwei Lin <lwlin7@gmail.com> Closes #12681 from lw-lin/fix-acquiredButNotUsed.
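The visibility issue described in this commit can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the actual `TaskMemoryManager` code: the `Tracker` class and its method names are hypothetical, and only the field name `acquiredButNotUsed` is taken from the commit message.

```java
public class VolatileSketch {
    // Hypothetical stand-in for the class that owns the counter.
    static class Tracker {
        // volatile: a read made outside the lock still observes the latest write.
        private volatile long acquiredButNotUsed = 0L;

        // The write stays guarded by the object's lock, so the
        // read-modify-write (+=) remains atomic with respect to other writers.
        synchronized void recordUnused(long acquired) {
            acquiredButNotUsed += acquired;
        }

        // Unsynchronized read; visibility relies on the field being volatile.
        long unusedSoFar() {
            return acquiredButNotUsed;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Tracker tracker = new Tracker();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                tracker.recordUnused(1L);
            }
        });
        writer.start();
        writer.join();
        System.out.println(tracker.unusedSoFar());
    }
}
```

Note that `volatile` only addresses visibility of the read; the compound update still needs the lock, which is why the sketch keeps `synchronized` on the writer.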
*	[SPARK-14913][SQL] Simplify configuration API	Reynold Xin	2016-04-26	42	-671/+368
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We currently expose both Hadoop configuration and Spark SQL configuration in RuntimeConfig. I think we can remove the Hadoop configuration part, and simply generate Hadoop Configuration on the fly by passing all the SQL configurations into it. This way, there is a single interface (in Java/Scala/Python/SQL) for end-users. As part of this patch, I also removed some config options deprecated in Spark 1.x. ## How was this patch tested? Updated relevant tests. Author: Reynold Xin <rxin@databricks.com> Closes #12689 from rxin/SPARK-14913.
*	[SPARK-13477][SQL] Expose new user-facing Catalog interface	Andrew Or	2016-04-26	31	-325/+1090
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? #12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface. ## How was this patch tested? See `CatalogSuite`. Author: Andrew Or <andrew@databricks.com> Closes #12713 from andrewor14/user-facing-catalog.
*	[SPARK-14445][SQL] Support native execution of SHOW COLUMNS and SHOW PARTITIONS	Dilip Biswal	2016-04-27	16	-31/+401
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR adds Native execution of SHOW COLUMNS and SHOW PARTITION commands. Command Syntax: ``` SQL SHOW COLUMNS (FROM \| IN) table_identifier [(FROM \| IN) database] ``` ``` SQL SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)] ``` ## How was this patch tested? Added test cases in HiveCommandSuite to verify execution and DDLCommandSuite to verify plans. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #12222 from dilipbiswal/dkb_show_columns.
*	[SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian ↵	Joseph K. Bradley	2016-04-26	7	-44/+353
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	in mllib-local ## What changes were proposed in this pull request? Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API. This was added after 1.6, so we can modify this API without breaking APIs. This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes: * Renamed fields to match numpy, scipy: mu => mean, sigma => cov This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves: * Modifying the constructor * Adding a computeProbabilities method Also: * Added EPSILON to mllib-local for use in MultivariateGaussian ## How was this patch tested? Existing unit tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12593 from jkbradley/sparkml-gmm-fix.
*	[SPARK-13734][SPARKR] Added histogram function	Oscar D. Lara Yejas	2016-04-26	4	-0/+170
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Added method histogram() to compute the histogram of a Column Usage: ``` ## Create a DataFrame from the Iris dataset irisDF <- createDataFrame(sqlContext, iris) ## Render a histogram for the Sepal_Length column histogram(irisDF, "Sepal_Length", nbins=12) ``` ![histogram](https://cloud.githubusercontent.com/assets/13985649/13588486/e1e751c6-e484-11e5-85db-2fc2115c4bb2.png) Note: Usage will change once SPARK-9325 is figured out so that histogram() only takes a Column as a parameter, as opposed to a DataFrame and a name ## How was this patch tested? All unit tests pass. I added specific unit cases for different scenarios. Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com> Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net> Closes #11569 from olarayej/SPARK-13734.
*	[SPARK-14925][BUILD] Re-introduce 'unused' dependency so that published POMs ↵	Josh Rosen	2016-04-26	3	-0/+31
\| \| \| \| \| \| \| \| \| \|	are flattened Spark's published POMs are supposed to be flattened and not contain variable substitution (see SPARK-3812), but the dummy dependency that was required for this was accidentally removed. We should re-introduce this dependency in order to fix an issue where the un-flattened POMs cause the wrong dependencies to be included in Scala 2.10 published POMs. Author: Josh Rosen <joshrosen@databricks.com> Closes #12706 from JoshRosen/SPARK-14925-published-poms-should-be-flattened.
*	[SPARK-14929] [SQL] Disable vectorized map for wide schemas & high-precision ↵	Sameer Agarwal	2016-04-26	4	-31/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	decimals ## What changes were proposed in this pull request? While the vectorized hash map in `TungstenAggregate` is currently supported for all primitive data types during partial aggregation, this patch only enables the hash map for a subset of cases that've been verified to show performance improvements on our benchmarks subject to an internal conf that sets an upper limit on the maximum length of the aggregate key/value schema. This list of supported use-cases should be expanded over time. ## How was this patch tested? This is no new change in functionality so existing tests should suffice. Performance tests were done on TPCDS benchmarks. Author: Sameer Agarwal <sameer@databricks.com> Closes #12710 from sameeragarwal/vectorized-enable.
*	[SPARK-12301][ML] Made all tree and ensemble classes not final	Joseph K. Bradley	2016-04-26	8	-16/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? There have been continuing requests (e.g., SPARK-7131) for allowing users to extend and modify MLlib models and algorithms. This PR makes tree and ensemble classes, Node types, and Split types in spark.ml no longer final. This matches most other spark.ml algorithms. Constructors for models are still private since we may need to refactor how stats are maintained in tree nodes. ## How was this patch tested? Existing unit tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12711 from jkbradley/final-trees.
*	[SPARK-14514][DOC] Add python example for VectorSlicer	Zheng RuiFeng	2016-04-26	2	-0/+52
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add the missing python example for VectorSlicer ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12282 from zhengruifeng/vecslicer_pe.
*	[SPARK-14907][MLLIB] Use repartition in GLMRegressionModel.save	Dongjoon Hyun	2016-04-26	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR changes `GLMRegressionModel.save` function like the following code that is similar to other algorithms' parquet write. ``` - val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF() - // TODO: repartition with 1 partition after SPARK-5532 gets fixed - dataRDD.write.parquet(Loader.dataPath(path)) + sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path)) ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12676 from dongjoon-hyun/SPARK-14907.
*	[SPARK-14853] [SQL] Support LeftSemi/LeftAnti in SortMergeJoinExec	Davies Liu	2016-04-26	10	-175/+194
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR updates SortMergeJoinExec to support LeftSemi/LeftAnti, so that it supports all the join types, the same as the other three join implementations: BroadcastHashJoinExec, ShuffledHashJoinExec, and BroadcastNestedLoopJoinExec. This PR also simplifies the join selection in SparkStrategy. ## How was this patch tested? Added new tests. Author: Davies Liu <davies@databricks.com> Closes #12668 from davies/smj_semi.
*	[SPARK-14903][SPARK-14071][ML][PYTHON] Revert : MLWritable.write property	Joseph K. Bradley	2016-04-26	2	-8/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? SPARK-14071 changed MLWritable.write to be a property. This reverts that change since there was not a good way to make MLReadable.read appear to be a property. ## How was this patch tested? existing unit tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12671 from jkbradley/revert-MLWritable-write-py.
*	[SPARK-11559][MLLIB] Make `runs` no effect in mllib.KMeans	Yanbo Liang	2016-04-26	4	-41/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We deprecated ```runs``` of mllib.KMeans in Spark 1.6 (SPARK-11358). In 2.0, we will make it have no effect (with warning messages). We did not remove ```setRuns/getRuns```, for better binary compatibility. This PR changes the uses of `runs` that appear in the public API. Usage inside ```KMeans.runAlgorithm()``` will be resolved in #10806. ## How was this patch tested? Existing unit tests. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12608 from yanboliang/spark-11559.
*	[MINOR] Follow-up to #12625	Andrew Or	2016-04-26	3	-6/+6
\| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? That patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes. Author: Andrew Or <andrew@databricks.com> Closes #12686 from andrewor14/visibility.
*	[SPARK-14912][SQL] Propagate data source options to Hadoop configuration	Reynold Xin	2016-04-26	12	-57/+99
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We currently have no way for users to propagate options to underlying libraries that rely on Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set them. This patch also propagates the user-specified options into the Hadoop Configuration. ## How was this patch tested? Used a mock data source implementation to test both the read path and the write path. Author: Reynold Xin <rxin@databricks.com> Closes #12688 from rxin/SPARK-14912.
*	[SPARK-14313][ML][SPARKR] AFTSurvivalRegression model persistence in SparkR	Yanbo Liang	2016-04-26	4	-3/+91
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? ```AFTSurvivalRegressionModel``` supports ```save/load``` in SparkR. ## How was this patch tested? Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #12685 from yanboliang/spark-14313.
*	[SPARK-14910][SQL] Native DDL Command Support for Describe Function in ↵	gatorsmile	2016-04-26	4	-3/+79
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Non-identifier Format #### What changes were proposed in this pull request? The existing `Describe Function` only supports function names in `identifier` format. This differs from how Hive behaves, which is why many `udf_abc` test cases in `HiveCompatibilitySuite` are not using our native DDL support. For example, - udf_not.q - udf_bitwise_not.q This PR resolves these issues. Now we support the `Describe Function` command with function names in the following formats: - `qualifiedName` (e.g., `db.func1`) - `STRING` (e.g., `'func1'`) - `comparisonOperator` (e.g., `<`) - `arithmeticOperator` (e.g., `+`) - `predicateOperator` (e.g., `or`) Note: before this PR, we only had native command support when the function name was in the `qualifiedName` format. #### How was this patch tested? Added test cases in `DDLSuite.scala`. Also manually verified that all the related test cases in `HiveCompatibilitySuite` passed. Author: gatorsmile <gatorsmile@gmail.com> Closes #12679 from gatorsmile/descFunction.
*	[MINOR][DOCS] Minor typo fixes	Jacek Laskowski	2016-04-26	10	-15/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Minor typo fixes (too minor to deserve a separate JIRA) ## How was this patch tested? local build Author: Jacek Laskowski <jacek@japila.pl> Closes #12469 from jaceklaskowski/minor-typo-fixes.
*	[SPARK-14756][CORE] Use parseLong instead of valueOf	Azeem Jiva	2016-04-26	5	-13/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Use Long.parseLong, which returns a primitive. Using a series of append() calls avoids creating an extra StringBuilder. ## How was this patch tested? Unit tests Author: Azeem Jiva <azeemj@gmail.com> Closes #12520 from javawithjiva/minor.
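A short Java sketch of the two points in this commit message: `Long.parseLong` yields a primitive `long`, whereas `Long.valueOf` yields a boxed `Long`, and chained `append()` calls avoid building an intermediate string. The class and method names below are hypothetical and are not taken from the patch.

```java
public class ParseLongSketch {
    // Long.parseLong returns a primitive long directly.
    static long parsePrimitive(String s) {
        return Long.parseLong(s);
    }

    // Long.valueOf returns a boxed Long; returning it as long forces unboxing.
    static long parseBoxedThenUnbox(String s) {
        Long boxed = Long.valueOf(s);
        return boxed; // auto-unboxing happens here
    }

    // Chained append() calls write into a single StringBuilder instead of
    // concatenating with "+" inside one append(), which builds a temporary string.
    static String describe(String name, long value) {
        return new StringBuilder()
            .append(name)
            .append('=')
            .append(value)
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(parsePrimitive("42") + parseBoxedThenUnbox("8"));
        System.out.println(describe("bytes", parsePrimitive("1024")));
    }
}
```

Both parsing methods return the same value; the difference is only whether a wrapper object is involved along the way.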
*	[SPARK-14889][SPARK CORE] scala.MatchError: NONE (of class ↵	Subhobrata Dey	2016-04-26	2	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	scala.Enumeration) when spark.scheduler.mode=NONE ## What changes were proposed in this pull request? Handling exception for the below mentioned issue ``` ➜ spark git:(master) ✗ ./bin/spark-shell -c spark.scheduler.mode=NONE 16/04/25 09:15:00 ERROR SparkContext: Error initializing SparkContext. scala.MatchError: NONE (of class scala.Enumeration$Val) at org.apache.spark.scheduler.Pool.<init>(Pool.scala:53) at org.apache.spark.scheduler.TaskSchedulerImpl.initialize(TaskSchedulerImpl.scala:131) at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2352) at org.apache.spark.SparkContext.<init>(SparkContext.scala:492) ``` The exception now looks like ``` java.lang.RuntimeException: The scheduler mode NONE is not supported by Spark. ``` ## How was this patch tested? manual tests Author: Subhobrata Dey <sbcd90@gmail.com> Closes #12666 from sbcd90/schedulerModeIssue.
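The general pattern behind this fix, rejecting an unsupported mode with a descriptive error instead of letting a pattern match fail, might look like the following in Java. This is only a sketch: the enum and the `describePool` method are hypothetical Java stand-ins, and the actual change lives in Spark's Scala scheduler code. The error text mirrors the message quoted in the commit.

```java
public class SchedulerModeSketch {
    // Hypothetical Java mirror of the scheduler modes named in the commit message.
    enum SchedulingMode { FAIR, FIFO, NONE }

    // Returns a label for supported modes and rejects anything else
    // with an explicit, readable error.
    static String describePool(SchedulingMode mode) {
        switch (mode) {
            case FAIR:
                return "fair pool";
            case FIFO:
                return "fifo pool";
            default:
                throw new RuntimeException(
                    "The scheduler mode " + mode + " is not supported by Spark.");
        }
    }

    public static void main(String[] args) {
        System.out.println(describePool(SchedulingMode.FIFO));
        try {
            describePool(SchedulingMode.NONE);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```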
*	Fix dynamic allocation docs to address cached data.	Michael Gummelt	2016-04-26	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Documentation changes ## How was this patch tested? No tests Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #12664 from mgummelt/fix-dynamic-docs.
*	[SPARK-13962][ML] spark.ml Evaluators should support other numeric types for ↵	BenFradet	2016-04-26	8	-51/+88
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	label ## What changes were proposed in this pull request? Made BinaryClassificationEvaluator, MulticlassClassificationEvaluator and RegressionEvaluator accept all numeric types for label ## How was this patch tested? Unit tests Author: BenFradet <benjamin.fradet@gmail.com> Closes #12500 from BenFradet/SPARK-13962.
*	[HOTFIX] Fix the problem for real this time.	Reynold Xin	2016-04-25	1	-5/+5
\|
*	[HOTFIX] Fix compilation	Reynold Xin	2016-04-25	1	-4/+4
\|
*	[SPARK-14861][SQL] Replace internal usages of SQLContext with SparkSession	Andrew Or	2016-04-25	101	-715/+785
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore. In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait. Reviewers: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little. ## How was this patch tested? No change in functionality intended. Author: Andrew Or <andrew@databricks.com> Closes #12625 from andrewor14/spark-session-refactor.
*	[SPARK-14904][SQL] Put removed HiveContext in compatibility module	Andrew Or	2016-04-25	3	-0/+170
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This is for users who can't upgrade and need to continue to use HiveContext. ## How was this patch tested? Added some basic tests for sanity check. This is based on #12672 and closes #12672. Author: Andrew Or <andrew@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #12682 from rxin/add-back-hive-context.
*	[SPARK-14870][SQL][FOLLOW-UP] Move decimalDataWithNulls in ↵	Sameer Agarwal	2016-04-25	2	-15/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	DataFrameAggregateSuite ## What changes were proposed in this pull request? Minor followup to https://github.com/apache/spark/pull/12651 ## How was this patch tested? Test-only change Author: Sameer Agarwal <sameer@databricks.com> Closes #12674 from sameeragarwal/tpcds-fix-2.
*	[SPARK-14902][SQL] Expose RuntimeConfig in SparkSession	Andrew Or	2016-04-25	38	-81/+131
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `RuntimeConfig` is the new user-facing API in 2.0 added in #11378. Until now, however, it's been dead code. This patch uses `RuntimeConfig` in `SessionState` and exposes that through the `SparkSession`. ## How was this patch tested? New test in `SQLContextSuite`. Author: Andrew Or <andrew@databricks.com> Closes #12669 from andrewor14/use-runtime-conf.
*	[SPARK-14888][SQL] UnresolvedFunction should use FunctionIdentifier	Reynold Xin	2016-04-25	13	-92/+117
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch changes UnresolvedFunction and UnresolvedGenerator to use a FunctionIdentifier rather than just a String for function name. Also changed SessionCatalog to accept FunctionIdentifier in lookupFunction. ## How was this patch tested? Updated related unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12659 from rxin/SPARK-14888.
*	[SPARK-14828][SQL] Start SparkSession in REPL instead of SQLContext	Andrew Or	2016-04-25	6	-50/+65
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? ``` Spark context available as 'sc' (master = local[*], app id = local-1461283768192). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51) Type in expressions to have them evaluated. Type :help for more information. scala> sql("SHOW TABLES").collect() 16/04/21 17:09:39 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 16/04/21 17:09:39 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException res0: Array[org.apache.spark.sql.Row] = Array([src,false]) scala> sql("SHOW TABLES").collect() res1: Array[org.apache.spark.sql.Row] = Array([src,false]) scala> spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))) res2: org.apache.spark.sql.DataFrame = [_1: int, _2: int] ``` Hive things are loaded lazily. ## How was this patch tested? Manual. Author: Andrew Or <andrew@databricks.com> Closes #12589 from andrewor14/spark-session-repl.
*	[SPARK-14312][ML][SPARKR] NaiveBayes model persistence in SparkR	Yanbo Liang	2016-04-25	6	-5/+162
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? SparkR ```NaiveBayesModel``` supports ```save/load``` by the following API: ``` df <- createDataFrame(sqlContext, infert) model <- naiveBayes(education ~ ., df, laplace = 0) ml.save(model, path) model2 <- ml.load(path) ``` ## How was this patch tested? Add unit tests. cc mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #12573 from yanboliang/spark-14312.
*	[SPARK-13739][SQL] Push Predicate Through Window	gatorsmile	2016-04-25	5	-33/+294
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	#### What changes were proposed in this pull request? For performance, predicates can be pushed through Window if and only if the following conditions are satisfied: 1. All the expressions are part of window partitioning key. The expressions can be compound. 2. Deterministic #### How was this patch tested? TODO: - [X] DSL needs to be modified for window - [X] more tests will be added. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11635 from gatorsmile/pushPredicateThroughWindow.
*	[SPARK-14721][SQL] Remove HiveContext (part 2)	Andrew Or	2016-04-25	21	-110/+86
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class. Note: A couple of things will break after this patch. These will be fixed separately. - the python HiveContext - all the documentation / comments referencing HiveContext - there will be no more HiveContext in the REPL (fixed by #12589) ## How was this patch tested? No change in functionality. Author: Andrew Or <andrew@databricks.com> Closes #12585 from andrewor14/delete-hive-context.
*	[SPARK-14731][shuffle]Revert SPARK-12130 to make 2.0 shuffle service ↵	Lianhui Wang	2016-04-25	10	-39/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	compatible with 1.x ## What changes were proposed in this pull request? SPARK-12130 made the 2.0 shuffle service incompatible with 1.x. So, following the discussion at [http://apache-spark-developers-list.1001551.n3.nabble.com/YARN-Shuffle-service-and-its-compatibility-td17222.html](url), we should maintain compatibility between Spark 1.x and Spark 2.x's shuffle service. I moved the string comparison into executor registration to avoid a string comparison in getBlockData every time. ## How was this patch tested? N/A Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #12568 from lianhuiwang/SPARK-14731.
*	[SPARK-10574][ML][MLLIB] HashingTF supports MurmurHash3	Yanbo Liang	2016-04-25	5	-30/+162
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Per the discussion at [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and make it the default hash algorithm. We should also expose a set/get API for ```hashAlgorithm```, so that users can choose the hash method. Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in follow-up work. ## How was this patch tested? unit tests. cc jkbradley MLnick Author: Yanbo Liang <ybliang8@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #12498 from yanboliang/spark-10574.
*	[SPARK-14892][SQL][TEST] Disable the HiveCompatibilitySuite test case for ↵	gatorsmile	2016-04-25	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	INPUTDRIVER and OUTPUTDRIVER. #### What changes were proposed in this pull request? Disable the test case involving INPUTDRIVER and OUTPUTDRIVER, which are not supported #### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #12662 from gatorsmile/disableInOutDriver.
*	[MINOR][ML][PYTHON][DOC] Remove use of JavaMLWriter/Reader in public Python ↵	Joseph K. Bradley	2016-04-25	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	API docs ## What changes were proposed in this pull request? Removed instances of JavaMLWriter, JavaMLReader appearing in public Python API docs ## How was this patch tested? n/a Author: Joseph K. Bradley <joseph@databricks.com> Closes #12542 from jkbradley/javamlwriter-doc.
*	[SPARK-14433][PYSPARK][ML] PySpark ml GaussianMixture	wm624@hotmail.com	2016-04-25	2	-3/+170
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add Python API in ML for GaussianMixture ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Add doctest and test cases are the same as mllib Python tests ./dev/lint-python PEP8 checks passed. rm -rf _build/* pydoc checks passed. ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log Will test against the following Python executables: ['python2.7'] Will test the following Python modules: ['pyspark-ml'] Finished test(python2.7): pyspark.ml.evaluation (18s) Finished test(python2.7): pyspark.ml.clustering (40s) Finished test(python2.7): pyspark.ml.classification (49s) Finished test(python2.7): pyspark.ml.recommendation (44s) Finished test(python2.7): pyspark.ml.feature (64s) Finished test(python2.7): pyspark.ml.regression (45s) Finished test(python2.7): pyspark.ml.tuning (30s) Finished test(python2.7): pyspark.ml.tests (56s) Tests passed in 106 seconds Author: wm624@hotmail.com <wm624@hotmail.com> Closes #12402 from wangmiao1981/gmm.
*	[SPARK-14744][EXAMPLES] Clean up examples packaging, remove outdated examples.	Marcelo Vanzin	2016-04-25	11	-1125/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	First, make all dependencies in the examples module provided, and explicitly list a couple of ones that somehow are promoted to compile by maven. This means that to run streaming examples, the streaming connector package needs to be provided to run-examples using --packages or --jars, just like regular apps. Also, remove a couple of outdated examples. HBase has had Spark bindings for a while and is even including them in the HBase distribution in the next version, making the examples obsolete. The same applies to Cassandra, which seems to have a proper Spark binding library already. I just tested the build, which passes, and ran SparkPi. The examples jars directory now has only two jars: ``` $ ls -1 examples/target/scala-2.11/jars/ scopt_2.11-3.3.0.jar spark-examples_2.11-2.0.0-SNAPSHOT.jar ``` Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #12544 from vanzin/SPARK-14744.
*	[SPARK-14768][ML][PYSPARK] removed expectedType from Param __init__()	Jason Lee	2016-04-25	1	-8/+1
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Removed expectedType arg from PySpark Param __init__, as suggested by the JIRA. ## How was this patch tested? Manually looked through all places that use Param. Compiled and ran all ML PySpark test cases before and after the fix. Author: Jason Lee <cjlee@us.ibm.com> Closes #12581 from jasoncl/SPARK-14768.
*	[SPARK-14875][SQL] Makes OutputWriterFactory.newInstance public	Cheng Lian	2016-04-25	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again, since 3rd party data sources like spark-avro depend on it. ## How was this patch tested? N/A Author: Cheng Lian <lian@databricks.com> Closes #12652 from liancheng/spark-14875.
*	[SPARK-14636] Add minimum memory checks for drivers and executors	Peter Ableda	2016-04-25	1	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Implement the same memory size validations for the StaticMemoryManager (Legacy) as the UnifiedMemoryManager has. ## How was this patch tested? Manual tests were done in CDH cluster. Test with small executor memory: ` spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client --master yarn --executor-memory 15m --conf spark.memory.useLegacyMode=true /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples*.jar 10 ` Exception thrown: ``` ERROR spark.SparkContext: Error initializing SparkContext. java.lang.IllegalArgumentException: Executor memory 15728640 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration. at org.apache.spark.memory.StaticMemoryManager$.org$apache$spark$memory$StaticMemoryManager$$getMaxExecutionMemory(StaticMemoryManager.scala:127) at org.apache.spark.memory.StaticMemoryManager.<init>(StaticMemoryManager.scala:46) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:352) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:289) at org.apache.spark.SparkContext.<init>(SparkContext.scala:462) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ``` Author: Peter Ableda <peter.ableda@cloudera.com> Closes #12395 from peterableda/SPARK-14636.
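The kind of validation this commit describes can be sketched in Java as follows. This is an illustration only: the class and method are hypothetical, the minimum is passed in as a parameter rather than derived the way Spark derives it, and the two byte counts are simply the values that appear in the logged exception above.

```java
public class MinMemoryCheckSketch {
    // Rejects an executor memory setting below the given minimum (both in bytes).
    // Hypothetical helper; the message text follows the exception quoted in the log.
    static void checkExecutorMemory(long executorMemory, long minMemory) {
        if (executorMemory < minMemory) {
            throw new IllegalArgumentException(
                "Executor memory " + executorMemory + " must be at least " + minMemory + ". "
                + "Please increase executor memory using the --executor-memory option "
                + "or spark.executor.memory in Spark configuration.");
        }
    }

    public static void main(String[] args) {
        long requested = 15L * 1024 * 1024;  // 15m from the test command = 15728640 bytes
        long minimum = 450L * 1024 * 1024;   // 471859200 bytes, as in the logged message
        try {
            checkExecutorMemory(requested, minimum);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast at startup with a message that names the relevant option is easier to diagnose than an out-of-memory error later in the job.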
*	[SPARK-14758][ML] Add checking for StepSize and Tol	Zheng RuiFeng	2016-04-25	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? add the checking for StepSize and Tol in sharedParams ## How was this patch tested? Unit tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12530 from zhengruifeng/ml_args_checking.
*	[SPARK-14790] Always run scalastyle on sbt compile and test	Eric Liang	2016-04-25	2	-1/+84
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Sbt compile and test should also run scalastyle. This makes it less likely you forget to run scalastyle and fail in jenkins. Scalastyle results are cached for efficiency. This patch was originally written by ahirreddy; I just fixed it up to work with scalastyle 0.8.0. ## How was this patch tested? Tested manually with `build/sbt package`. Author: Eric Liang <ekl@databricks.com> Closes #12555 from ericl/scalastyle.
*	[SPARK-14870] [SQL] Fix NPE in TPCDS q14a	Sameer Agarwal	2016-04-24	4	-3/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR fixes a bug in `TungstenAggregate` that manifests while aggregating by keys over nullable `BigDecimal` columns. This causes a null pointer exception while executing TPCDS q14a. ## How was this patch tested? 1. Added regression test in `DataFrameAggregateSuite`. 2. Verified that TPCDS q14a works Author: Sameer Agarwal <sameer@databricks.com> Closes #12651 from sameeragarwal/tpcds-fix.
*	[SPARK-14881] [PYTHON] [SPARKR] pyspark and sparkR shell default log level ↵	felixcheung	2016-04-24	2	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	should match spark-shell/Scala ## What changes were proposed in this pull request? Change default logging to WARN for pyspark shell and sparkR shell for a much cleaner environment. ## How was this patch tested? Manually running pyspark and sparkR shell Author: felixcheung <felixcheung_m@hotmail.com> Closes #12648 from felixcheung/pylogging.