* [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging
  (Wenchen Fan, 2016-03-17, 398 files changed, -344/+576)

  ## What changes were proposed in this pull request?
  Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.

  ## How was this patch tested?
  Existing tests.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #11764 from cloud-fan/logger.
* [SPARK-13901][CORE] Correct the logDebug information when jumping to the next locality level
  (trueyao, 2016-03-17, 1 file changed, -2/+2)

  JIRA issue: https://issues.apache.org/jira/browse/SPARK-13901

  In the getAllowedLocalityLevel method of TaskSetManager, the logDebug message emitted when jumping to the next locality level is wrong, so this patch fixes it.

  Author: trueyao <501663994@qq.com>
  Closes #11719 from trueyao/logDebug-localityWait.
* [SPARK-13629][ML] Add binary toggle Param to CountVectorizer
  (Yuhao Yang, 2016-03-17, 2 files changed, -2/+46)

  ## What changes were proposed in this pull request?
  It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

  If set, then all non-zero counts will be set to 1.

  ## How was this patch tested?
  Unit tests.

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #11536 from hhbyyh/cvToggle.
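  A minimal usage sketch of the new toggle, assuming it is exposed through a `setBinary` setter (the setter name and the ambient `sqlContext` are assumptions for the illustration):

  ```scala
  import org.apache.spark.ml.feature.CountVectorizer

  // Hypothetical spark-shell session; `sqlContext` is assumed to exist.
  val df = sqlContext.createDataFrame(Seq(
    (0, Seq("a", "a", "b"))
  )).toDF("id", "words")

  val cv = new CountVectorizer()
    .setInputCol("words")
    .setOutputCol("features")
    .setBinary(true) // assumed setter: caps every non-zero term count at 1.0

  cv.fit(df).transform(df).show(false)
  // with binary on, the count for "a" is 1.0 rather than 2.0
  ```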
* [MINOR][DOC] Add JavaStreamingTestExample
  (Zheng RuiFeng, 2016-03-17, 2 files changed, -0/+128)

  ## What changes were proposed in this pull request?
  Add the Java example of StreamingTest.

  ## How was this patch tested?
  Manual tests in the CLI:

  bin/run-example mllib.JavaStreamingTestExample dataDir 5 100

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #11776 from zhengruifeng/streaming_je.
* Revert "[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ↵Davies Liu2016-03-164-26/+14
| | | | | | ColumnPruning and EliminateOperator" This reverts commit 99bd2f0e94657687834c5c59c4270c1484c9f595.
* [SPARK-13948] MiMa check should catch if the visibility changes to private
  (Josh Rosen, 2016-03-16, 4 files changed, -26/+42)

  MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense: we should only be ignoring classes which were `private` in the previous Spark version, not classes which became private in the current version. This patch updates `dev/mima` to only generate excludes with respect to the previous artifacts that MiMa checks against. It also updates `MimaBuild` so that `excludeClass` only applies directly to the class being excluded and not to its companion object (since a class and its companion object can have different accessibility).

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #11774 from JoshRosen/SPARK-13948.
* [SPARK-13403][SQL] Pass hadoopConfiguration to HiveConf constructors
  (Ryan Blue, 2016-03-16, 5 files changed, -5/+34)

  This commit updates the HiveContext so that sc.hadoopConfiguration is used to instantiate its internal instances of HiveConf. I tested this by overriding the S3 FileSystem implementation from spark-defaults.conf as "spark.hadoop.fs.s3.impl" (to avoid [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810)).

  Author: Ryan Blue <blue@apache.org>
  Closes #11273 from rdblue/SPARK-13403-new-hive-conf-from-hadoop-conf.
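  A hedged sketch of the mechanism the test above relies on: `spark.hadoop.*` keys are copied into `sc.hadoopConfiguration`, which after this patch also seeds the internal HiveConf instances. The FileSystem class below is illustrative, not taken from the commit:

  ```scala
  import org.apache.spark.SparkConf

  // Any spark.hadoop.* key is stripped of its prefix and copied into
  // sc.hadoopConfiguration; after this patch the HiveConf instances inside
  // HiveContext are built from that same Configuration.
  val conf = new SparkConf()
    .set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  // A HiveContext created on a SparkContext with this conf now sees the
  // overridden fs.s3.impl as well.
  ```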
* [SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with simple types
  (Josh Rosen, 2016-03-16, 21 files changed, -73/+135)

  Because ClassTags are available when constructing ShuffledRDD, we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo. This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers.

  In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places).

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #11755 from JoshRosen/automatically-pick-best-serializer.
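  A simplified, Spark-free sketch of the selection rule described above (the real `SerializerManager` logic is richer; this only illustrates the ClassTag test):

  ```scala
  import scala.reflect.ClassTag

  // Kryo is considered safe when the element type is a primitive, an array
  // of primitives, or String; anything else falls back to JavaSerializer.
  def canUseKryo(ct: ClassTag[_]): Boolean = {
    val cls = ct.runtimeClass
    cls.isPrimitive ||
      (cls.isArray && cls.getComponentType.isPrimitive) ||
      cls == classOf[String]
  }

  println(canUseKryo(implicitly[ClassTag[Int]]))         // true
  println(canUseKryo(implicitly[ClassTag[Array[Long]]])) // true
  println(canUseKryo(implicitly[ClassTag[Seq[Int]]]))    // false
  ```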
* [SPARK-12855][MINOR][SQL][DOC][TEST] Remove spark.sql.dialect from doc and test
  (Daoyuan Wang, 2016-03-16, 3 files changed, -14/+1)

  ## What changes were proposed in this pull request?
  Since the developer API for a pluggable parser has been removed in #10801, docs should be updated accordingly.

  ## How was this patch tested?
  This patch will not affect the real code path.

  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #11758 from adrian-wang/spark12855.
* [MINOR][SQL][BUILD] Remove duplicated lines
  (Dongjoon Hyun, 2016-03-16, 2 files changed, -3/+0)

  ## What changes were proposed in this pull request?
  This PR removes three minor duplicated lines. The first one causes the following unreachable-code warning:

  ```
  JoinSuite.scala:52: unreachable code
  [warn] case j: BroadcastHashJoin => j
  ```

  The other two are just consecutive repetitions in a `Seq` of MiMa filters.

  ## How was this patch tested?
  Pass the existing Jenkins test.

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #11773 from dongjoon-hyun/remove_duplicated_line.
* [SPARK-13118][SQL] Expression encoding for optional synthetic classes
  (Jakob Odersky, 2016-03-16, 3 files changed, -3/+37)

  ## What changes were proposed in this pull request?
  Fix expression generation for optional types. Standard Java reflection causes issues when dealing with synthetic Scala objects (things that do not map to Java and thus contain a dollar sign in their name). This patch introduces Scala reflection in such cases. This patch also adds a regression test for Dataset's handling of classes defined in package objects (which was the initial purpose of this PR).

  ## How was this patch tested?
  A new test in ExpressionEncoderSuite that tests optional inner classes, and a regression test for Dataset's handling of package objects.

  Author: Jakob Odersky <jakob@odersky.com>
  Closes #11708 from jodersky/SPARK-13118-package-objects.
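  A hedged illustration of the failure mode this patch addresses: classes defined in package objects look synthetic to Java reflection (their owner carries a `$` in its name), so encoder derivation needs Scala reflection. All names below are invented for the example:

  ```scala
  // In a compiled file of its own: a case class living inside a package object.
  package object shapes {
    case class Point(x: Int, y: Int)
  }

  // With this patch, deriving an encoder for such a class works, e.g. in a
  // spark-shell session (sqlContext assumed):
  //   import sqlContext.implicits._
  //   val ds = Seq(shapes.Point(1, 2)).toDS()
  ```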
* [SPARK-13873][SQL] Avoid copy of UnsafeRow when there is no join in whole stage codegen
  (Davies Liu, 2016-03-16, 9 files changed, -8/+35)

  ## What changes were proposed in this pull request?
  We need to copy the UnsafeRow since a Join could produce multiple rows from single input rows. We could avoid that if there is no join (or the join will not produce multiple rows) inside WholeStageCodegen. Updated the benchmark for `collect`; we could see a 20-30% speedup.

  ## How was this patch tested?
  Existing unit tests.

  Author: Davies Liu <davies@databricks.com>
  Closes #11740 from davies/avoid_copy2.
* [SPARK-13719][SQL] Parse JSON rows having an array type and a struct type in the same field
  (hyukjinkwon, 2016-03-16, 4 files changed, -14/+48)

  ## What changes were proposed in this pull request?
  https://github.com/apache/spark/pull/2400 added support for parsing JSON rows wrapped with an array. However, this throws an exception when the given data contains array data and struct data in the same field, as below:

  ```json
  {"a": {"b": 1}}
  {"a": []}
  ```

  with the schema given as:

  ```scala
  val schema = StructType(
    StructField("a", StructType(
      StructField("b", StringType) :: Nil
    )) :: Nil)
  ```

  - **Before**

  ```scala
  sqlContext.read.schema(schema).json(path).show()
  ```

  ```
  Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
      at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
      at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
  ...
  ```

  - **After**

  ```scala
  sqlContext.read.schema(schema).json(path).show()
  ```

  ```
  +----+
  |   a|
  +----+
  | [1]|
  |null|
  +----+
  ```

  For other data types, the given values are converted to `null` in this case; only this case throws an exception. This PR makes the support for wrapped rows apply only at the top level.

  ## How was this patch tested?
  Unit tests were used, and `./dev/run_tests` for code style tests.

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #11752 from HyukjinKwon/SPARK-3308-follow-up.
* [SPARK-13923][SQL] Implement SessionCatalog
  (Andrew Or, 2016-03-16, 19 files changed, -202/+1604)

  ## What changes were proposed in this pull request?
  As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`.

  A recent patch #11573 parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands.

  ## How was this patch tested?
  800+ lines of tests in `SessionCatalogSuite`.

  Author: Andrew Or <andrew@databricks.com>
  Closes #11750 from andrewor14/temp-catalog.
* [SPARK-13761][ML] Deprecate validateParams
  (Yuhao Yang, 2016-03-16, 33 files changed, -89/+36)

  ## What changes were proposed in this pull request?
  Deprecate the validateParams() method here: https://github.com/apache/spark/blob/035d3acdf3c1be5b309a861d5c5beb803b946b5e/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L553

  Move all functionality in overridden methods to transformSchema(). Check docs to make sure they indicate complex Param interaction checks should be done in transformSchema.

  ## How was this patch tested?
  Unit tests.

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #11620 from hhbyyh/depreValid.
* [SPARK-11011][SQL] Narrow type of UDT serialization
  (Jakob Odersky, 2016-03-16, 10 files changed, -52/+32)

  ## What changes were proposed in this pull request?
  Narrow down the parameter type of `UserDefinedType#serialize()`. Currently the parameter type is `Any`; however, it would logically make more sense to narrow it down to the type of the actual user-defined type.

  ## How was this patch tested?
  Existing tests were successfully run on a local machine.

  Author: Jakob Odersky <jakob@odersky.com>
  Closes #11379 from jodersky/SPARK-11011-udt-types.
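  The shape of the change, illustrated without Spark internals (all names here are invented for the sketch): narrowing the serializer's parameter from `Any` to the concrete user type replaces a runtime cast with a compile-time check.

  ```scala
  case class Point(x: Double, y: Double)

  // Before: the parameter type is Any, so implementations must cast.
  trait LooseSerializer { def serialize(obj: Any): Array[Double] }

  // After: the parameter is the actual user type.
  trait TypedSerializer[T] { def serialize(obj: T): Array[Double] }

  object PointSerializer extends TypedSerializer[Point] {
    // No obj.asInstanceOf[Point] needed; the compiler enforces the type.
    def serialize(p: Point): Array[Double] = Array(p.x, p.y)
  }
  ```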
* [SPARK-13869][SQL] Remove redundant conditions while combining filters
  (Sameer Agarwal, 2016-03-16, 2 files changed, -4/+26)

  ## What changes were proposed in this pull request?
  **[I'll link it to the JIRA once ASF JIRA is back online]**

  This PR modifies the existing `CombineFilters` rule to remove redundant conditions while combining individual filter predicates. For instance, queries of the form `table.where('a === 1 && 'b === 1).where('a === 1 && 'c === 1)` will now be optimized to `table.where('a === 1 && 'b === 1 && 'c === 1)` (instead of `table.where('a === 1 && 'a === 1 && 'b === 1 && 'c === 1)`).

  ## How was this patch tested?
  Unit test in `FilterPushdownSuite`.

  Author: Sameer Agarwal <sameer@databricks.com>
  Closes #11670 from sameeragarwal/combine-filters.
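  A simplified, Spark-free sketch of the dedup step (the real rule operates on Catalyst `Expression` trees, not strings): split each conjunction into predicates and drop duplicates before recombining.

  ```scala
  // Predicates are modeled as strings purely for illustration.
  def combineFilters(lower: Seq[String], upper: Seq[String]): Seq[String] =
    (lower ++ upper).distinct

  println(combineFilters(Seq("a = 1", "b = 1"), Seq("a = 1", "c = 1")))
  // List(a = 1, b = 1, c = 1) -- the duplicate "a = 1" is dropped
  ```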
* [SPARK-13871][SQL] Support for inferring filters from data constraints
  (Sameer Agarwal, 2016-03-16, 2 files changed, -62/+63)

  ## What changes were proposed in this pull request?
  This PR generalizes the `NullFiltering` optimizer rule in catalyst to `InferFiltersFromConstraints`, which can automatically infer all relevant filters based on an operator's constraints while making sure of 2 things: (a) no redundant filters are generated, and (b) filters that do not contribute to any further optimizations are not generated.

  ## How was this patch tested?
  Extended all tests in `InferFiltersFromConstraintsSuite` (initially based on `NullFilteringSuite`) to test filter inference in `Filter` and `Join` operators. In particular, the two tests (`single inner join with pre-existing filters: filter out values on either side` and `multiple inner joins: filter out values on all sides on equi-join keys`) attempt to highlight/test the real potential of this rule for join optimization.

  Author: Sameer Agarwal <sameer@databricks.com>
  Closes #11665 from sameeragarwal/infer-filters.
* [SPARK-13922][SQL] Filter rows with null attributes in vectorized parquet reader
  (Sameer Agarwal, 2016-03-16, 3 files changed, -5/+146)

  ## What changes were proposed in this pull request?
  It's common for many SQL operators to not care about reading `null` values for correctness. Currently, this is achieved by performing `isNotNull` checks (for all relevant columns) on a per-row basis. Pushing these null filters in the vectorized parquet reader should bring considerable benefits (especially for cases when the underlying data doesn't contain any nulls or contains all nulls).

  ## How was this patch tested?

  ```
  Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz

  String with Nulls Scan (0%):     Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
  ----------------------------------------------------------------------------------------
  SQL Parquet Vectorized                1229 / 1648         8.5         117.2       1.0X
  PR Vectorized                          833 /  846        12.6          79.4       1.5X
  PR Vectorized (Null Filtering)         732 /  782        14.3          69.8       1.7X

  String with Nulls Scan (50%):    Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
  ----------------------------------------------------------------------------------------
  SQL Parquet Vectorized                 995 / 1053        10.5          94.9       1.0X
  PR Vectorized                          732 /  772        14.3          69.8       1.4X
  PR Vectorized (Null Filtering)         725 /  790        14.5          69.1       1.4X

  String with Nulls Scan (95%):    Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
  ----------------------------------------------------------------------------------------
  SQL Parquet Vectorized                 326 /  333        32.2          31.1       1.0X
  PR Vectorized                          190 /  200        55.1          18.2       1.7X
  PR Vectorized (Null Filtering)         168 /  172        62.2          16.1       1.9X
  ```

  Author: Sameer Agarwal <sameer@databricks.com>
  Closes #11749 from sameeragarwal/perf-testing.
* [SPARK-13942][CORE][DOCS] Remove Shark-related docs for 2.x
  (Dongjoon Hyun, 2016-03-16, 1 file changed, -45/+0)

  ## What changes were proposed in this pull request?
  `Shark` was merged into `Spark SQL` in [July 2014](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html). The following sections seem to be the only legacy, so for Spark 2.x we had better clean up those docs.

  **Migration Guide**
  ```
  - ## Migration Guide for Shark Users
  - ...
  - ### Scheduling
  - ...
  - ### Reducer number
  - ...
  - ### Caching
  ```

  ## How was this patch tested?
  Pass the Jenkins test.

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #11770 from dongjoon-hyun/SPARK-13942.
* [SPARK-13034] PySpark ml.classification support export/import
  (GayathriMurali, 2016-03-16, 2 files changed, -9/+61)

  ## What changes were proposed in this pull request?
  Add export/import for all estimators and transformers (which have a Scala implementation) under pyspark/ml/classification.py.

  ## How was this patch tested?
  ./python/run-tests
  ./dev/lint-python
  Unit tests added to check persistence in Logistic Regression.

  Author: GayathriMurali <gayathri.m.softie@gmail.com>
  Closes #11707 from GayathriMurali/SPARK-13034.
* [SPARK-13927][MLLIB] Add row/column iterator to local matrices
  (Xiangrui Meng, 2016-03-16, 3 files changed, -1/+80)

  ## What changes were proposed in this pull request?
  Add a row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly.

  ## How was this patch tested?
  Unit tests on sparse and dense matrices.

  cc: dbtsai

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #11757 from mengxr/SPARK-13927.
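  A small usage sketch, assuming the new API surfaces as `rowIter`/`colIter` on `Matrix` (the method names are an assumption based on the PR description):

  ```scala
  import org.apache.spark.mllib.linalg.Matrices

  // 2 x 3 dense matrix; values are given in column-major order.
  val m = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

  m.rowIter.foreach(println) // [1.0,3.0,5.0] then [2.0,4.0,6.0]
  m.colIter.foreach(println) // [1.0,2.0], [3.0,4.0], [5.0,6.0]
  ```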
* [SPARK-11888][ML] Decision tree persistence in spark.ml
  (Joseph K. Bradley, 2016-03-16, 23 files changed, -71/+428)

  ### What changes were proposed in this pull request?
  Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel
  * The shared implementation is in treeModels.scala
  * I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files.

  Other changes:
  * Made CategoricalSplit.numCategories public (to use in persistence)
  * Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument. This caused an error in LDASuite, which I fixed.

  ### How was this patch tested?
  Persistence is tested via unit tests. For each algorithm, there are 2 non-trivial trees (depth 2). One is built with continuous features, and one with categorical; this ensures that both types of splits are tested.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #11581 from jkbradley/dt-io.
* [SPARK-13613][ML] Provide ignored tests to export test dataset into CSV format
  (Yanbo Liang, 2016-03-16, 4 files changed, -33/+97)

  ## What changes were proposed in this pull request?
  Provide ignored test cases to export the test dataset into CSV format in `LinearRegressionSuite`, `LogisticRegressionSuite`, `AFTSurvivalRegressionSuite` and `GeneralizedLinearRegressionSuite`, so users can validate the training accuracy compared with R's glm, glmnet and survival package. cc mengxr

  ## How was this patch tested?
  The test suite is ignored, but I have enabled all these cases offline and it works as expected.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #11463 from yanboliang/spark-13613.
* [SPARK-13038][PYSPARK] Add load/save to pipeline
  (Xusen Yin, 2016-03-16, 6 files changed, -89/+317)

  ## What changes were proposed in this pull request?
  JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038

  1. Add load/save to PySpark Pipeline and PipelineModel
  2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`.

  ## How was this patch tested?
  Test with doctest.

  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #11683 from yinxusen/SPARK-13038-only.
* [SPARK-12721][SQL] SQL Generation for Script Transformation
  (gatorsmile, 2016-03-16, 3 files changed, -0/+134)

  #### What changes were proposed in this pull request?
  This PR is to convert to SQL from analyzed logical plans containing the operator `ScriptTransformation`. For example, below is the SQL containing `Transform`:

  ```
  SELECT TRANSFORM (a, b, c, d) USING 'cat' FROM parquet_t2
  ```

  Its logical plan is like:

  ```
  ScriptTransformation [a#210L,b#211L,c#212L,d#213L], cat, [key#208,value#209], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true)
  +- SubqueryAlias parquet_t2
     +- Relation[a#210L,b#211L,c#212L,d#213L] ParquetRelation
  ```

  The generated SQL will be like:

  ```
  SELECT TRANSFORM (`parquet_t2`.`a`, `parquet_t2`.`b`, `parquet_t2`.`c`, `parquet_t2`.`d`)
  USING 'cat' AS (`key` string, `value` string)
  FROM `default`.`parquet_t2`
  ```

  #### How was this patch tested?
  Seven test cases are added to `LogicalPlanToSQLSuite`.

  Author: gatorsmile <gatorsmile@gmail.com>
  Author: xiaoli <lixiao1983@gmail.com>
  Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
  Closes #11503 from gatorsmile/transformToSQL.
* [SPARK-13827][SQL] Can't add subquery to an operator with same-name outputs while generating SQL string
  (Wenchen Fan, 2016-03-16, 3 files changed, -103/+147)

  ## What changes were proposed in this pull request?
  This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into a SQL string and put it after the FROM clause, we need to wrap it with a subquery. However, a logical plan is allowed to have same-name outputs with different qualifiers (e.g. the `Join` operator), and this kind of plan can't be put under a subquery, as we will erase and assign a new qualifier to all outputs, making it impossible to distinguish same-name outputs.

  To solve this problem, this PR renames all attributes with globally unique names (using exprId), so that we don't need qualifiers to resolve ambiguity anymore. For example, for `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL into a Window operator and a Project operator, and add a subquery between them. The generated SQL looks like:

  ```
  SELECT sq_1.key, sq_1.max
  FROM (
    SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max
    FROM (
      SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y
    ) AS sq_0
  ) AS sq_1
  ```

  You can see that the `key` columns become ambiguous after `sq_0`. After this PR, it will generate something like:

  ```
  SELECT attr_30 AS key, attr_37 AS max
  FROM (
    SELECT attr_30, attr_37
    FROM (
      SELECT attr_30, attr_35, MAX(attr_35) AS attr_37
      FROM (
        SELECT attr_30, attr_35
        FROM (SELECT key AS attr_30 FROM t1) AS sq_0
        INNER JOIN (SELECT key AS attr_35 FROM t1) AS sq_1
      ) AS sq_2
    ) AS sq_3
  ) AS sq_4
  ```

  The outermost SELECT is used to turn the generated names back into the real names, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore.

  ## How was this patch tested?
  Existing tests and new tests in LogicalPlanToSQLSuite.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #11658 from cloud-fan/gensql.
* [SPARK-13816][GRAPHX] Add parameter checks for algorithms in GraphX
  (Zheng RuiFeng, 2016-03-16, 6 files changed, -2/+25)

  JIRA: https://issues.apache.org/jira/browse/SPARK-13816

  ## What changes were proposed in this pull request?
  Add parameter checks for algorithms in GraphX: Pregel, LabelPropagation, PageRank, SVDPlusPlus.

  ## How was this patch tested?
  Manual tests.

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #11655 from zhengruifeng/graphx_param_check.
* [SPARK-13894][SQL] SqlContext.range return type from DataFrame to Dataset
  (Cheng Hao, 2016-03-16, 12 files changed, -37/+38)

  ## What changes were proposed in this pull request?
  https://issues.apache.org/jira/browse/SPARK-13894

  Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`.

  ## How was this patch tested?
  No additional unit test required.

  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #11730 from chenghao-intel/range.
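  A short sketch of what the new return type buys (assuming, as in 2.0-era Spark, that the element type is `java.lang.Long`, and that `sqlContext` is in scope): typed operations apply directly, with no DataFrame round-trip.

  ```scala
  // spark-shell session assumed.
  val ds = sqlContext.range(0, 10) // now a Dataset rather than a DataFrame
  ds.filter(_ % 2 == 0).show()     // typed filter over the elements
  ```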
* [SPARK-13924][SQL] Officially support multi-insert
  (Wenchen Fan, 2016-03-16, 3 files changed, -20/+58)

  ## What changes were proposed in this pull request?
  There is a feature of Hive SQL called multi-insert. For example:

  ```
  FROM src
  INSERT OVERWRITE TABLE dest1
  SELECT key + 1
  INSERT OVERWRITE TABLE dest2
  SELECT key WHERE key > 2
  INSERT OVERWRITE TABLE dest3
  SELECT col EXPLODE(arr) exp AS col
  ...
  ```

  We partially support it currently, with some limitations:
  1) WHERE can't reference columns produced by LATERAL VIEW.
  2) It's not executed eagerly, i.e. `sql("...multi-insert clause...")` won't take place right away like other commands, e.g. CREATE TABLE.

  This PR removes these limitations and makes us fully support multi-insert.

  ## How was this patch tested?
  New tests in `SQLQuerySuite`.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #11754 from cloud-fan/lateral-view.
* [SPARK-13360][PYSPARK][YARN] PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON are not picked up in yarn-cluster mode
  (Jeff Zhang, 2016-03-16, 1 file changed, -0/+3)

  Author: Jeff Zhang <zjffdu@apache.org>
  Closes #11238 from zjffdu/SPARK-13360.
* [SPARK-13281][CORE] Switch broadcast of RDD to exception from warning
  (Wesley Tang, 2016-03-16, 2 files changed, -6/+9)

  ## What changes were proposed in this pull request?
  In SparkContext, throw IllegalArgumentException when trying to broadcast an RDD directly, instead of logging a warning.

  ## How was this patch tested?
  mvn clean install; added a unit test in BroadcastSuite.

  Author: Wesley Tang <tangmingjun@mininglamp.com>
  Closes #11735 from breakdawn/master.
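  A hedged sketch of the new behavior from a caller's point of view (spark-shell `sc` assumed; the printed message is illustrative, not the actual exception text):

  ```scala
  val rdd = sc.parallelize(1 to 10)
  try {
    sc.broadcast(rdd) // after this patch: throws instead of logging a warning
  } catch {
    case e: IllegalArgumentException =>
      println(s"broadcast of an RDD was rejected: ${e.getMessage}")
  }
  ```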
* [SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succeeds to fix failure on slow machines
  (Sean Owen, 2016-03-16, 1 file changed, -3/+3)

  ## What changes were proposed in this pull request?
  I'm seeing several PR builder builds fail after https://github.com/apache/spark/pull/11725/files. Example: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.4/lastFailedBuild/console

  ```
  testCommunication(org.apache.spark.launcher.LauncherServerSuite)  Time elapsed: 0.023 sec  <<< FAILURE!
  java.lang.AssertionError: expected:<app-id> but was:<null>
      at org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherServerSuite.java:93)
  ```

  However, other builds pass this same test, including the test when run locally and on the Jenkins PR builder. The failure itself concerns a change to how the test waits on a condition, and the wait can time out; therefore I think this is due to fast/slow machine differences. This is an attempt at a hot fix; it's a little hard to verify since locally and on the PR builder, it passes anyway. The change itself should be harmless anyway.

  Why didn't this happen before, if the new logic was supposed to be equivalent to the old? I think this is the sequence:

  - First attempt to acquire semaphore for 10ms actually silently times out
  - The change being waited for happens just after that, a bit too late
  - Assertion passes since the condition became true just in time
  - `release()` fires from the listener
  - Next `tryAcquire` however immediately succeeds because the first `tryAcquire` didn't acquire anything, but its subsequent condition is not yet true; this would explain why the second one always fails

  Versus the original using `notifyAll()`, there's a small difference: `wait()`-ing after `notifyAll()` just results in another wait; it doesn't make it return immediately. So this was a tiny latent issue that was masked by the semantics. Now the test asserts that the event actually happened (the semaphore was acquired). (The timeout is still here to prevent the test from hanging forever, and to detect a really slow response.) The timeout is increased to a second to allow plenty of time anyway.

  ## How was this patch tested?
  Jenkins tests.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #11763 from srowen/SPARK-13823.3.
* [SPARK-13889][YARN] Fix integer overflow when calculating the max number of executor failures
  (Carson Wang, 2016-03-16, 1 file changed, -1/+4)

  ## What changes were proposed in this pull request?
  The max number of executor failures before failing the application defaults to twice the maximum number of executors when dynamic allocation is enabled. The default value for "spark.dynamicAllocation.maxExecutors" is Int.MaxValue, so the multiplication causes an integer overflow and a wrong result: the calculated default max number of executor failures becomes 3. This PR adds a check to avoid the overflow.

  ## How was this patch tested?
  It tests that the value is greater than Int.MaxValue / 2, to avoid the overflow when it is multiplied by 2.

  Author: Carson Wang <carson.wang@intel.com>
  Closes #11713 from carsonwang/IntOverflow.
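  The arithmetic behind the bug, and one overflow-safe shape for the fix (the guarded expression is a sketch of the described check, not the committed code):

  ```scala
  val maxNumExecutors = Int.MaxValue // default of spark.dynamicAllocation.maxExecutors

  // Broken: 2 * Int.MaxValue wraps around to -2, so the max() yields 3.
  val broken = math.max(maxNumExecutors * 2, 3)

  // Sketched fix: check against Int.MaxValue / 2 before doubling.
  val fixed =
    if (maxNumExecutors > Int.MaxValue / 2) Int.MaxValue
    else math.max(maxNumExecutors * 2, 3)

  println(broken) // 3
  println(fixed)  // 2147483647
  ```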
* [SPARK-13793][CORE] PipedRDD doesn't propagate exceptions while reading parent RDD
  (Tejas Patil, 2016-03-16, 2 files changed, -32/+86)

  ## What changes were proposed in this pull request?
  PipedRDD creates a child thread to read output of the parent stage and feed it to the pipe process. Used a variable to save the exception thrown in the child thread, then propagated the exception in the main thread if the variable was set.

  ## How was this patch tested?
  - Added a unit test
  - Ran all the existing tests in PipedRDDSuite and they all pass with the change
  - Tested the patch with a real pipe() job, bounced the executor node which ran the parent stage to simulate a fetch failure, and observed that the parent stage was re-run.

  Author: Tejas Patil <tejasp@fb.com>
  Closes #11628 from tejasapatil/pipe_rdd.
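  A generic sketch of the propagation pattern described above (not the actual PipedRDD code): the child thread records any `Throwable` in a shared volatile field, and the main thread rethrows it once the child finishes.

  ```scala
  object PipeReader {
    @volatile var childException: Throwable = null

    def main(args: Array[String]): Unit = {
      val reader = new Thread(new Runnable {
        override def run(): Unit =
          try {
            throw new RuntimeException("simulated failure while reading parent RDD")
          } catch {
            case t: Throwable => childException = t // save instead of swallowing
          }
      })
      reader.start()
      reader.join()

      // Propagate in the main thread if the child thread failed.
      if (childException != null) throw new RuntimeException(childException)
    }
  }
  ```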
* [SPARK-13396] Stop using our internal deprecated .metrics on ExceptionFailure
  (GayathriMurali, 2016-03-16, 1 file changed, -8/+14)

  JIRA: https://issues.apache.org/jira/browse/SPARK-13396

  Stop using our internal deprecated `.metrics` on ExceptionFailure; use `accumUpdates` instead.

  Author: GayathriMurali <gayathri.m.softie@gmail.com>
  Closes #11544 from GayathriMurali/SPARK-13396.
* [SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, StandardCharsets follow-up
  (Sean Owen, 2016-03-16, 41 files changed, -184/+178)

  ## What changes were proposed in this pull request?
  Follow-up to https://github.com/apache/spark/pull/11657:

  - Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8`
  - And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
  - And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings

  ## How was this patch tested?
  Jenkins tests.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #11725 from srowen/SPARK-13823.2.
* [SPARK-13906] Ensure that there are at least 2 dispatcher threads
  (Yonathan Randolph, 2016-03-16, 1 file changed, -1/+1)

  ## What changes were proposed in this pull request?
  Force at least two dispatcher-event-loop threads. Since SparkDeploySchedulerBackend (in AppClient) calls askWithRetry to CoarseGrainedScheduler in the same process, the driver needs at least two dispatcher threads to prevent the dispatcher thread from hanging.

  ## How was this patch tested?
  Manual.

  Author: Yonathan Randolph <yonathangmail.com>
  Author: Yonathan Randolph <yonathan@liftigniter.com>
  Closes #11728 from yonran/SPARK-13906.
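  The gist of the one-line fix, sketched (the surrounding Dispatcher code is not shown; the floor of two is the point):

  ```scala
  // Before: the pool could be sized 1 on a single-core machine, so a
  // same-process askWithRetry could starve the only dispatcher thread.
  val numThreads = math.max(2, Runtime.getRuntime.availableProcessors())
  ```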
* [SPARK-12653][SQL] Re-enable test "SPARK-8489: MissingRequirementError during reflection"
  (Dongjoon Hyun, 2016-03-16, 4 files changed, -2/+7)

  ## What changes were proposed in this pull request?
  The purpose of [SPARK-12653](https://issues.apache.org/jira/browse/SPARK-12653) is re-enabling a regression test. Historically, the target regression test was added by [SPARK-8498](https://github.com/apache/spark/commit/093c34838d1db7a9375f36a9a2ab5d96a23ae683), but was temporarily disabled by [SPARK-12615](https://github.com/apache/spark/commit/8ce645d4eeda203cf5e100c4bdba2d71edd44e6a) due to a binary compatibility error. The following is the current error message when submitting a Spark job with the pre-built `test.jar` file in the target regression test:

  ```
  Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.$lessinit$greater$default$6()Lscala/collection/Map;
  ```

  Simply rebuilding `test.jar` cannot recover the purpose of the testcase since we need to support both Scala 2.10 and 2.11 for a while. For example, we would face the following Scala 2.11 error if we used a `test.jar` built by Scala 2.10:

  ```
  Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
  ```

  This PR replaces the existing `test.jar` with `test-2.10.jar` and `test-2.11.jar` and improves the regression test to use the suitable jar file.

  ## How was this patch tested?
  Pass the existing Jenkins test.

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #11744 from dongjoon-hyun/SPARK-12653.
* [SPARK-13899][SQL] Produce InternalRow instead of external Row at CSV data source
  (hyukjinkwon, 2016-03-15, 4 files changed, -22/+42)

  ## What changes were proposed in this pull request?
  https://issues.apache.org/jira/browse/SPARK-13899

  This PR makes the CSV data source produce `InternalRow` instead of `Row`. Basically, this resembles the JSON data source. It uses the same code for casting.

  ## How was this patch tested?
  Unit tests were used within the IDE, and code style was checked by `./dev/run_tests`.

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #11717 from HyukjinKwon/SPARK-13899.
* [SPARK-13920][BUILD] MiMa checks should apply to @Experimental and @DeveloperAPI APIs
  (Dongjoon Hyun, 2016-03-15, 2 files changed, -19/+214)

  ## What changes were proposed in this pull request?
  We are able to change `Experimental` and `DeveloperAPI` APIs freely, but we should still monitor and manage those APIs carefully. This PR for [SPARK-13920](https://issues.apache.org/jira/browse/SPARK-13920) enables the MiMa check and adds filters for them.

  ## How was this patch tested?
  Pass the Jenkins tests (including MiMa).

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #11751 from dongjoon-hyun/SPARK-13920.
* [SPARK-9837][ML] R-like summary statistics for GLMs via iteratively reweighted least squares
  (Yanbo Liang, 2016-03-15, 3 files changed, -11/+796)

  ## What changes were proposed in this pull request?
  Provide R-like summary statistics for GLMs via iteratively reweighted least squares.

  ## How was this patch tested?
  Unit tests.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #11694 from yanboliang/spark-9837.
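  A hedged usage sketch: fit a GLM and read R-like statistics from its summary. The setter and summary field names are assumptions based on the PR's goal of matching R's `summary.glm`; `training` is an assumed DataFrame of (label, features).

  ```scala
  import org.apache.spark.ml.regression.GeneralizedLinearRegression

  val glr = new GeneralizedLinearRegression()
    .setFamily("gaussian")
    .setLink("identity")

  val model = glr.fit(training) // `training` assumed to be in scope
  val summary = model.summary   // assumed to expose R-like statistics

  println(summary.coefficientStandardErrors.mkString(", "))
  println(summary.pValues.mkString(", "))
  ```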
* [SPARK-13917][SQL] Generate broadcast semi join
  (Davies Liu, 2016-03-15, 11 files changed, -139/+124)

  ## What changes were proposed in this pull request?
  This PR brings codegen support for broadcast left-semi join.

  ## How was this patch tested?
  Existing tests. Added a benchmark; the results show a 7X speedup.

  Author: Davies Liu <davies@databricks.com>
  Closes #11742 from davies/gen_semi.
* [MINOR][TEST][SQL] Remove wrong "expected" parameter in checkNaNWithoutCodegen
  (Yucai Yu, 2016-03-15, 1 file changed, -1/+0)

  ## What changes were proposed in this pull request?
  Remove the wrong "expected" parameter in MathFunctionsSuite.scala's checkNaNWithoutCodegen. This function checks NaN values, so the "expected" parameter is useless. Callers do not pass an "expected" value, and similar functions like checkNaNWithGeneratedProjection and checkNaNWithOptimization do not use it either.

  Author: Yucai Yu <yucai.yu@intel.com>
  Closes #11718 from yucai/unused_expected.
* [SPARK-13918][SQL] Merge SortMergeJoin and SortMergeOuterJoin
  (Davies Liu, 2016-03-15, 9 files changed, -535/+467)

  ## What changes were proposed in this pull request?
  This PR just moves some code from SortMergeOuterJoin into SortMergeJoin. This is to support codegen for outer join.

  ## How was this patch tested?
  Existing tests.

  Author: Davies Liu <davies@databricks.com>
  Closes #11743 from davies/gen_smjouter.
* [SPARK-13895][SQL] DataFrameReader.text should return Dataset[String]
  (Reynold Xin, 2016-03-15, 3 files changed, -12/+16)

  ## What changes were proposed in this pull request?
  This patch changes DataFrameReader.text()'s return type from DataFrame to Dataset[String]. Closes #11731.

  ## How was this patch tested?
  Updated existing integration tests to reflect the change.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #11739 from rxin/SPARK-13895.
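  A sketch of the new shape from the caller's side (the path is illustrative and `sqlContext` is assumed to be in scope):

  ```scala
  val lines = sqlContext.read.text("/tmp/app.log") // Dataset[String] after this patch
  val errors = lines.filter(_.contains("ERROR"))   // typed, per-line filter
  println(errors.count())
  ```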
* [SPARK-13626][CORE] Revert change to SparkConf's constructor
  (Marcelo Vanzin, 2016-03-15, 1 file changed, -1/+1)

  It shouldn't be private.

  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #11734 from vanzin/SPARK-13626-api.
* [MINOR] A minor fix for the comments of a method in RPC Dispatcher
  (CodingCat, 2016-03-15, 1 file changed, -1/+1)

  ## What changes were proposed in this pull request?
  A minor fix for the comments of a method in RPC Dispatcher.

  ## How was this patch tested?
  Existing unit tests.

  Author: CodingCat <zhunansjtu@gmail.com>
  Closes #11738 from CodingCat/minor_rpc.
* [SPARK-13896][SQL][STRING] Dataset.toJSON should return Dataset
  (Stavros Kontopoulos, 2016-03-15, 3 files changed, -8/+10)

  ## What changes were proposed in this pull request?
  Change the return type of toJSON in the Dataset class.

  ## How was this patch tested?
  No additional unit test required.

  Author: Stavros Kontopoulos <stavros.kontopoulos@typesafe.com>
  Closes #11732 from skonto/fix_toJson.
* [SPARK-13642][YARN] Changed the default application exit state to failed for yarn cluster mode
  (jerryshao, 2016-03-15, 1 file changed, -7/+8)

  ## What changes were proposed in this pull request?
  Change the default exit state to `failed` for any application running in yarn-cluster mode.

  ## How was this patch tested?
  Unit test is done locally. CC tgravescs and vanzin.

  Author: jerryshao <sshao@hortonworks.com>
  Closes #11693 from jerryshao/SPARK-13642.