spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-15396][SQL][DOC] It can't connect hive metastore database	gatorsmile	2016-05-21	2	-33/+50
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	#### What changes were proposed in this pull request? The `hive.metastore.warehouse.dir` property in hive-site.xml is deprecated since Spark 2.0.0. Users might not be able to connect to the existing metastore if they do not use the new conf parameter `spark.sql.warehouse.dir`. This PR updates the documentation and examples to explain the latest changes to the configuration of the default database location. Below are screenshots of the latest generated docs: <img width="681" alt="screenshot 2016-05-20 08 38 10" src="https://cloud.githubusercontent.com/assets/11567269/15433296/a05c4ace-1e66-11e6-8d2b-73682b32e9c2.png"> <img width="789" alt="screenshot 2016-05-20 08 53 26" src="https://cloud.githubusercontent.com/assets/11567269/15433734/645dc42e-1e68-11e6-9476-effc9f8721bb.png"> <img width="789" alt="screenshot 2016-05-20 08 53 37" src="https://cloud.githubusercontent.com/assets/11567269/15433738/68569f92-1e68-11e6-83d3-ef5bb221a8d8.png"> No change is made to the R example. <img width="860" alt="screenshot 2016-05-20 08 54 38" src="https://cloud.githubusercontent.com/assets/11567269/15433779/965b8312-1e68-11e6-8bc4-53c88ceacde2.png"> #### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #13225 from gatorsmile/document.
*	[SPARK-15415][SQL] Fix BroadcastHint when autoBroadcastJoinThreshold is 0 or -1	Jurriaan Pruis	2016-05-21	5	-26/+114
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR makes BroadcastHint more deterministic by using a special isBroadcastable property instead of setting the sizeInBytes to 1. See https://issues.apache.org/jira/browse/SPARK-15415 ## How was this patch tested? Added testcases to test if the broadcast hash join is included in the plan when the BroadcastHint is supplied and also tests for propagation of the joins. Author: Jurriaan Pruis <email@jurriaanpruis.nl> Closes #13244 from jurriaan/broadcast-hint.
*	[SPARK-15206][SQL] add testcases for distinct aggregate in having clause	xin Wu	2016-05-21	1	-0/+31
\| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add new test cases for including distinct aggregate in having clause in 2.0 branch. This is a followup PR for [#12974](https://github.com/apache/spark/pull/12974), which is for 1.6 branch. Author: xin Wu <xinwu@us.ibm.com> Closes #12984 from xwu0226/SPARK-15206.
*	[SPARK-15330][SQL] Implement Reset Command	gatorsmile	2016-05-21	5	-5/+82
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	#### What changes were proposed in this pull request? Like the `Set` command, `Reset` is also supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-3202 This PR implements such a command for resetting the SQL-related configuration to the default values. One of the use cases shown in HIVE-3202 is listed below: > For the purpose of optimization we set various configs per query. It's worthy but all those configs should be reset every time for next query. #### How was this patch tested? Added a test case. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13121 from gatorsmile/resetCommand.
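The reset behaviour described above can be sketched in a few lines. This is only a toy model, not Spark's implementation: the class `SessionConf` and its methods are hypothetical names introduced for illustration, and the configuration key is just an example.

```python
class SessionConf:
    """Toy stand-in for a SQL session configuration (hypothetical, not Spark's SQLConf)."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # values every session starts with
        self._overrides = {}             # values changed by SET

    def set(self, key, value):
        # SET key=value: record a per-session override
        self._overrides[key] = value

    def get(self, key):
        # An override wins; otherwise fall back to the default
        return self._overrides.get(key, self._defaults.get(key))

    def reset(self):
        # RESET: forget every override so all keys return to their defaults
        self._overrides.clear()


conf = SessionConf({"spark.sql.shuffle.partitions": "200"})
conf.set("spark.sql.shuffle.partitions", "8")   # tuned for one query
conf.reset()                                    # back to defaults for the next query
assert conf.get("spark.sql.shuffle.partitions") == "200"
```

Keeping overrides separate from defaults makes the reset a single `clear()` rather than a key-by-key restore.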
*	[SPARK-15280] [Input/Output] Refactored OrcOutputWriter and moved ↵	Ergin Seyfe	2016-05-21	1	-39/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	serialization to a new class. ## What changes were proposed in this pull request? Refactoring: Separated ORC serialization logic from OrcOutputWriter and moved to a new class called OrcSerializer. ## How was this patch tested? Manual tests & existing tests. Author: Ergin Seyfe <eseyfe@fb.com> Closes #13066 from seyfe/orc_serializer.
*	[SPARK-15452][SQL] Mark aggregator API as experimental	Reynold Xin	2016-05-21	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The Aggregator API was introduced in 2.0 for Dataset. All typed Dataset APIs should still be marked as experimental in 2.0. ## How was this patch tested? N/A - annotation only change. Author: Reynold Xin <rxin@databricks.com> Closes #13226 from rxin/SPARK-15452.
*	[SPARK-15114][SQL] Column name generated by typed aggregate is super verbose	Dilip Biswal	2016-05-21	5	-5/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Generate a shorter default alias for `AggregateExpression`. In this PR, the aggregate function name along with an index is used for generating the alias name. ```scala val ds = Seq(1, 3, 2, 5).toDS() ds.select(typed.sum((i: Int) => i), typed.avg((i: Int) => i)).show() ``` Output before change: ``` +-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+ \|typedsumdouble(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), upcast(value))\|typedaverage(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), newInstance(class scala.Tuple2))\| +-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+ \| 11.0\| 2.75\| +-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+ ``` Output after change: ``` +-----------------+---------------+ \|typedsumdouble_c1\|typedaverage_c2\| +-----------------+---------------+ \| 11.0\| 2.75\| +-----------------+---------------+ ``` Note: There is one test in ParquetSuites.scala which shows that the system-picked alias name is not usable and is rejected. 
[test](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala#L672-#L687) ## How was this patch tested? A new test was added in DataSetAggregatorSuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #13045 from dilipbiswal/spark-15114.
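The naming scheme can be illustrated with a short sketch. The `_c<index>` pattern and 1-based numbering are inferred from the "after" output shown in the message; the function `short_aliases` is a hypothetical helper, not part of Spark.

```python
def short_aliases(function_names):
    """Build aliases of the form <function name>_c<position>, counting from 1.

    The format is inferred from the sample output in the commit message.
    """
    return [f"{name}_c{i}" for i, name in enumerate(function_names, start=1)]


print(short_aliases(["typedsumdouble", "typedaverage"]))
# prints ['typedsumdouble_c1', 'typedaverage_c2']
```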
*	[SPARK-15462][SQL][TEST] `unresolved === false` is enough in testcases.	Dongjoon Hyun	2016-05-21	4	-12/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In the `catalyst` module alone, there exist 8 evaluation test cases on unresolved expressions. But in real-world situations those cases don't happen, since they raise exceptions before evaluation. ```scala scala> sql("select format_number(null, 3)") res0: org.apache.spark.sql.DataFrame = [format_number(CAST(NULL AS DOUBLE), 3): string] scala> sql("select format_number(cast(null as NULL), 3)") org.apache.spark.sql.catalyst.parser.ParseException: DataType null() is not supported.(line 1, pos 34) ``` This PR makes those testcases more realistic. ```scala - checkEvaluation(FormatNumber(Literal.create(null, NullType), Literal(3)), null) + assert(FormatNumber(Literal.create(null, NullType), Literal(3)).resolved === false) ``` This PR also removes the redundant `resolved` check in the `FoldablePropagation` optimizer. ## How was this patch tested? Pass the modified Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13241 from dongjoon-hyun/SPARK-15462.
*	[SPARK-15445][SQL] Build fails for java 1.7 after adding java.mathBigInteger ↵	Sandeep Singh	2016-05-21	1	-11/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	support ## What changes were proposed in this pull request? Using longValue() and then checking whether the value is in the range for a long manually. ## How was this patch tested? Existing tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13223 from techaddict/SPARK-15445.
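The "take `longValue()` and then check the range manually" idea can be sketched as follows. This is an emulation in Python, not the Scala code from the patch: `long_value` mimics how `java.math.BigInteger.longValue()` keeps only the low-order 64 bits, and `to_long_checked` is a hypothetical helper that detects the resulting wrap-around.

```python
def long_value(value):
    """Emulate BigInteger.longValue(): keep the low 64 bits as a signed two's-complement number."""
    low = value & 0xFFFFFFFFFFFFFFFF
    return low - (1 << 64) if low >= (1 << 63) else low


def to_long_checked(value):
    """Narrow to a 64-bit long, raising if the value does not fit (hypothetical helper)."""
    result = long_value(value)
    if result != value:
        # The narrowed value differs, so the original was outside the long range
        raise ArithmeticError(f"{value} is out of range for a 64-bit long")
    return result


assert to_long_checked(2**63 - 1) == 2**63 - 1   # largest long still fits
assert long_value(2**63) == -(2**63)             # one past the maximum wraps around
```

Comparing the narrowed result with the original is one way to do the manual check; comparing against explicit minimum and maximum bounds would work equally well.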
*	[SPARK-15424][SPARK-15437][SPARK-14807][SQL] Revert Create a ↵	Reynold Xin	2016-05-20	7	-77/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	hivecontext-compatibility module ## What changes were proposed in this pull request? I initially asked to create a hivecontext-compatibility module to put the HiveContext there. But we are so close to Spark 2.0 release and there is only a single class in it. It seems overkill to have an entire package, which makes it more inconvenient, for a single class. ## How was this patch tested? Tests were moved. Author: Reynold Xin <rxin@databricks.com> Closes #13207 from rxin/SPARK-15424.
*	[SPARK-15456][PYSPARK] Fixed PySpark shell context initialization when ↵	Bryan Cutler	2016-05-20	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	HiveConf not present ## What changes were proposed in this pull request? When the PySpark shell cannot find HiveConf, it falls back to creating a SparkSession from a SparkContext. This fixes a bug caused by using a variable referring to the SparkContext before it was initialized. ## How was this patch tested? Manually starting the PySpark shell and using the SparkContext. Author: Bryan Cutler <cutlerb@gmail.com> Closes #13237 from BryanCutler/pyspark-shell-session-context-SPARK-15456.
*	[SPARK-15031][EXAMPLE] Use SparkSession in examples	Zheng RuiFeng	2016-05-20	34	-146/+279
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031). `MLLIB` is no longer recommended, so the examples in `MLLIB` are ignored in this PR. `StreamingContext` cannot be obtained directly from `SparkSession`, so the examples in `Streaming` are ignored too. cc andrewor14 ## How was this patch tested? Manual tests with spark-submit. Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13164 from zhengruifeng/use_sparksession_ii.
*	[SPARK-15273] YarnSparkHadoopUtil#getOutOfMemoryErrorArgument should respect ↵	tedyu	2016-05-20	2	-15/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	OnOutOfMemoryError parameter given by user ## What changes were proposed in this pull request? As Nirav reported in this thread: http://search-hadoop.com/m/q3RTtdF3yNLMd7u YarnSparkHadoopUtil#getOutOfMemoryErrorArgument previously specified 'kill %p' unconditionally. We should respect the parameter given by user. ## How was this patch tested? Existing tests Author: tedyu <yuzhihong@gmail.com> Closes #13057 from tedyu/master.
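A minimal sketch of "respect the user's setting, otherwise fall back to `kill %p`" is shown below. It is not the Scala code from `YarnSparkHadoopUtil`: the function name `oom_argument` and the list-of-strings representation of JVM options are assumptions made for illustration. Only the `-XX:OnOutOfMemoryError` JVM flag and the `kill %p` default come from the message.

```python
def oom_argument(java_opts, default_handler="kill %p"):
    """Return the extra JVM arguments to append (hypothetical helper).

    A default out-of-memory handler is added only when the user has not
    already supplied -XX:OnOutOfMemoryError in their own options.
    """
    if any("-XX:OnOutOfMemoryError" in opt for opt in java_opts):
        return []  # the user's handler wins; add nothing
    return [f"-XX:OnOutOfMemoryError='{default_handler}'"]


assert oom_argument(["-Xmx1g"]) == ["-XX:OnOutOfMemoryError='kill %p'"]
assert oom_argument(["-XX:OnOutOfMemoryError='kill -9 %p'"]) == []
```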
*	[SPARK-15078] [SQL] Add all TPCDS 1.4 benchmark queries for SparkSQL	Sameer Agarwal	2016-05-20	106	-1226/+4858
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Now that SparkSQL supports all TPC-DS queries, this patch adds all 99 benchmark queries inside SparkSQL. ## How was this patch tested? Benchmark only Author: Sameer Agarwal <sameer@databricks.com> Closes #13188 from sameeragarwal/tpcds-all.
*	[SPARK-15454][SQL] Filter out files starting with _	Reynold Xin	2016-05-20	2	-5/+16
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Many other systems (e.g. Impala) use `_xxx` files for staging, and Spark should not read those files. ## How was this patch tested? Added a unit test case. Author: Reynold Xin <rxin@databricks.com> Closes #13227 from rxin/SPARK-15454.
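The filtering rule can be sketched as a predicate on the file name. This covers only the underscore rule stated in the message; the function `is_data_file` and the sample paths are hypothetical, and any other rules Spark applies when listing files are out of scope here.

```python
import posixpath


def is_data_file(path):
    """Return False for files whose base name starts with '_' (treated as staging files)."""
    return not posixpath.basename(path).startswith("_")


paths = [
    "/warehouse/t/part-00000.parquet",
    "/warehouse/t/_impala_insert_staging",
    "/warehouse/t/_SUCCESS",
]
print([p for p in paths if is_data_file(p)])
# prints ['/warehouse/t/part-00000.parquet']
```

Only the base name is checked, so an underscore elsewhere in the path does not exclude a file.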
*	[SPARK-15438][SQL] improve explain of whole stage codegen	Davies Liu	2016-05-20	3	-67/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Currently, the explain output of a query with whole-stage codegen looks like this: ``` >>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain() == Physical Plan == WholeStageCodegen : +- Project [id#1L] : +- BroadcastHashJoin [id#1L], [id#4L], Inner, BuildRight, None : :- Range 0, 1, 4, 1000, [id#1L] : +- INPUT +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint])) +- WholeStageCodegen : +- Range 0, 1, 4, 1000, [id#4L] ``` The problem is that the plan looks very different from the logical plan, which makes it hard to understand (especially when the logical plan is not shown alongside it). This PR changes it to: ``` >>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain() == Physical Plan == Project [id#0L] +- BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight, None :- Range 0, 1, 4, 1000, [id#0L] +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- Range 0, 1, 4, 1000, [id#3L] ``` A `*` before a plan node means that it is part of whole-stage codegen, which is easy to understand. ## How was this patch tested? Manually ran some queries and checked the explain output. Author: Davies Liu <davies@databricks.com> Closes #13204 from davies/explain_codegen.
*	[SPARK-10216][SQL] Revert "[] Avoid creating empty files during overwrit…	Michael Armbrust	2016-05-20	4	-182/+126
\| \| \| \| \| \| \| \|	This reverts commit 8d05a7a from #12855, which seems to have caused regressions when working with empty DataFrames. Author: Michael Armbrust <michael@databricks.com> Closes #13181 from marmbrus/revert12855.
*	[SPARK-15190][SQL] Support using SQLUserDefinedType for case classes	Shixiong Zhu	2016-05-20	2	-36/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Right now inferring the schema for case classes happens before searching the SQLUserDefinedType annotation, so the SQLUserDefinedType annotation for case classes doesn't work. This PR simply changes the inferring order to resolve it. I also reenabled the java.math.BigDecimal test and added two tests for `List`. ## How was this patch tested? `encodeDecodeTest(UDTCaseClass(new java.net.URI("http://spark.apache.org/")), "udt with case class")` Author: Shixiong Zhu <shixiong@databricks.com> Closes #12965 from zsxwing/SPARK-15190.
*	[SPARK-15165] [SPARK-15205] [SQL] Introduce place holder for comments in ↵	Kousuke Saruta	2016-05-20	15	-57/+95
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	generated code ## What changes were proposed in this pull request? This PR introduces placeholders for comments in generated code. The purpose is the same as that of #12939, but the approach is much safer. Generated code to be compiled doesn't include the actual comments but includes placeholders instead. Placeholders in generated code are replaced with the actual comments only at logging time. This PR also resolves SPARK-15205. ## How was this patch tested? Existing tests. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #12979 from sarutak/SPARK-15205.
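The placeholder mechanism can be sketched roughly as below. The class `CommentHolder`, its method names and the `/*placeholder_N*/` token format are all hypothetical; the real code generator may use a different scheme. The point illustrated is only that the comment text is kept out of the code handed to the compiler and substituted back when the code is logged.

```python
class CommentHolder:
    """Keeps comment text aside and hands out placeholders (hypothetical sketch)."""

    def __init__(self):
        self._comments = {}  # placeholder token -> rendered comment

    def add(self, text):
        # The token is what gets embedded in the code that will be compiled
        token = f"/*placeholder_{len(self._comments)}*/"
        self._comments[token] = f"/* {text} */"
        return token

    def for_logging(self, code):
        # Substitute the real comments only when the code is about to be logged
        for token, comment in self._comments.items():
            code = code.replace(token, comment)
        return code


holder = CommentHolder()
code = holder.add("input[0, int] + 1") + "\nint result = value + 1;"
assert "input[0, int]" not in code                       # compiled code has no comment text
assert "/* input[0, int] + 1 */" in holder.for_logging(code)
```

Because each token ends with its own closing `*/`, the token for placeholder 1 is never a prefix of the token for placeholder 10, so plain string replacement is unambiguous in this sketch.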
*	[HOTFIX] disable stress test	Davies Liu	2016-05-20	1	-1/+2
\|
*	[SPARK-15360][SPARK-SUBMIT] Should print spark-submit usage when no ↵	wm624@hotmail.com	2016-05-20	2	-19/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	arguments is specified In 2.0, ./bin/spark-submit doesn't print out usage when run without arguments; it raises an exception instead. This PR adds exception handling in Main.java for that exception: if there is no additional argument, the handler prints out usage. Manually tested. ./bin/spark-submit Usage: spark-submit [options] <app jar \| python file> [app arguments] Usage: spark-submit --kill [submission ID] --master [spark://...] Usage: spark-submit --status [submission ID] --master [spark://...] Usage: spark-submit run-example [options] example-class [example args] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). --class CLASS_NAME Your application's main class (for Java / Scala apps). --name NAME A name of your application. --jars JARS Comma-separated list of local jars to include on the driver and executor classpaths. --packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #13163 from wangmiao1981/submit.
*	[SPARK-15400][SQL] CreateNamedStruct and CreateNamedStructUnsafe should ↵	Takuya UESHIN	2016-05-20	2	-5/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	preserve metadata of value expressions if it is NamedExpression. ## What changes were proposed in this pull request? `CreateNamedStruct` and `CreateNamedStructUnsafe` should preserve the metadata of value expressions that are `NamedExpression`s, as `CreateStruct` and `CreateStructUnsafe` do. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13193 from ueshin/issues/SPARK-15400.
*	[SPARK-15435][SQL] Append Command to all commands	Reynold Xin	2016-05-20	20	-170/+173
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We started this convention to append Command suffix to all SQL commands. However, not all commands follow that convention. This patch adds Command suffix to all RunnableCommands. ## How was this patch tested? Updated test cases to reflect the renames. Author: Reynold Xin <rxin@databricks.com> Closes #13215 from rxin/SPARK-15435.
*	[SPARK-15308][SQL] RowEncoder should preserve nested column name.	Takuya UESHIN	2016-05-20	2	-10/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The following code generates wrong schema: ``` val schema = new StructType().add( "struct", new StructType() .add("i", IntegerType, nullable = false) .add( "s", new StructType().add("int", IntegerType, nullable = false), nullable = false), nullable = false) val ds = sqlContext.range(10).map(l => Row(l, Row(l)))(RowEncoder(schema)) ds.printSchema() ``` This should print as follows: ``` root \|-- struct: struct (nullable = false) \| \|-- i: integer (nullable = false) \| \|-- s: struct (nullable = false) \| \| \|-- int: integer (nullable = false) ``` but the result is: ``` root \|-- struct: struct (nullable = false) \| \|-- col1: integer (nullable = false) \| \|-- col2: struct (nullable = false) \| \| \|-- col1: integer (nullable = false) ``` This PR fixes `RowEncoder` to preserve nested column name. ## How was this patch tested? Existing tests and I added a test to check if `RowEncoder` preserves nested column name. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13090 from ueshin/issues/SPARK-15308.
*	[SPARK-15222][SPARKR][ML] SparkR ML examples update in 2.0	Yanbo Liang	2016-05-20	1	-17/+112
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Update example code in examples/src/main/r/ml.R to reflect the new algorithms. * spark.glm and glm * spark.survreg * spark.naiveBayes * spark.kmeans ## How was this patch tested? Offline test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13000 from yanboliang/spark-15222.
*	[SPARK-15203][DEPLOY] The spark daemon shell script error, daemon process ↵	WeichenXu	2016-05-20	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	start successfully but script output fail message ## What changes were proposed in this pull request? Fix the bug in the Spark daemon shell script: the daemon process starts successfully, but the script outputs a failure message. ## How was this patch tested? Existing tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #13172 from WeichenXu123/fix-spark-15203.
*	[SPARK-15444][PYSPARK][ML][HOTFIX] Default value mismatch of param ↵	Liang-Chi Hsieh	2016-05-20	1	-6/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	linkPredictionCol for GeneralizedLinearRegression ## What changes were proposed in this pull request? The default value of the param linkPredictionCol for GeneralizedLinearRegression differs between PySpark and Scala, because of a default-value conflict between #13106 and #13129. This causes ml.tests to fail. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13220 from viirya/hotfix-regresstion.
*	[SPARK-15417][SQL][PYTHON] PySpark shell always uses in-memory catalog	Andrew Or	2016-05-19	2	-3/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? There is no way to use the Hive catalog in `pyspark-shell`. This is because we used to create a `SparkContext` before calling `SparkSession.enableHiveSupport().getOrCreate()`, which just gets the existing `SparkContext` instead of creating a new one. As a result, `spark.sql.catalogImplementation` was never propagated. ## How was this patch tested? Manual. Author: Andrew Or <andrew@databricks.com> Closes #13203 from andrewor14/fix-pyspark-shell.
*	[SPARK-15421][SQL] Validate DDL property values	Andrew Or	2016-05-19	2	-9/+77
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? When we parse DDLs involving table or database properties, we need to validate the values. E.g. if we alter a database's property without providing a value: ``` ALTER DATABASE my_db SET DBPROPERTIES('some_key') ``` Then we'll ignore it with Hive, but override the property with the in-memory catalog. Inconsistencies like these arise because we don't validate the property values. In such cases, we should throw exceptions instead. ## How was this patch tested? `DDLCommandSuite` Author: Andrew Or <andrew@databricks.com> Closes #13205 from andrewor14/ddl-prop-values.
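The validation step can be sketched as a check that rejects any property key parsed without a value. Representing a missing value as `None` and the function name `validate_properties` are assumptions for illustration; the exception type used by the actual parser is not stated in the message.

```python
def validate_properties(props):
    """Reject property maps in which a key has no value (hypothetical helper).

    `props` maps each key to its parsed value, or to None when the DDL
    statement supplied the key alone, e.g. DBPROPERTIES('some_key').
    """
    missing = sorted(key for key, value in props.items() if value is None)
    if missing:
        raise ValueError(f"Values must be specified for key(s): {', '.join(missing)}")
    return dict(props)


assert validate_properties({"owner": "alice"}) == {"owner": "alice"}
try:
    validate_properties({"some_key": None})
except ValueError as err:
    print(err)  # prints: Values must be specified for key(s): some_key
```

Failing at parse time gives the same behaviour regardless of which catalog implementation sits behind the statement.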
*	[SPARK-15367][SQL] Add refreshTable back	gatorsmile	2016-05-20	7	-26/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	#### What changes were proposed in this pull request? `refreshTable` was a method in `HiveContext`. It was deleted accidentally while we were migrating the APIs. This PR is to add it back to `HiveContext`. In addition, in `SparkSession`, we put it under the catalog namespace (`SparkSession.catalog.refreshTable`). #### How was this patch tested? Changed the existing test cases to use the function `refreshTable`. Also added a test case for refreshTable in `hivecontext-compatibility` Author: gatorsmile <gatorsmile@gmail.com> Closes #13156 from gatorsmile/refreshTable.
*	[SPARK-15339][ML] ML 2.0 QA: Scala APIs and code audit for regression	Yanbo Liang	2016-05-19	5	-47/+58
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? * `GeneralizedLinearRegression` API docs enhancement. * The default value of `GeneralizedLinearRegression` `linkPredictionCol` is unset rather than empty. This is consistent with other similar params such as `weightCol`. * Make some methods more private. * Fix a minor bug of LinearRegression. * Fix some other issues. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13129 from yanboliang/spark-15339.
*	[SPARK-15394][ML][DOCS] User guide typos and grammar audit	sethah	2016-05-19	5	-46/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Correct some typos and incorrectly worded sentences. ## How was this patch tested? Doc changes only. Note that many of these changes were identified by whomfire01 Author: sethah <seth.hendrickson16@gmail.com> Closes #13180 from sethah/ml_guide_audit.
*	[SPARK-15398][ML] Update the warning message to recommend ML usage	Zheng RuiFeng	2016-05-19	12	-37/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? MLlib are not recommended to use, and some methods are even deprecated. Update the warning message to recommend ML usage. ``` def showWarning() { System.err.println( """WARN: This is a naive implementation of Logistic Regression and is given as an example! \|Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or \|org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS \|for more conventional use. """.stripMargin) } ``` To ``` def showWarning() { System.err.println( """WARN: This is a naive implementation of Logistic Regression and is given as an example! \|Please use org.apache.spark.ml.classification.LogisticRegression \|for more conventional use. """.stripMargin) } ``` ## How was this patch tested? local build Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13190 from zhengruifeng/update_recd.
*	[SPARK-15363][ML][EXAMPLE] Example code shouldn't use VectorImplicits._, ↵	wm624@hotmail.com	2016-05-19	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	asML/fromML ## What changes were proposed in this pull request? This DataFrame example uses VectorImplicits._, which is a private API. Since the Vectors object has a public API, we use Vectors.fromML instead of the implicits. ## How was this patch tested? Manually ran the example. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #13213 from wangmiao1981/ml.
*	[SPARK-15335][SQL] Implement TRUNCATE TABLE Command	Lianhui Wang	2016-05-19	3	-0/+151
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Hive supports the TRUNCATE TABLE command. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-446 This PR implements this command for truncating tables, excluding column truncation (HIVE-4005). ## How was this patch tested? Added a test case. Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #13170 from lianhuiwang/truncate.
*	[SPARK-15313][SQL] EmbedSerializerInFilter rule should keep exprIds of ↵	Takuya UESHIN	2016-05-19	3	-3/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	output of surrounded SerializeFromObject. ## What changes were proposed in this pull request? The following code: ``` val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS() ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_)) ``` throws an Exception: ``` org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _1#420 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) ... Cause: java.lang.RuntimeException: Couldn't find _1#420 in [_1#416,_2#417] at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) ... ``` This is because `EmbedSerializerInFilter` rule drops the `exprId`s of output of surrounded `SerializeFromObject`. 
The analyzed and optimized plans of the above example are as follows: ``` == Analyzed Logical Plan == _1: string Project [_1#420] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421] +- Filter <function1>.apply +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2 +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] == Optimized Logical Plan == !Project [_1#420] +- Filter <function1>.apply +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] ``` This PR fixes `EmbedSerializerInFilter` rule to keep `exprId`s of output of surrounded `SerializeFromObject`. The plans after this patch are as follows: ``` == Analyzed Logical Plan == _1: string Project [_1#420] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421] +- Filter <function1>.apply +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2 +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] == Optimized Logical Plan == Project [_1#416] +- Filter <function1>.apply +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]] ``` ## How was this patch tested? Existing tests and I added a test to check if `filter and then select` works. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13096 from ueshin/issues/SPARK-15313.
*	[SPARK-14261][SQL] Memory leak in Spark Thrift Server	Oleg Danilov	2016-05-19	1	-0/+2
\| \| \| \| \| \| \| \|	Fixed memory leak (HiveConf in the CommandProcessorFactory) Author: Oleg Danilov <oleg.danilov@wandisco.com> Closes #12932 from dosoft/SPARK-14261.
*	[SPARK-14990][SQL] Fix checkForSameTypeInputExpr (ignore nullability)	Reynold Xin	2016-05-19	2	-4/+56
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch fixes a bug in TypeUtils.checkForSameTypeInputExpr. Previously the code was testing for strict equality, which does not take nullability into account. This is based on https://github.com/apache/spark/pull/12768. This patch fixed a bug there (with empty expression) and added a test case. ## How was this patch tested? Added a new test suite and test case. Closes #12768. Author: Reynold Xin <rxin@databricks.com> Author: Oleg Danilov <oleg.danilov@wandisco.com> Closes #13208 from rxin/SPARK-14990.
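The difference between strict equality and a nullability-insensitive comparison can be sketched with toy type objects. The classes `IntType` and `ArrayType` and the helpers below are simplified stand-ins, not Spark's `DataType` hierarchy; the empty-input case is included because the message mentions a bug with an empty expression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntType:
    pass


@dataclass(frozen=True)
class ArrayType:
    element: object
    contains_null: bool


def strip_nullability(data_type):
    """Normalise nullability flags so that they no longer affect equality."""
    if isinstance(data_type, ArrayType):
        return ArrayType(strip_nullability(data_type.element), True)
    return data_type


def same_type(types):
    """True when all types agree once nullability is ignored (also True for no types)."""
    return len({strip_nullability(t) for t in types}) <= 1


a = ArrayType(IntType(), contains_null=True)
b = ArrayType(IntType(), contains_null=False)
assert a != b                # strict equality sees two different types
assert same_type([a, b])     # ignoring nullability, they match
assert same_type([])         # empty input is handled
```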
*	[SPARK-15075][SPARK-15345][SQL] Clean up SparkSession builder and propagate ↵	Reynold Xin	2016-05-19	43	-357/+367
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	config options to existing sessions if specified ## What changes were proposed in this pull request? Currently SparkSession.Builder uses SQLContext.getOrCreate. It should probably be the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that. This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession. ## How was this patch tested? Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches. Author: Reynold Xin <rxin@databricks.com> Closes #13200 from rxin/SPARK-15075.
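The propagation behaviour can be sketched with a toy builder. The classes `Session` and `Builder` are hypothetical stand-ins and do not reproduce the real `SparkSession` API; the sketch only shows options being applied to whichever session `get_or_create` returns, whether existing or new.

```python
class Session:
    """Toy session holding a mutable config map (hypothetical)."""

    _active = None  # the single shared session, if one exists

    def __init__(self):
        self.conf = {}


class Builder:
    """Toy builder that collects options and applies them on get_or_create."""

    def __init__(self):
        self._options = {}

    def config(self, key, value):
        self._options[key] = value
        return self  # allow chaining

    def get_or_create(self):
        if Session._active is None:
            Session._active = Session()
        # Apply the options to the session whether it is new or pre-existing
        Session._active.conf.update(self._options)
        return Session._active


first = Builder().config("a", "1").get_or_create()
second = Builder().config("b", "2").get_or_create()
assert first is second                          # the existing session is reused
assert second.conf == {"a": "1", "b": "2"}      # and still received the new option
```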
*	[SPARK-11827][SQL] Adding java.math.BigInteger support in Java type ↵	Kevin Yu	2016-05-20	8	-6/+76
\| \| \| \| \| \| \| \| \| \|	inference for POJOs and Java collections This PR adds support for java.math.BigInteger in the Java bean code path. Internally, Spark already converts BigInteger to BigDecimal in ColumnType.scala and CatalystRowConverter.scala. This change takes a similar approach and converts the BigInteger to a BigDecimal. Author: Kevin Yu <qyu@us.ibm.com> Closes #10125 from kevinyu98/working_on_spark-11827.
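As a rough illustration of the conversion described above, the sketch below widens a `java.math.BigInteger` to a `java.math.BigDecimal` using only the JDK. The class and method names are hypothetical, and this is not the code from the patch itself.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class BigIntegerWidening {
    // Hypothetical helper: represent a BigInteger as a BigDecimal with scale 0,
    // so that no digits are lost in the conversion.
    static BigDecimal toDecimal(BigInteger value) {
        return new BigDecimal(value);
    }

    public static void main(String[] args) {
        BigInteger big = new BigInteger("123456789012345678901234567890");
        BigDecimal dec = toDecimal(big);
        System.out.println(dec + " scale=" + dec.scale());
    }
}
```

Because `BigDecimal` can hold any `BigInteger` exactly, the conversion is lossless; how the resulting decimal maps onto a SQL decimal precision is a separate question that this sketch does not address.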
*	[SPARK-15321] Fix bug where Array[Timestamp] cannot be encoded/decoded correctly	Sumedh Mungee	2016-05-20	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Fix `MapObjects.itemAccessorMethod` to handle `TimestampType`. Without this fix, `Array[Timestamp]` cannot be properly encoded or decoded. To reproduce this, in `ExpressionEncoderSuite`, if you add the following test case: `encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of timestamp") ` ... you will see that (without this fix) it fails with the following output: ``` - encode/decode for array of timestamp: [Ljava.sql.Timestamp;fd9ebde * FAILED * Exception thrown while decoding Converted: [0,1000000010,800000001,52a7ccdc36800] Schema: value#61615 root -- value: array (nullable = true) \|-- element: timestamp (containsNull = true) Encoder: class[value[0]: array<timestamp>] (ExpressionEncoderSuite.scala:312) ``` ## How was this patch tested? Existing tests Author: Sumedh Mungee <smungee@gmail.com> Closes #13108 from smungee/fix-itemAccessorMethod.
*	Closes #11915	Xiangrui Meng	2016-05-19	0	-0/+0
\| \| \| \| \|	Closes #8648 Closes #13089
*	[SPARK-15296][MLLIB] Refactor All Java Tests that use SparkSession	Sandeep Singh	2016-05-19	59	-1148/+207
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Refactor all Java tests that use SparkSession to extend SharedSparkSession. ## How was this patch tested? Existing tests. Author: Sandeep Singh <sandeep@techaddict.me> Closes #13101 from techaddict/SPARK-15296.
*	[SPARK-15416][SQL] Display a better message for not finding classes removed ↵	Shixiong Zhu	2016-05-19	1	-17/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	in Spark 2.0 ## What changes were proposed in this pull request? When a `NoClassDefFoundError` or `ClassNotFoundException` is thrown, check whether the class was removed in Spark 2.0. If so, the user is most likely using an incompatible library and we can provide a better message. ## How was this patch tested? 1. Run `bin/pyspark --packages com.databricks:spark-avro_2.10:2.0.1` 2. Type `sqlContext.read.format("com.databricks.spark.avro").load("src/test/resources/episodes.avro")`. It will show `java.lang.ClassNotFoundException: org.apache.spark.sql.sources.HadoopFsRelationProvider is removed in Spark 2.0. Please check if your library is compatible with Spark 2.0` Author: Shixiong Zhu <shixiong@databricks.com> Closes #13201 from zsxwing/better-message.
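The idea of rewriting the error for a known-removed class can be sketched as follows. This is a simplified illustration, not the code from the patch: the class `RemovedClassMessage`, the method `improve`, and the one-entry set are hypothetical, and the sketch assumes the exception's message is the bare class name. Only the class name and the message wording are taken from the commit message above.

```java
import java.util.Collections;
import java.util.Set;

public class RemovedClassMessage {
    // Class name taken from the commit message; the set itself is hypothetical
    // and a real list would likely contain more entries.
    static final Set<String> REMOVED_IN_2_0 =
        Collections.singleton("org.apache.spark.sql.sources.HadoopFsRelationProvider");

    // Assumes the exception's message is the bare class name. If that class is
    // known to be removed, wrap the exception with a more helpful message.
    static ClassNotFoundException improve(ClassNotFoundException e) {
        String name = e.getMessage();
        if (name != null && REMOVED_IN_2_0.contains(name)) {
            return new ClassNotFoundException(
                name + " is removed in Spark 2.0. "
                    + "Please check if your library is compatible with Spark 2.0", e);
        }
        return e; // unknown class: leave the original error untouched
    }

    public static void main(String[] args) {
        ClassNotFoundException original =
            new ClassNotFoundException("org.apache.spark.sql.sources.HadoopFsRelationProvider");
        System.out.println(improve(original).getMessage());
    }
}
```

Keeping the original exception as the cause preserves the stack trace for anyone who needs to see where the lookup failed.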
*	[MINOR][ML][PYSPARK] ml.evaluation Scala and Python API sync	Yanbo Liang	2016-05-19	2	-5/+2
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? ```ml.evaluation``` Scala and Python API sync. ## How was this patch tested? Only API docs change, no new tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13195 from yanboliang/evaluation-doc.
*	[SPARK-15341][DOC][ML] Add documentation for "model.write" to clarify ↵	Yanbo Liang	2016-05-19	5	-2/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	"summary" was not saved ## What changes were proposed in this pull request? Currently ```model.write``` does not save ```summary``` (if applicable). We should add documentation to clarify this. We also fixed the incorrect link ```[[MLWriter]]``` to ```[[org.apache.spark.ml.util.MLWriter]]```. ## How was this patch tested? Documentation update, no unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13131 from yanboliang/spark-15341.
*	[SPARK-15375][SQL][STREAMING] Add ConsoleSink to structure streaming	jerryshao	2016-05-19	3	-0/+76
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add ConsoleSink to structured streaming. Users can use it to display DataFrames on the console (useful for debugging and demonstrating), similar to the functionality of `DStream#print`. To use it: ``` val query = result.write .format("console") .trigger(ProcessingTime("2 seconds")) .startStream() ``` ## How was this patch tested? Verified locally. Not sure whether it is suitable to add to structured streaming; please review and comment, thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #13162 from jerryshao/SPARK-15375.
*	[SPARK-15414][MLLIB] Make the mllib,ml linalg type conversion APIs public	Sandeep Singh	2016-05-19	2	-18/+42
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Open up APIs for converting between new, old linear algebra types (in spark.mllib.linalg): `Sparse`/`Dense` X `Vector`/`Matrices` `.asML` and `.fromML` ## How was this patch tested? Existing Tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13202 from techaddict/SPARK-15414.
*	[SPARK-15361][ML] ML 2.0 QA: Scala APIs audit for ml.clustering	Yanbo Liang	2016-05-19	4	-21/+43
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Audit the Scala API for ml.clustering. Fix some incorrect API documentation and update outdated entries. ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13148 from yanboliang/spark-15361.
*	[SPARK-15411][ML] Add @since to ml.stat.MultivariateOnlineSummarizer.scala	DB Tsai	2016-05-19	2	-5/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add since to ml.stat.MultivariateOnlineSummarizer.scala ## How was this patch tested? unit tests Author: DB Tsai <dbt@netflix.com> Closes #13197 from dbtsai/cleanup.