Commit message | Author | Age | Files | Lines
* [SPARK-15841][TESTS] REPLSuite has incorrect env set for a couple of tests. | Prashant Sharma | 2016-06-09 | 2 | -4/+4

  Description from JIRA: in ReplSuite, a test that can be exercised well in local mode should not have to start a local-cluster; and conversely, a test is insufficiently run if it is meant to fix a problem with distributed execution but only runs in a local environment.

  Tested with existing tests.

  Author: Prashant Sharma <prashsh1@in.ibm.com>

  Closes #13574 from ScrapCodes/SPARK-15841/repl-suite-fix.
* [SPARK-12447][YARN] Only update the states when executor is successfully launched | jerryshao | 2016-06-09 | 2 | -30/+47

  The details are described in https://issues.apache.org/jira/browse/SPARK-12447. vanzin, please help to review, thanks a lot.

  Author: jerryshao <sshao@hortonworks.com>

  Closes #10412 from jerryshao/SPARK-12447.
* [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions | Herman van Hovell | 2016-06-09 | 1 | -24/+24

  ## What changes were proposed in this pull request?

  The current implementations of `UnixTime` and `FromUnixTime` do not cache their parser/formatter as much as they could. This PR resolves that issue. It is a take-over from https://github.com/apache/spark/pull/13522 and further optimizes the re-use of the parser/formatter. It also improves exception handling (catching the actual exception instead of `Throwable`). All credits for this work should go to rajeshbalamohan. This PR closes https://github.com/apache/spark/pull/13522

  ## How was this patch tested?

  Current tests.

  Author: Herman van Hovell <hvanhovell@databricks.com>
  Author: Rajesh Balamohan <rbalamohan@apache.org>

  Closes #13581 from hvanhovell/SPARK-14321.
* [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_HOME is set | Josh Rosen | 2016-06-09 | 1 | -6/+23

  ## What changes were proposed in this pull request?

  It looks like the nightly Maven snapshots broke after we set `JAVA_7_HOME` in the build: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/. It seems that passing `-javabootclasspath` to ScalaDoc using scala-maven-plugin ends up preventing the Scala library classes from being added to scalac's internal class path, causing compilation errors while building doc-jars. There might be a principled fix to this inside the scala-maven-plugin itself, but for now this patch configures the build to omit the `-javabootclasspath` option during Maven doc-jar generation.

  ## How was this patch tested?

  Tested manually with `build/mvn clean install -DskipTests=true` when `JAVA_7_HOME` was set. Also manually inspected the effective POM diff to verify that the final POM changes were scoped correctly: https://gist.github.com/JoshRosen/f889d1c236fad14fa25ac4be01654653

  /cc vanzin and yhuai for review.

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #13573 from JoshRosen/SPARK-15839.
* [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central | Josh Rosen | 2016-06-09 | 2 | -28/+9

  Spark's SBT build currently uses a fork of the sbt-pom-reader plugin, but depends on that fork via an SBT subproject which is cloned from https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This unnecessarily slows down the initial build on fresh machines and risks build breakage in case that GitHub repository ever changes or is deleted. In order to address these issues, I have published a pre-built binary of our forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` namespace and have updated Spark's build to use that artifact. This published artifact was built from https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains the contents of ScrapCodes's branch plus an additional patch to configure the build for artifact publication.

  /cc srowen ScrapCodes for review.

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #13564 from JoshRosen/use-published-fork-of-pom-reader.
* [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property | Jeff Zhang | 2016-06-09 | 1 | -0/+10

  ## What changes were proposed in this pull request?

  Add an `idf` property to IDF in PySpark.

  ## How was this patch tested?

  Added a unit test.

  Author: Jeff Zhang <zjffdu@apache.org>

  Closes #13540 from zjffdu/SPARK-15788.
* [SPARK-15804][SQL] Include metadata in the toStructType | Kevin Yu | 2016-06-09 | 2 | -1/+16

  ## What changes were proposed in this pull request?

  The helper function `toStructType` in the AttributeSeq class doesn't include the metadata when it builds the StructField, which causes the problem reported in https://issues.apache.org/jira/browse/SPARK-15804 when Spark writes a dataframe with metadata to the Parquet datasource. The code path: when Spark writes the dataframe to the Parquet datasource through InsertIntoHadoopFsRelationCommand, it builds the WriteRelation container and calls the helper function `toStructType` to create the StructType containing the StructFields. The metadata should be included there; otherwise, we lose the user-provided metadata.

  ## How was this patch tested?

  Added a test case in ParquetQuerySuite.scala.

  Author: Kevin Yu <qyu@us.ibm.com>

  Closes #13555 from kevinyu98/spark-15804.
* [SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2 | Adam Roberts | 2016-06-09 | 4 | -48/+48

  ## What changes were proposed in this pull request?

  Update the Hadoop version from 2.7.0 to 2.7.2 when the Hadoop-2.7 build profile is used.

  ## How was this patch tested?

  Existing tests.

  I'd like us to use Hadoop 2.7.2 because the Hadoop release notes state that 2.7.0 is not ready for production use. https://hadoop.apache.org/docs/r2.7.0/ states: "Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.6.0. This release is not yet ready for production use. Production users should use 2.7.1 release and beyond." The Hadoop 2.7.1 release notes say: "Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building upon the previous release 2.7.0. This is the next stable release after Apache Hadoop 2.6.x." And the Hadoop 2.7.2 release notes say: "Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building upon the previous stable release 2.7.1."

  I've tested that this is OK with Intel hardware and IBM Java 8, so let's test it with OpenJDK; ideally this will be pushed to branch-2.0 and master.

  Author: Adam Roberts <aroberts@uk.ibm.com>

  Closes #13556 from a-roberts/patch-2.
* [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache | Josh Rosen | 2016-06-09 | 1 | -1/+1

  This patch fixes a bug in `./dev/test-dependencies.sh` which caused spurious failures when the script was run on a machine with an empty `.m2` cache. The problem was that extra log output from the dependency download was conflicting with the grep / regex used to identify the classpath in the Maven output. This patch fixes the issue by adjusting the regex pattern.

  Tested manually with the following reproduction of the bug:

  ```
  rm -rf ~/.m2/repository/org/apache/commons/
  ./dev/test-dependencies.sh
  ```

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #13568 from JoshRosen/SPARK-12712.
* [MINOR][DOC] In Dataset docs, remove self link to Dataset and add link to Column | Sandeep Singh | 2016-06-08 | 1 | -100/+100

  ## What changes were proposed in this pull request?

  Documentation fix.

  ## How was this patch tested?

  Author: Sandeep Singh <sandeep@techaddict.me>

  Closes #13567 from techaddict/minor-4.
* [SPARK-14670][SQL] Allow updating driver side sql metrics | Wenchen Fan | 2016-06-08 | 3 | -8/+85

  ## What changes were proposed in this pull request?

  On the Spark UI right now we have the SQLTab that displays accumulator values per operator. However, it only displays metrics updated on the executors, not on the driver. It is useful to also include driver metrics, e.g. broadcast time. This is a different version from https://github.com/apache/spark/pull/12427: this PR sends driver side accumulator updates right after the updating happens, not at the end of execution, via a new event.

  ## How was this patch tested?

  New test in `SQLListenerSuite`.

  ![qq20160606-0](https://cloud.githubusercontent.com/assets/3182036/15841418/0eb137da-2c06-11e6-9068-5694eeb78530.png)

  Author: Wenchen Fan <wenchen@databricks.com>

  Closes #13189 from cloud-fan/metrics.
* [SPARK-15735] Allow specifying min time to run in microbenchmarks | Eric Liang | 2016-06-08 | 1 | -37/+72

  ## What changes were proposed in this pull request?

  This makes microbenchmarks run for at least 2 seconds by default, to allow some time for JIT compilation to kick in.

  ## How was this patch tested?

  Tested manually with existing microbenchmarks. This change is backwards compatible in that existing microbenchmarks which specified numIters per-case will still run exactly that number of iterations. Microbenchmarks which previously overrode defaultNumIters now override minNumIters.

  cc hvanhovell

  Author: Eric Liang <ekl@databricks.com>
  Author: Eric Liang <ekhliang@gmail.com>

  Closes #13472 from ericl/spark-15735.
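  The min-runtime idea in miniature, as a sketch (this is not the actual `Benchmark` class; the function name is made up):

  ```scala
  // Keep iterating until both a minimum iteration count and a minimum elapsed
  // time are reached, so the JIT can warm up before the numbers are trusted.
  def runAtLeast(minTimeMs: Long, minIters: Int)(body: => Unit): (Long, Int) = {
    val start = System.nanoTime()
    var iters = 0
    while (iters < minIters || (System.nanoTime() - start) / 1000000L < minTimeMs) {
      body
      iters += 1
    }
    ((System.nanoTime() - start) / 1000000L, iters)
  }

  // e.g. runAtLeast(minTimeMs = 2000, minIters = 2) { workloadUnderTest() }
  ```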
* [DOCUMENTATION] Fixed target JAR path | prabs | 2016-06-08 | 1 | -2/+2

  ## What changes were proposed in this pull request?

  The Scala version mentioned in the sbt configuration file is 2.11, so the path of the target JAR should be `/target/scala-2.11/simple-project_2.11-1.0.jar`.

  ## How was this patch tested?

  n/a

  Author: prabs <prabsmails@gmail.com>
  Author: Prabeesh K <prabsmails@gmail.com>

  Closes #13554 from prabeesh/master.
* [MINOR] Fix Java Lint errors introduced by #13286 and #13280 | Sandeep Singh | 2016-06-08 | 3 | -6/+6

  ## What changes were proposed in this pull request?

  Revived #13464: fix Java lint errors introduced by #13286 and #13280.

  Before:

  ```
  Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
  Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
  Checkstyle checks failed at following occurrences:
  [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
  [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
  [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
  [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
  [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
  [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
  [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
  [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
  ```

  ## How was this patch tested?

  Ran `dev/lint-java` locally.

  Author: Sandeep Singh <sandeep@techaddict.me>

  Closes #13559 from techaddict/minor-3.
* [SPARK-15793][ML] Add maxSentenceLength for ml.Word2Vec | yinxusen | 2016-06-08 | 2 | -0/+20

  ## What changes were proposed in this pull request?

  https://issues.apache.org/jira/browse/SPARK-15793

  Word2Vec in the ML package should have a maxSentenceLength method for feature parity.

  ## How was this patch tested?

  Tested with Spark unit tests.

  Author: yinxusen <yinxusen@gmail.com>

  Closes #13536 from yinxusen/SPARK-15793.
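  A minimal usage sketch of the new param, assuming Spark 2.0+ where `ml.feature.Word2Vec` gained `setMaxSentenceLength` (the toy corpus here is made up):

  ```scala
  import org.apache.spark.ml.feature.Word2Vec
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local").appName("w2v").getOrCreate()
  // Toy corpus: each row is a tokenized sentence (Array[String]).
  val docs = spark.createDataFrame(Seq(
    "spark makes big data simple".split(" "),
    "word2vec learns word embeddings".split(" ")
  ).map(Tuple1.apply)).toDF("text")

  val model = new Word2Vec()
    .setInputCol("text").setOutputCol("vec")
    .setVectorSize(8)
    .setMinCount(0)
    .setMaxSentenceLength(500) // sentences longer than this are chunked
    .fit(docs)
  ```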
* [SPARK-15789][SQL] Allow reserved keywords in most places | Herman van Hovell | 2016-06-07 | 6 | -28/+35

  ## What changes were proposed in this pull request?

  The parser currently does not allow the use of some SQL keywords as table or field names. This PR adds support for all keywords as identifiers. The exception to this are table aliases; in this case most keywords are allowed except for join keywords (`anti, full, inner, left, semi, right, natural, on, join, cross`) and set-operator keywords (`union, intersect, except`).

  ## How was this patch tested?

  I have added/moved/renamed tests in the catalyst `*ParserSuite`s.

  Author: Herman van Hovell <hvanhovell@databricks.com>

  Closes #13534 from hvanhovell/SPARK-15789.
* [SPARK-15580][SQL] Add ContinuousQueryInfo to make ContinuousQueryListener events serializable | Shixiong Zhu | 2016-06-07 | 8 | -76/+203

  ## What changes were proposed in this pull request?

  This PR adds ContinuousQueryInfo to make ContinuousQueryListener events serializable in order to support writing events into the event log.

  ## How was this patch tested?

  Jenkins unit tests.

  Author: Shixiong Zhu <shixiong@databricks.com>

  Closes #13335 from zsxwing/query-info.
* [SPARK-14485][CORE] Ignore task finished for executor lost and removed by driver | zhonghaihua | 2016-06-07 | 1 | -1/+13

  Currently, when an executor is removed by the driver due to heartbeat timeout, the driver re-queues the tasks running on that executor and sends a kill command to the cluster to kill that executor. However, a running task on that executor may finish and return its result to the driver before the executor is killed. In this situation, the driver accepts the task-finished event and ignores the speculative and re-queued tasks; but because the executor has been removed by the driver, the result of the finished task cannot be saved on the driver, since its BlockManagerId has also been removed from BlockManagerMaster. The result data of the stage is therefore incomplete, which then causes fetch failures. For more details, see [SPARK-14485](https://issues.apache.org/jira/browse/SPARK-14485). This PR introduces a mechanism to ignore this kind of task-finished event.

  Tests: N/A

  Author: zhonghaihua <793507405@qq.com>

  Closes #12258 from zhonghaihua/ignoreTaskFinishForExecutorLostAndRemovedByDriver.
* [SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference | Yanbo Liang | 2016-06-07 | 4 | -1/+28

  ## What changes were proposed in this pull request?

  When fitting a `LinearRegressionModel` (with the "l-bfgs" solver) or a `LogisticRegressionModel` without intercept on a dataset with a constant nonzero column, spark.ml produces the same model as R glmnet but a different one from LIBSVM. When fitting an `AFTSurvivalRegressionModel` without intercept on a dataset with a constant nonzero column, spark.ml produces a different model compared with R survival::survreg. We should output a warning message and clarify this condition in the documentation.

  ## How was this patch tested?

  Documentation change, no unit test.

  cc mengxr

  Author: Yanbo Liang <ybliang8@gmail.com>

  Closes #12731 from yanboliang/spark-13590.
* [SPARK-15674][SQL] Deprecates "CREATE TEMPORARY TABLE USING...", uses "CREATE TEMPORARY VIEW USING..." instead | Sean Zhong | 2016-06-07 | 5 | -6/+47

  ## What changes were proposed in this pull request?

  The current implementation of "CREATE TEMPORARY TABLE USING datasource..." does NOT create any intermediate temporary data directory like a temporary HDFS folder; instead, it only stores a SQL string in memory. We should probably use "TEMPORARY VIEW" instead. This PR assumes a temporary table has to be linked with some temporary intermediate data. It follows this definition of a temporary table (from the [hortonworks doc](https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/temp-tables.html)):

  > A temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query

  **Example**:

  ```
  scala> spark.sql("CREATE temporary view my_tab7 (c1: String, c2: String) USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
  scala> spark.sql("select c1, c2 from my_tab7").show()
  +----+-----+
  |  c1|   c2|
  +----+-----+
  |year| make|
  |2012|Tesla|
  ...
  ```

  It NOW prints a **deprecation warning** if "CREATE TEMPORARY TABLE USING..." is used:

  ```
  scala> spark.sql("CREATE temporary table my_tab7 (c1: String, c2: String) USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
  16/05/31 10:39:27 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE tableName USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead
  ```

  ## How was this patch tested?

  Unit test.

  Author: Sean Zhong <seanzhong@databricks.com>

  Closes #13414 from clockfly/create_temp_view_using.
* [SPARK-15760][DOCS] Add documentation for package-related configs. | Marcelo Vanzin | 2016-06-07 | 1 | -0/+47

  While there, also document spark.files and spark.jars. Text is the same as the spark-submit help text with some minor adjustments.

  Author: Marcelo Vanzin <vanzin@cloudera.com>

  Closes #13502 from vanzin/SPARK-15760.
* [SPARK-15684][SPARKR] Not mask startsWith and endsWith in R | wm624@hotmail.com | 2016-06-07 | 3 | -3/+44

  ## What changes were proposed in this pull request?

  In R 3.3.0, startsWith and endsWith were added. In this PR, I make the two work in SparkR:

  1. Remove the signatures in generic.R
  2. Add setMethod in column.R
  3. Add unit tests

  ## How was this patch tested?

  Manually tested through the SparkR shell for both column data and string data, which are added into the unit test file.

  Author: wm624@hotmail.com <wm624@hotmail.com>

  Closes #13476 from wangmiao1981/start.
* [MINOR] Fix typos in documents | WeichenXu | 2016-06-07 | 3 | -3/+3

  ## What changes were proposed in this pull request?

  I used spell-check tools to find typos in the Spark documents and fixed them.

  ## How was this patch tested?

  N/A

  Author: WeichenXu <WeichenXu123@outlook.com>

  Closes #13538 from WeichenXu123/fix_doc_typo.
* [SPARK-15792][SQL] Allows operator to change the verbosity in explain output | Sean Zhong | 2016-06-06 | 7 | -18/+55

  ## What changes were proposed in this pull request?

  This PR allows customization of verbosity in explain output. After the change, `dataframe.explain()` and `dataframe.explain(true)` have different verbosity output for the physical plan. Currently, this PR only enables the verbosity string for the operators `HashAggregateExec` and `SortAggregateExec`. We will gradually enable the verbosity string for more operators in the future.

  **Less verbose mode:** dataframe.explain(extended = false)

  `output=[count(a)#85L]` is **NOT** displayed for HashAggregate.

  ```
  scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2")
  scala> spark.sql("select count(a) from df2").explain()
  == Physical Plan ==
  *HashAggregate(key=[], functions=[count(1)])
  +- Exchange SinglePartition
     +- *HashAggregate(key=[], functions=[partial_count(1)])
        +- LocalTableScan
  ```

  **Verbose mode:** dataframe.explain(extended = true)

  `output=[count(a)#85L]` is displayed for HashAggregate.

  ```
  scala> spark.sql("select count(a) from df2").explain(true)
  // "output=[count(a)#85L]" is added
  ...
  == Physical Plan ==
  *HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L])
  +- Exchange SinglePartition
     +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L])
        +- LocalTableScan
  ```

  ## How was this patch tested?

  Manual test.

  Author: Sean Zhong <seanzhong@databricks.com>

  Closes #13535 from clockfly/verbose_breakdown_2.
* [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema | Sean Zhong | 2016-06-06 | 5 | -10/+31

  ## What changes were proposed in this pull request?

  This PR makes sure the typed Filter doesn't change the Dataset schema.

  **Before the change:**

  ```
  scala> val df = spark.range(0,9)
  scala> df.schema
  res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
  scala> val afterFilter = df.filter(_=>true)
  scala> afterFilter.schema
  // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true.
  res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true))
  ```

  SerializeFromObject and DeserializeToObject are inserted to wrap the Filter, and these two can possibly change the schema of the Dataset.

  **After the change:**

  ```
  scala> afterFilter.schema
  // schema is NOT changed.
  res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
  ```

  ## How was this patch tested?

  Unit test.

  Author: Sean Zhong <seanzhong@databricks.com>

  Closes #13529 from clockfly/spark-15632.
* [SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of SparkLauncher | Subroto Sanyal | 2016-06-06 | 3 | -1/+38

  ## What changes were proposed in this pull request?

  This situation can happen when the LauncherConnection gets an exception while reading through the socket and terminates silently without notification, making the client/listener think that the job is still in its previous state. The fix force-sends a notification to the client that the job finished with unknown status and lets the client handle it accordingly.

  ## How was this patch tested?

  Added a unit test.

  Author: Subroto Sanyal <ssanyal@datameer.com>

  Closes #13497 from subrotosanyal/SPARK-15652-handle-spark-submit-jvm-crash.
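  A sketch of how a client might observe the new state through the launcher listener API, assuming the Spark 2.0+ `SparkAppHandle` API (the jar path and main class here are hypothetical):

  ```scala
  import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

  val handle = new SparkLauncher()
    .setAppResource("/path/to/app.jar")   // hypothetical application jar
    .setMainClass("com.example.Main")     // hypothetical main class
    .startApplication(new SparkAppHandle.Listener {
      override def stateChanged(h: SparkAppHandle): Unit = {
        // LOST is reported when the connection to the launched app breaks,
        // so the client no longer hangs in the last-known state.
        if (h.getState == SparkAppHandle.State.LOST) {
          println("launcher connection lost; treating the job as failed")
        }
      }
      override def infoChanged(h: SparkAppHandle): Unit = ()
    })
  ```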
* [SPARK-15783][CORE] Still some flakiness in these blacklist tests so ignore for now | Imran Rashid | 2016-06-06 | 2 | -3/+8

  ## What changes were proposed in this pull request?

  There is still some flakiness in BlacklistIntegrationSuite, so turn it off for the moment to avoid breaking more builds; it will be turned back on with more fixes.

  ## How was this patch tested?

  Jenkins.

  Author: Imran Rashid <irashid@cloudera.com>

  Closes #13528 from squito/ignore_blacklist.
* [SPARK-15764][SQL] Replace N^2 loop in BindReferences | Josh Rosen | 2016-06-06 | 6 | -15/+40

  BindReferences contains an O(n^2) loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because input can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n). Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. Perf. benchmarks to follow.

  /cc ericl

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #13505 from JoshRosen/bind-references-improvement.
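  The idea in miniature, as a sketch with made-up types (not Spark's actual `AttributeSeq` code):

  ```scala
  // Stand-in for an attribute reference identified by an expression id.
  case class Attr(exprId: Long, nullable: Boolean)

  class AttrIndex(input: Seq[Attr]) {
    // Materialize once so positional access is O(1) even if `input` was a List.
    private val attrs: Array[Attr] = input.toArray
    // Precomputed expression-id -> ordinal map; built once, queried many times.
    private val ordinals: Map[Long, Int] =
      attrs.iterator.zipWithIndex.map { case (a, i) => a.exprId -> i }.toMap

    def indexOf(exprId: Long): Int = ordinals.getOrElse(exprId, -1)
    def nullableAt(ordinal: Int): Boolean = attrs(ordinal).nullable
  }
  ```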
* [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public | Joseph K. Bradley | 2016-06-06 | 2 | -3/+163

  ## What changes were proposed in this pull request?

  Made DefaultParamsReadable and DefaultParamsWritable public. Also added relevant docs and annotations, and added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable/Writable.

  ## How was this patch tested?

  Wrote an example making use of the now-public APIs; compiled and ran it locally.

  Author: Joseph K. Bradley <joseph@databricks.com>

  Closes #13461 from jkbradley/defaultparamswritable.
* [SPARK-14279][BUILD] Pick the spark version from pom | Dhruve Ashar | 2016-06-06 | 6 | -8/+150

  ## What changes were proposed in this pull request?

  Change the way Spark picks up version information, and embed the build information to better identify the Spark version running. More context can be found here: https://github.com/apache/spark/pull/12152

  ## How was this patch tested?

  Ran the mvn and sbt builds to verify the version information was being displayed correctly when executing `spark-submit --version`.

  ![image](https://cloud.githubusercontent.com/assets/7732317/15197251/f7c673a2-1795-11e6-8b2f-88f2a70cf1c1.png)

  Author: Dhruve Ashar <dhruveashar@gmail.com>

  Closes #13061 from dhruve/impr/SPARK-14279.
* [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precision, recall, f1 | Zheng RuiFeng | 2016-06-06 | 1 | -0/+18

  ## What changes were proposed in this pull request?

  1. Add accuracy to MulticlassMetrics.
  2. Deprecate overall precision, recall, and f1, and recommend using accuracy.

  ## How was this patch tested?

  Manual tests in the PySpark shell.

  Author: Zheng RuiFeng <ruifengz@foxmail.com>

  Closes #13511 from zhengruifeng/deprecate_py_precisonrecall.
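  On the Scala side the corresponding metric reads as below, a sketch assuming Spark 2.0+ where `MulticlassMetrics.accuracy` is available (the toy data is made up):

  ```scala
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.evaluation.MulticlassMetrics

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("metrics"))
  // (prediction, label) pairs for a toy two-class problem.
  val predictionAndLabels = sc.parallelize(Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0)))
  val metrics = new MulticlassMetrics(predictionAndLabels)
  // Preferred over the deprecated overall precision/recall/fMeasure.
  println(s"accuracy = ${metrics.accuracy}")
  ```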
* [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples | Yanbo Liang | 2016-06-06 | 19 | -37/+37

  ## What changes were proposed in this pull request?

  Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated `precision` in `MulticlassClassificationEvaluator`, many ML examples broke:

  ```python
  pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
  ```

  We should use `accuracy` to replace `precision` in these examples.

  ## How was this patch tested?

  Offline tests.

  Author: Yanbo Liang <ybliang8@gmail.com>

  Closes #13519 from yanboliang/spark-15771.
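  The replacement the examples now use, in miniature (a sketch assuming Spark 2.0+; `predictions` stands in for the output of a fitted model's `transform`):

  ```scala
  import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

  val evaluator = new MulticlassClassificationEvaluator()
    .setLabelCol("label")
    .setPredictionCol("prediction")
    .setMetricName("accuracy") // "precision" is no longer a valid metric name
  // val accuracy = evaluator.evaluate(predictions)
  ```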
* [MINOR] Fix Typos 'an -> a' | Zheng RuiFeng | 2016-06-06 | 70 | -102/+102

  ## What changes were proposed in this pull request?

  `an -> a`

  Used commands like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and reviewed them one by one.

  ## How was this patch tested?

  Manual tests.

  Author: Zheng RuiFeng <ruifengz@foxmail.com>

  Closes #13515 from zhengruifeng/an_a.
* Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour"Reynold Xin2016-06-053-55/+48
| | | | This reverts commit b7e8d1cb3ce932ba4a784be59744af8a8ef027ce.
* [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaviour | Takeshi YAMAMURO | 2016-06-05 | 3 | -48/+55

  ## What changes were proposed in this pull request?

  This PR fixes the behaviour of `format("csv").option("quote", null)` to match that of spark-csv. It also explicitly sets default values for the CSV options in Python.

  ## How was this patch tested?

  Added tests in CSVSuite.

  Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

  Closes #13372 from maropu/SPARK-15585.
* [SPARK-15704][SQL] Add a test case in DatasetAggregatorSuite for regression testing | Hiroshi Inoue | 2016-06-05 | 1 | -0/+19

  ## What changes were proposed in this pull request?

  This change fixes a crash in TungstenAggregate while executing the "Dataset complex Aggregator" test case, caused by an IndexOutOfBoundsException. JIRA entry for details: https://issues.apache.org/jira/browse/SPARK-15704

  ## How was this patch tested?

  Using existing unit tests (including DatasetBenchmark).

  Author: Hiroshi Inoue <inouehrs@jp.ibm.com>

  Closes #13446 from inouehrs/fix_aggregate.
* [SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics | Josh Rosen | 2016-06-05 | 3 | -5/+5

  `PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` with a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns. This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #13491 from JoshRosen/foldleft-to-flatmap.
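  The pattern in miniature (a sketch, not the actual `PartitionStatistics` code):

  ```scala
  val lists = Seq(List(1, 2), List(3), List(4, 5))

  // Quadratic-ish: each `++` copies the accumulated list before appending.
  val slow = lists.foldLeft(List.empty[Int])(_ ++ _)

  // Linear: one pass, no repeated copying of the accumulator.
  val fast = lists.flatten

  assert(slow == fast) // same result, very different allocation behavior
  ```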
* [SPARK-15657][SQL] RowEncoder should validate the data type of input object | Wenchen Fan | 2016-06-05 | 4 | -40/+95

  ## What changes were proposed in this pull request?

  This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of the input object, e.g. we should throw an exception when a field is a boolean but is declared as a string column.

  This PR also removes the support for using `Product` as a valid external type of struct type. That support was added at https://github.com/apache/spark/pull/9712 but is incomplete, e.g. nested product and product in array both do not work. However, we never officially supported this feature and I think it's OK to ban it.

  ## How was this patch tested?

  New tests in `RowEncoderSuite`.

  Author: Wenchen Fan <wenchen@databricks.com>

  Closes #13401 from cloud-fan/bug.
* [MINOR][R][DOC] Fix R documentation generation instruction. | Kai Jiang | 2016-06-05 | 2 | -22/+20

  ## What changes were proposed in this pull request?

  Changes in R/README.md:

  - Make the steps for generating the SparkR documentation clearer.
  - Link R/DOCUMENTATION.md from R/README.md.
  - Turn on some code syntax highlighting in R/README.md.

  ## How was this patch tested?

  Local test.

  Author: Kai Jiang <jiangkai@gmail.com>

  Closes #13488 from vectorijk/R-Readme.
* [SPARK-15770][ML] Annotation audit for Experimental and DeveloperApi | Zheng RuiFeng | 2016-06-05 | 13 | -3/+50

  ## What changes were proposed in this pull request?

  1. Remove `:: Experimental ::` comments from non-experimental APIs.
  2. Add `:: Experimental ::` comments to experimental APIs.
  3. Add `:: DeveloperApi ::` comments to DeveloperApi APIs.

  ## How was this patch tested?

  Manual tests.

  Author: Zheng RuiFeng <ruifengz@foxmail.com>

  Closes #13514 from zhengruifeng/del_experimental.
* [SPARK-15723] Fixed local-timezone-brittle test where short-timezone form "EST" is … | Brett Randall | 2016-06-05 | 1 | -1/+2

  ## What changes were proposed in this pull request?

  Stop using the abbreviated and ambiguous timezone "EST" in a test, since it is dependent on the machine-local default timezone and fails in different timezones. Fixed [SPARK-15723](https://issues.apache.org/jira/browse/SPARK-15723).

  ## How was this patch tested?

  Note that to reproduce this problem in any locale/timezone, you can modify the scalatest-maven-plugin argLine to add a timezone: `<argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="Australia/Sydney"</argLine>` and run `mvn test -DwildcardSuites=org.apache.spark.status.api.v1.SimpleDateParamSuite -Dtest=none`. Equally, this will fix it in an affected timezone: `<argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="America/New_York"</argLine>`.

  To test the fix, apply the above change to `pom.xml` to set the test TZ to `Australia/Sydney`, and confirm the test now passes.

  Author: Brett Randall <javabrett@gmail.com>

  Closes #13462 from javabrett/SPARK-15723-SimpleDateParamSuite.
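  The underlying pitfall, sketched outside Spark for illustration only (the actual fix changes a test date string, not production code):

  ```scala
  import java.text.SimpleDateFormat

  // Parsing an abbreviated zone name depends on the JVM's default timezone
  // and locale: on some machines "EST" parses as US Eastern, on others it
  // may fail or resolve to Australian Eastern Standard Time.
  val fmt = new SimpleDateFormat("dd/MM/yyyy HH:mm:ss zzz")
  val ambiguous = fmt.parse("01/01/2016 12:00:00 EST") // machine-dependent

  // Safer: an explicit GMT offset parses the same everywhere.
  val stable = fmt.parse("01/01/2016 12:00:00 GMT-05:00")
  ```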
* [SPARK-15707][SQL] Make Code Neat - Use map instead of if check. | Weiqing Yang | 2016-06-04 | 1 | -6/+2

  ## What changes were proposed in this pull request?

  In the forType function of object RandomDataGenerator, the code of the form `if (maybeSqlTypeGenerator.isDefined) { ... Some(generator) } else { None }` will be changed. Instead, `maybeSqlTypeGenerator.map` will be used.

  ## How was this patch tested?

  All of the current unit tests passed.

  Author: Weiqing Yang <yangweiqing001@gmail.com>

  Closes #13448 from Sherry302/master.
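  The general `Option` idiom being applied, in a standalone sketch (names are made up, not the actual Spark code):

  ```scala
  def build(gen: String): String = s"generator($gen)"
  val maybeGen: Option[String] = Some("int")

  // Before: explicit isDefined/get branching.
  val before: Option[String] =
    if (maybeGen.isDefined) Some(build(maybeGen.get)) else None

  // After: map expresses the same thing in one step and avoids .get.
  val after: Option[String] = maybeGen.map(build)

  assert(before == after)
  ```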
* [SPARK-15762][SQL] Cache Metadata & StructType hashCodes; use singleton Metadata.empty | Josh Rosen | 2016-06-04 | 2 | -3/+7

  We should cache `Metadata.hashCode` and use a singleton for `Metadata.empty` because calculating metadata hashCodes appears to be a bottleneck for certain workloads. We should also cache `StructType.hashCode`. In an optimizer stress-test benchmark run by ericl, these `hashCode` calls accounted for roughly 40% of the total CPU time, and this bottleneck was completely eliminated by the caching added by this patch.

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #13504 from JoshRosen/metadata-fix.
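  The caching pattern in miniature (a sketch, not Spark's actual `Metadata`/`StructType` code):

  ```scala
  final class Meta(val entries: Map[String, String]) {
    // Computed once on first access and reused; safe because Meta is immutable.
    override lazy val hashCode: Int = entries.hashCode()
    override def equals(other: Any): Boolean = other match {
      case m: Meta => entries == m.entries
      case _       => false
    }
  }

  object Meta {
    // Singleton, so the common empty value is hashed at most once globally.
    val empty: Meta = new Meta(Map.empty)
  }
  ```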
* [MINOR][BUILD] Add modernizr MIT license; specify "2014 and onwards" in license copyright | Sean Owen | 2016-06-04 | 3 | -1/+23

  ## What changes were proposed in this pull request?

  Per conversation on the dev list, add the missing modernizr license, and specify "2014 and onwards" in the copyright statement.

  ## How was this patch tested?

  (none required)

  Author: Sean Owen <sowen@cloudera.com>

  Closes #13510 from srowen/ModernizrLicense.
* [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score | Ruifeng Zheng | 2016-06-04 | 4 | -24/+10

  ## What changes were proposed in this pull request?

  1. Delete precision and recall in `ml.MulticlassClassificationEvaluator`.
  2. Update the user guide for `mllib.weightedFMeasure`.

  ## How was this patch tested?

  Local build.

  Author: Ruifeng Zheng <ruifengz@foxmail.com>

  Closes #13390 from zhengruifeng/clarify_f1.
* [SPARK-15756][SQL] Support command 'create table stored as orcfile/parquetfile/avrofile' | Lianhui Wang | 2016-06-03 | 2 | -0/+18

  ## What changes were proposed in this pull request?

  Spark SQL currently supports 'create table src stored as orc/parquet/avro' for ORC/Parquet/Avro tables, but Hive supports both forms: 'stored as orc/parquet/avro' and 'stored as orcfile/parquetfile/avrofile'. This PR adds support for the keywords 'orcfile/parquetfile/avrofile' in Spark SQL.

  ## How was this patch tested?

  Added unit tests.

  Author: Lianhui Wang <lianhuiwang09@gmail.com>

  Closes #13500 from lianhuiwang/SPARK-15756.
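  In concrete terms, both spellings should now be accepted; a sketch assuming a Hive-enabled SparkSession (table names are made up):

  ```scala
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
  // The two forms below are equivalent after this change.
  spark.sql("CREATE TABLE src_orc (id INT, name STRING) STORED AS orc")
  spark.sql("CREATE TABLE src_orc2 (id INT, name STRING) STORED AS orcfile")
  ```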
* [SPARK-15754][YARN] Not letting the credentials containing hdfs delegation tokens to be added in current user credential. | Subroto Sanyal | 2016-06-03 | 1 | -2/+2

  ## What changes were proposed in this pull request?

  The credentials are no longer added to the credentials of UserGroupInformation.getCurrentUser(). Further, if the client has the possibility to log in using a keytab, the updateDelegationToken thread is not started on the client.

  ## How was this patch tested?

  Ran dev/run-tests.

  Author: Subroto Sanyal <ssanyal@datameer.com>

  Closes #13499 from subrotosanyal/SPARK-15754-save-ugi-from-changing.
* [SPARK-15391][SQL] Manage the temporary memory of timsort | Davies Liu | 2016-06-03 | 13 | -64/+115

  ## What changes were proposed in this pull request?

  Currently, the memory for the temporary buffer used by TimSort is always allocated on-heap without bookkeeping, which can cause OOM in both on-heap and off-heap modes. This PR manages that memory by preallocating it together with the pointer array, the same as RadixSort does; it works for both on-heap and off-heap modes. This PR also changes the loadFactor of BytesToBytesMap to 0.5 (it was 0.70), which enables us to use radix sort and also makes sure that we have enough memory for timsort.

  ## How was this patch tested?

  Existing tests.

  Author: Davies Liu <davies@databricks.com>

  Closes #13318 from davies/fix_timsort.
* [SPARK-15168][PYSPARK][ML] Add missing params to MultilayerPerceptronClassifier | Holden Karau | 2016-06-03 | 1 | -9/+66

  ## What changes were proposed in this pull request?

  MultilayerPerceptronClassifier is missing step size, solver, and weights. Add these params. Also clarify the scaladoc a bit while we are updating these params. Eventually we should follow up and unify the HasSolver params (filed https://issues.apache.org/jira/browse/SPARK-15169).

  ## How was this patch tested?

  Doc tests.

  Author: Holden Karau <holden@us.ibm.com>

  Closes #12943 from holdenk/SPARK-15168-add-missing-params-to-MultilayerPerceptronClassifier.
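  For reference, how these params look on the Scala side, which PySpark mirrors here (a sketch assuming Spark 2.0+; `train` is a hypothetical DataFrame of labels and features):

  ```scala
  import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

  val mlp = new MultilayerPerceptronClassifier()
    .setLayers(Array(4, 5, 3)) // input, hidden, and output layer sizes
    .setSolver("gd")           // gradient descent instead of the default l-bfgs
    .setStepSize(0.03)         // step size used by the gd solver
    .setMaxIter(100)
  // val model = mlp.fit(train)
  ```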
* [SPARK-15722][SQL] Disallow specifying schema in CTAS statement | Andrew Or | 2016-06-03 | 6 | -58/+25

  ## What changes were proposed in this pull request?

  As of this patch, the following throws an exception because the schemas may not match:

  ```
  CREATE TABLE students (age INT, name STRING) AS SELECT * FROM boxes
  ```

  but this is OK:

  ```
  CREATE TABLE students AS SELECT * FROM boxes
  ```

  ## How was this patch tested?

  SQLQuerySuite, HiveDDLCommandSuite

  Author: Andrew Or <andrew@databricks.com>

  Closes #13490 from andrewor14/ctas-no-column.