* [SPARK-11485][SQL] Make DataFrameHolder and DatasetHolder public. (Reynold Xin, 2015-11-04, 4 files, -4/+21)
These two classes should be public, since they are used in public code.
Author: Reynold Xin <rxin@databricks.com>
Closes #9445 from rxin/SPARK-11485.
* [SPARK-11235][NETWORK] Add ability to stream data using network lib. (Marcelo Vanzin, 2015-11-04, 20 files, -29/+1196)
The current interface used to fetch shuffle data is not very efficient for large buffers; it requires the receiver to buffer the entire contents being downloaded in memory before processing the data. To use the network library to transfer large files (such as those that can be added using SparkContext addJar / addFile), this change adds a more efficient way of downloading data, by streaming the data and feeding it to a callback as it arrives.
This is achieved by a custom frame decoder that replaces the current netty one; this decoder allows entering a mode where framing is skipped and data is instead provided directly to a callback. The existing netty classes (ByteToMessageDecoder and LengthFieldBasedFrameDecoder) could not be reused since their semantics do not allow for the interception approach the new decoder uses.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #9206 from vanzin/SPARK-11235.
* [SPARK-10622][CORE][YARN] Differentiate dead from "mostly dead" executors. (Marcelo Vanzin, 2015-11-04, 7 files, -50/+157)
In YARN mode, when preemption is enabled, we may leave executors in a zombie state while we wait to retrieve the reason for which the executor exited. This is so that we don't account for failed tasks that were running on a preempted executor.
The issue is that while we wait for this information, the scheduler might decide to schedule tasks on the executor, which will never be able to run them. Other side effects include the block manager still considering the executor available to cache blocks, for example.
So, when we know that an executor went down but we don't know why, stop everything related to the executor, except its running tasks. Only when we know the reason for the exit (or give up waiting for it) do we update the running tasks. This is achieved by a new `disableExecutor()` method in the `Schedulable` interface. For managers that do not behave like this (i.e. every one but YARN), the existing `executorLost()` method will behave the same way it did before.
On top of that change, a few minor changes made debugging easier and fixed some other minor issues:
- The cluster-mode AM was printing a misleading log message every time an executor disconnected from the driver (because the akka actor system was shared between driver and AM).
- Avoid sending unnecessary requests for an executor's exit reason when we already know it was explicitly disabled / killed. This avoids both multiple requests and unnecessary requests that would just cause warning messages on the AM (in the explicit kill case).
- Tone down a log message about the executor being lost when it exited normally (e.g. preemption).
- Wake up the AM monitor thread when requests for executor loss reasons arrive too, so that we can more quickly remove executors from this zombie state.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8887 from vanzin/SPARK-10622.
* [SPARK-11443] Reserve space lines (Xusen Yin, 2015-11-04, 1 file, -1/+1)
The trim_codeblock(lines) function in include_example.rb removes some blank lines in the code.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9400 from yinxusen/SPARK-11443.
* [SPARK-11380][DOCS] Replace example code in mllib-frequent-pattern-mining.md using include_example (Pravin Gadakh, 2015-11-04, 8 files, -161/+387)
Author: Pravin Gadakh <pravingadakh177@gmail.com>
Author: Pravin Gadakh <prgadakh@in.ibm.com>
Closes #9340 from pravingadakh/SPARK-11380.
* [SPARK-9492][ML][R] LogisticRegression in R should provide model statistics (Yanbo Liang, 2015-11-04, 4 files, -8/+37)
Like ML `LinearRegression`, `LogisticRegression` should provide a training summary including feature names and their coefficients.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9303 from yanboliang/spark-9492.
* [SPARK-11442] Reduce numSlices for local metrics test of SparkListenerSuite (tedyu, 2015-11-04, 1 file, -4/+5)
In the thread http://search-hadoop.com/m/q3RTtcQiFSlTxeP/test+failed+due+to+OOME&subj=test+failed+due+to+OOME, it was discussed that memory consumption for SparkListenerSuite should be brought down. This is an attempt in that direction by reducing numSlices for the local metrics test.
Author: tedyu <yuzhihong@gmail.com>
Closes #9384 from tedyu/master.
* [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) (jerryshao, 2015-11-04, 31 files, -183/+213)
This PR is based on the work of roji to support running Spark scripts from symlinks. Thanks for the great work, roji. Would you mind taking a look at this PR? Thanks a lot.
Releases like HDP and others normally expose the Spark executables as symlinks on the `PATH`, but Spark's current scripts do not resolve the real path from a symlink recursively, so Spark fails to execute when invoked through a symlink. This PR solves the issue by finding the absolute path behind the symlink. It does not use `readlink -f` like the earlier PR (https://github.com/apache/spark/pull/2386) did, because `-f` is not supported on Mac, so the path is resolved manually in a loop.
I've tested with Mac and Linux (CentOS), looks fine. This PR did not fix the scripts under the `sbin` folder; not sure if those need to be fixed also? Please help to review, any comment is greatly appreciated.
Author: jerryshao <sshao@hortonworks.com>
Author: Shay Rojansky <roji@roji.org>
Closes #8669 from jerryshao/SPARK-2960.
* [SPARK-11455][SQL] Fix case sensitivity of partition by (Wenchen Fan, 2015-11-03, 4 files, -11/+39)
Depend on `caseSensitive` to do the column name equality check, instead of just `==`.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9410 from cloud-fan/partition.
* [SPARK-11329][SQL] Cleanup from SPARK-11329 fix. (Nong, 2015-11-03, 4 files, -52/+55)
Author: Nong <nong@cloudera.com>
Closes #9442 from nongli/spark-11483.
* [DOC] Missing link to R DataFrame API doc (lewuathe, 2015-11-03, 2 files, -9/+98)
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <lewuathe@me.com>
Closes #9394 from Lewuathe/missing-link-to-R-dataframe.
* [SPARK-11489][SQL] Only include common first order statistics in GroupedData (Reynold Xin, 2015-11-03, 3 files, -207/+28)
We added a bunch of higher order statistics such as skewness and kurtosis to GroupedData. I don't think they are common enough to justify being listed, since users can always use the normal statistics aggregate functions. That is to say, after this change, we won't support

```scala
df.groupBy("key").kurtosis("colA", "colB")
```

However, we will still support

```scala
df.groupBy("key").agg(kurtosis(col("colA")), kurtosis(col("colB")))
```

Author: Reynold Xin <rxin@databricks.com>
Closes #9446 from rxin/SPARK-11489.
* [SPARK-11466][CORE] Avoid mockito in multi-threaded FsHistoryProviderSuite test. (Marcelo Vanzin, 2015-11-03, 2 files, -39/+34)
The test functionality should be the same, but without using mockito; the logs don't really say anything useful, but I suspect mockito may be the cause of the flakiness, since updating mocks when multiple threads may be using them doesn't work very well. It also allows some other cleanup (= less test code in FsHistoryProvider).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #9425 from vanzin/SPARK-11466.
* Fix typo in WebUI (Jacek Laskowski, 2015-11-03, 1 file, -1/+1)
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
Closes #9444 from jaceklaskowski/TImely-fix.
* [SPARK-11477][SQL] Support create Dataset from RDD (Wenchen Fan, 2015-11-04, 3 files, -0/+20)
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9434 from cloud-fan/rdd2ds and squashes the following commits:
0892d72 [Wenchen Fan] support create Dataset from RDD
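A minimal sketch of what this enables, assuming an existing `sc` and `sqlContext` (e.g. in the shell) with the SQL implicits in scope; the RDD contents are made up:

```scala
import sqlContext.implicits._

// With this change an RDD can be converted to a Dataset directly,
// mirroring the existing rdd.toDF() path for DataFrames.
val ds = sc.parallelize(Seq(1, 2, 3)).toDS()
ds.collect()  // Array(1, 2, 3)
```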
* [SPARK-11467][SQL] Add Python API for stddev/variance (Davies Liu, 2015-11-03, 3 files, -67/+105)
Add Python API for stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis.
Author: Davies Liu <davies@databricks.com>
Closes #9424 from davies/py_var.
* [SPARK-11407][SPARKR] Add doc for running from RStudio (felixcheung, 2015-11-03, 1 file, -3/+43)
![image](https://cloud.githubusercontent.com/assets/8969467/10871746/612ba44a-80a4-11e5-99a0-40b9931dee52.png)
(This is without CSS, but you get the idea.) shivaram
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9401 from felixcheung/rstudioprogrammingguide.
* [SPARK-10978][SQL] Allow data sources to eliminate filters (Cheng Lian, 2015-11-03, 6 files, -68/+315)
This PR adds a new method `unhandledFilters` to `BaseRelation`. Data sources which implement this method properly may avoid the overhead of defensive filtering done by Spark SQL.
Author: Cheng Lian <lian@databricks.com>
Closes #9399 from liancheng/spark-10978.unhandled-filters.
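A minimal sketch of how a source might use the new hook; the relation below is hypothetical and only handles equality filters on its `id` column, reporting everything else back to Spark for defensive filtering:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical relation over a tiny in-memory table.
class KeyValueRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  private val rows = Seq(Row(1, "a"), Row(2, "b"), Row(3, "c"))

  override def schema: StructType = StructType(
    StructField("id", IntegerType) :: StructField("value", StringType) :: Nil)

  // Report only the filters this source cannot evaluate; Spark SQL will
  // re-apply those defensively and skip re-checking the handled ones.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("id", _) => true
      case _                => false
    }

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    val kept = rows.filter { row =>
      filters.forall {
        case EqualTo("id", v) => row.getInt(0) == v
        case _                => true // unhandled filters are left to Spark
      }
    }
    val indices = requiredColumns.map(schema.fieldIndex)
    sqlContext.sparkContext.parallelize(kept.map(r => Row.fromSeq(indices.map(i => r.get(i)))))
  }
}
```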
* [SPARK-9790][YARN] Expose in WebUI if NodeManager is the reason why executors were killed. (Mark Grover, 2015-11-03, 8 files, -17/+29)
Author: Mark Grover <grover.markgrover@gmail.com>
Closes #8093 from markgrover/nm2.
* [SPARK-11349][ML] Support transform string label for RFormula (Yanbo Liang, 2015-11-03, 2 files, -1/+28)
Currently `RFormula` can only handle a label with `NumericType` or `BinaryType` (cast to `DoubleType` as the label for Linear Regression training); we should also support a label of `StringType`, which is needed for Logistic Regression (glm with family = "binomial"). For a label of `StringType`, we should use `StringIndexer` to transform it to a 0-based index.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9302 from yanboliang/spark-11349.
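A minimal sketch of the intended behavior, assuming an existing `sqlContext`; the data and column names are made up:

```scala
import org.apache.spark.ml.feature.RFormula

// Hypothetical data: a StringType label and two numeric features.
val df = sqlContext.createDataFrame(Seq(
  ("yes", 1.0, 0.5),
  ("no", 2.0, 1.5),
  ("yes", 3.0, 2.5)
)).toDF("clicked", "x1", "x2")

// With this change the StringType label is string-indexed to a 0-based
// double column, which is what logistic regression (glm with
// family = "binomial") expects.
val formula = new RFormula().setFormula("clicked ~ x1 + x2")
val output = formula.fit(df).transform(df)
output.select("features", "label").show()
```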
* [MINOR][ML] Fix naming conventions of AFTSurvivalRegression coefficients (Yanbo Liang, 2015-11-03, 2 files, -25/+25)
Rename `regressionCoefficients` back to `coefficients`, and rename `weights` to `parameters`. See discussion [here](https://github.com/apache/spark/pull/9311/files#diff-e277fd0bc21f825d3196b4551c01fe5fR230). mengxr vectorijk dbtsai
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9431 from yanboliang/aft-coefficients.
* [SPARK-9836][ML] Provide R-like summary statistics for OLS via normal equation solver (Yanbo Liang, 2015-11-03, 4 files, -7/+243)
https://issues.apache.org/jira/browse/SPARK-9836
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9413 from yanboliang/spark-9836.
* [SPARK-10304][SQL] Partition discovery should throw an exception if the dir structure is invalid (Liang-Chi Hsieh, 2015-11-03, 2 files, -13/+59)
JIRA: https://issues.apache.org/jira/browse/SPARK-10304
This patch detects if the structure of partition directories is not valid. The test cases are from #8547. Thanks zhzhan. cc liancheng
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #8840 from viirya/detect_invalid_part_dir.
* [SPARK-11256] Mark all Stage/ResultStage/ShuffleMapStage internal state as private. (Reynold Xin, 2015-11-03, 4 files, -38/+80)
Author: Reynold Xin <rxin@databricks.com>
Closes #9219 from rxin/stage-cleanup1.
* [SPARK-10533][SQL] Handle scientific notation in sqlParser (Daoyuan Wang, 2015-11-03, 3 files, -5/+32)
https://issues.apache.org/jira/browse/SPARK-10533
val df = sqlContext.createDataFrame(Seq(("a",1.0),("b",2.0),("c",3.0)))
df.filter("_2 < 2.0e1").show
Scientific notation didn't work.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #9085 from adrian-wang/scinotation.
* [SPARK-11344] Made ApplicationDescription and DriverDescription case classes (Jacek Lewandowski, 2015-11-03, 7 files, -46/+34)
DriverDescription was refactored to a case class because it included no mutable fields.
ApplicationDescription had one mutable field, appUiUrl. This field was set by the driver to point to the driver web UI. Master was modifying this field when the application was removed, to redirect requests to the history server. This was wrong because objects which are sent over the wire should be immutable. Now appUiUrl is immutable in ApplicationDescription and always points to the driver UI even if it is already shut down. The UI url which Master exposes to the user and modifies dynamically is now included in ApplicationInfo - a data object which describes the application state internally in Master. That URL in ApplicationInfo is initialised with the value from ApplicationDescription.
ApplicationDescription also included the value user, which is now a part of the case class fields.
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
Closes #9299 from jacek-lewandowski/SPARK-11344.
* [SPARK-11404][SQL] Support for groupBy using column expressions (Michael Armbrust, 2015-11-03, 3 files, -6/+106)
This PR adds a new method `groupBy(cols: Column*)` to `Dataset` that allows users to group using column expressions instead of a lambda function. Since the return type of these expressions is not known at compile time, we just set the key type as a generic `Row`. If the user would like to work with the key in a type-safe way, they can call `grouped.asKey[Type]`, which is also added in this PR.

```scala
val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
val grouped = ds.groupBy($"_1").asKey[String]
val agged = grouped.mapGroups { case (g, iter) =>
  Iterator((g, iter.map(_._2).sum))
}

agged.collect()
res0: Array(("a", 30), ("b", 3), ("c", 1))
```

Author: Michael Armbrust <michael@databricks.com>
Closes #9359 from marmbrus/columnGroupBy and squashes the following commits:
bbcb03b [Michael Armbrust] Update DatasetSuite.scala
8fd2908 [Michael Armbrust] Update DatasetSuite.scala
0b0e2f8 [Michael Armbrust] [SPARK-11404] [SQL] Support for groupBy using column expressions
* [SPARK-11436][SQL] Rebind right encoder when join 2 datasets (Wenchen Fan, 2015-11-03, 2 files, -1/+11)
When we join 2 datasets, we combine the 2 encoders into a tupled one and use it as the encoder for the joined dataset. Assume both of the 2 encoders are flat: their `constructExpression`s both reference the first element of the input row. However, when we combine 2 encoders, the schema of the input row changes, and the right encoder should now reference the second element of the input row. So we should rebind the right encoder to let it know the new schema of the input row before combining.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9391 from cloud-fan/join and squashes the following commits:
846d3ab [Wenchen Fan] rebind right encoder when join 2 datasets
* [SPARK-10429][SQL] Make mutableProjection atomic (Davies Liu, 2015-11-03, 3 files, -98/+97)
Right now, SQL's mutable projection updates each value of the mutable projection right after it evaluates the corresponding expression. This makes the behavior of MutableProjection confusing and complicates the implementation of common aggregate functions like stddev, because developers need to be aware that when evaluating the (i+1)-th expression of a mutable projection, the i-th slot of the mutable row has already been updated.
This PR makes the MutableProjection atomic, by generating all the results of the expressions first, then copying them into mutableRow.
Had run a micro-benchmark; there is no notable performance difference between using class members and local variables.
cc yhuai
Author: Davies Liu <davies@databricks.com>
Closes #9422 from davies/atomic_mutable and squashes the following commits:
bbc1758 [Davies Liu] support wide table
8a0ae14 [Davies Liu] fix bug
bec07da [Davies Liu] refactor
2891628 [Davies Liu] make mutableProjection atomic
* [SPARK-9858][SPARK-9859][SPARK-9861][SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins (Yin Huai, 2015-11-03, 9 files, -44/+1115)
https://issues.apache.org/jira/browse/SPARK-9858
https://issues.apache.org/jira/browse/SPARK-9859
https://issues.apache.org/jira/browse/SPARK-9861
Author: Yin Huai <yhuai@databricks.com>
Closes #9276 from yhuai/numReducer.
* [SPARK-9034][SQL] Reflect field names defined in GenericUDTF (navis.ryu, 2015-11-02, 16 files, -17/+34)
Hive's GenericUDTF#initialize() defines field names in the returned schema, but the current HiveGenericUDTF drops these names. We might need to reflect these in the logical plan tree.
Author: navis.ryu <navis@apache.org>
Closes #8456 from navis/SPARK-9034.
* [SPARK-11469][SQL] Allow users to define nondeterministic udfs. (Yin Huai, 2015-11-02, 6 files, -78/+262)
This is the first task (https://issues.apache.org/jira/browse/SPARK-11469) of https://issues.apache.org/jira/browse/SPARK-11438.
Author: Yin Huai <yhuai@databricks.com>
Closes #9393 from yhuai/udfNondeterministic.
* [SPARK-11432][GRAPHX] Personalized PageRank shouldn't use uniform initialization (Yves Raimond, 2015-11-02, 2 files, -15/+27)
Changes the personalized PageRank initialization to be non-uniform.
Author: Yves Raimond <yraimond@netflix.com>
Closes #9386 from moustaki/personalized-pagerank-init.
* [SPARK-11329][SQL] Support star expansion for structs. (Nong Li, 2015-11-02, 6 files, -38/+230)
1. Supporting expanding structs in projections, i.e. "SELECT s.*" where s is a struct type. This is fixed by allowing the expand function to handle structs in addition to tables.
2. Supporting expanding * inside aggregate functions of structs: "SELECT max(struct(col1, structCol.*))". This requires recursively expanding the expressions. In this case, it is the aggregate expression "max(...)", and we need to recursively expand its children inputs. A sketch of both forms follows this list.
Author: Nong Li <nongli@gmail.com>
Closes #9343 from nongli/spark-11329.
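A minimal sketch of the two forms described above, assuming an existing `sqlContext` and a hypothetical registered table `events` with a struct column `record`:

```scala
// Expanding a struct in a projection.
sqlContext.sql("SELECT record.* FROM events")

// Expanding a struct inside an aggregate function.
sqlContext.sql("SELECT max(struct(id, record.*)) FROM events")
```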
* [SPARK-5354][SQL] Cached tables should preserve partitioning and ordering. (Nong Li, 2015-11-02, 3 files, -9/+97)
For cached tables, we can just maintain the partitioning and ordering from the source relation.
Author: Nong Li <nongli@gmail.com>
Closes #9404 from nongli/spark-5354.
* [MINOR][ML] Removed the old `getModelWeights` function (DB Tsai, 2015-11-02, 1 file, -10/+0)
Removed the old `getModelWeights` function, which was private and had been renamed to `getModelCoefficients`.
Author: DB Tsai <dbt@netflix.com>
Closes #9426 from dbtsai/feature-minor.
* [SPARK-11236][TEST-MAVEN][TEST-HADOOP1.0][CORE] Update Tachyon dependency 0.7.1 -> 0.8.1 (Calvin Jia, 2015-11-02, 2 files, -9/+5)
This is a reopening of #9204, which failed hadoop1 sbt tests. With the original PR, a classpath issue would occur due to the MIMA plugin pulling in hadoop-2.2 dependencies regardless of the hadoop version when building the `oldDeps` project. These affect the hadoop1 sbt build because they are placed in `lib_managed` and Tachyon 0.8.0's default hadoop version is 2.2.
Author: Calvin Jia <jia.calvin@gmail.com>
Closes #9395 from calvinjia/spark-11236.
* [SPARK-10592][ML][PYSPARK] Deprecate weights and use coefficients instead in ML models (vectorijk, 2015-11-02, 14 files, -211/+263)
Deprecated in `LogisticRegression` and `LinearRegression`.
Author: vectorijk <jiangkai@gmail.com>
Closes #9311 from vectorijk/spark-10592.
* [SPARK-11343][ML] Allow float and double prediction/label columns in RegressionEvaluator (Dominik Dahlem, 2015-11-02, 1 file, -4/+8)
mengxr, felixcheung
This pull request just relaxes the type of the prediction/label columns to be float and double. Internally, these columns are cast to double. The other evaluators might need to be changed also.
Author: Dominik Dahlem <dominik.dahlem@gmail.combination>
Closes #9296 from dahlem/ddahlem_regression_evaluator_double_predictions_27102015.
* [SPARK-10286][ML][PYSPARK][DOCS] Add @since annotation to pyspark.ml.param and pyspark.ml.* (lihao, 2015-11-02, 4 files, -0/+230)
Author: lihao <lihaowhu@gmail.com>
Closes #9275 from lidinghao/SPARK-10286.
* [SPARK-11383][DOCS] Replaced example code in mllib-naive-bayes.md/mllib-isotonic-regression.md using include_example (Rishabh Bhardwaj, 2015-11-02, 8 files, -207/+391)
I have made the required changes in mllib-naive-bayes.md and mllib-isotonic-regression.md and also verified them. Kindly review.
Author: Rishabh Bhardwaj <rbnext29@gmail.com>
Closes #9353 from rishabhbhardwaj/SPARK-11383.
* [SPARK-11371] Make "mean" an alias for "avg" operatortedyu2015-11-022-0/+10
| | | | | | | | | | | From Reynold in the thread 'Exception when using some aggregate operators' (http://search-hadoop.com/m/q3RTt0xFr22nXB4/): I don't think these are bugs. The SQL standard for average is "avg", not "mean". Similarly, a distinct count is supposed to be written as "count(distinct col)", not "countDistinct(col)". We can, however, make "mean" an alias for "avg" to improve compatibility between DataFrame and SQL. Author: tedyu <yuzhihong@gmail.com> Closes #9332 from ted-yu/master.
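A minimal sketch of the resulting compatibility, assuming an existing `sqlContext` and a hypothetical registered table `t` with a numeric `value` column:

```scala
// After this change both the SQL-standard "avg" and the alias "mean"
// resolve to the same aggregate in SQL, matching the DataFrame API.
sqlContext.sql("SELECT key, avg(value) FROM t GROUP BY key").show()
sqlContext.sql("SELECT key, mean(value) FROM t GROUP BY key").show()
```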
* [SPARK-11358][MLLIB] Deprecate runs in k-means (Xiangrui Meng, 2015-11-02, 2 files, -2/+6)
This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate this method in 1.6, and void it in 1.7. It helps us simplify the implementation.
cc: srowen
Author: Xiangrui Meng <meng@databricks.com>
Closes #9322 from mengxr/SPARK-11358.
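A minimal sketch of the single-run configuration this points toward, assuming an existing SparkContext `sc`; the data is made up:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))

// Configure k-means without setRuns (now deprecated); a single run is
// the common case the simplified implementation targets.
val model = new KMeans()
  .setK(2)
  .setMaxIterations(20)
  .run(data)

model.clusterCenters.foreach(println)
```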
* [SPARK-11456][TESTS] Remove deprecated junit.framework in Java tests (Sean Owen, 2015-11-02, 4 files, -83/+84)
Replace use of `junit.framework` with `org.junit`, and touch up tests in question.
Author: Sean Owen <sowen@cloudera.com>
Closes #9411 from srowen/SPARK-11456.
* [SPARK-11437][PYSPARK] Don't .take when converting RDD to DataFrame with provided schema (Jason White, 2015-11-02, 1 file, -7/+1)
When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.
Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.
marmbrus We chatted about this at SparkSummitEU. davies you made a similar change for the infer-schema path in https://github.com/apache/spark/pull/6606
Author: Jason White <jason.white@shopify.com>
Closes #9392 from JasonMWhite/createDataFrame_without_take.
* [SPARK-10997][CORE] Add "client mode" to netty rpc env.Marcelo Vanzin2015-11-0217-190/+266
| | | | | | | | | | | | | | | | | | | | | | | "Client mode" means the RPC env will not listen for incoming connections. This allows certain processes in the Spark stack (such as Executors or tha YARN client-mode AM) to act as pure clients when using the netty-based RPC backend, reducing the number of sockets needed by the app and also the number of open ports. Client connections are also preferred when endpoints that actually have a listening socket are involved; so, for example, if a Worker connects to a Master and the Master needs to send a message to a Worker endpoint, that client connection will be used, even though the Worker is also listening for incoming connections. With this change, the workaround for SPARK-10987 isn't necessary anymore, and is removed. The AM connects to the driver in "client mode", and that connection is used for all driver <-> AM communication, and so the AM is properly notified when the connection goes down. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9210 from vanzin/SPARK-10997.
* [SPARK-9817][YARN] Improve the locality calculation of containers by taking pending container requests into consideration (jerryshao, 2015-11-02, 5 files, -40/+159)
This is a follow-up PR to further improve the locality calculation by considering the pending container requests. Since the locality preferences of tasks may shift from time to time, the current localities of pending container requests may not fully match the new preferences; this PR improves it by removing outdated, unmatched container requests and replacing them with new requests.
sryza please help to review, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes #8100 from jerryshao/SPARK-9817.
* [SPARK-11311][SQL] Spark cannot describe temporary functions (Daoyuan Wang, 2015-11-02, 2 files, -1/+15)
When describing a temporary function, Spark would return 'Unable to find function', which is not right.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #9277 from adrian-wang/functionreg.
* [SPARK-10786][SQL] Take the whole statement to generate the CommandProcessor (huangzhaowei, 2015-11-02, 1 file, -1/+1)
In the current implementation of `SparkSQLCLIDriver.scala`:
`val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)), hconf)`
`CommandProcessorFactory` only takes the first token of the statement, which makes it hard to distinguish the statement `delete jar xxx` from `delete from xxx`. So maybe it's better to pass the whole statement into the `CommandProcessorFactory`.
And in [HiveCommand](https://github.com/SaintBacchus/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/processors/HiveCommand.java#L76), it already special-cases these two statements:

```java
if (command.length > 1 && "from".equalsIgnoreCase(command[1])) {
  // special handling for SQL "delete from <table> where..."
  return null;
}
```

Author: huangzhaowei <carlmartinmax@gmail.com>
Closes #8895 from SaintBacchus/SPARK-10786.
* [SPARK-11413][BUILD] Bump joda-time version to 2.9 for java 8 and s3 (Yongjia Wang, 2015-11-02, 1 file, -1/+1)
| | | | | | | | | It's a known issue that joda-time before 2.8.1 is incompatible with java 1.8u60 or later, which causes s3 request to fail. This affects Spark when using s3 as data source. https://github.com/aws/aws-sdk-java/issues/444 Author: Yongjia Wang <yongjiaw@gmail.com> Closes #9379 from yongjiaw/SPARK-11413.