Commit log, most recent first. Each entry lists the commit subject, author, date, and diffstat (files changed, +insertions/-deletions).
* [SPARK-7118] [Python] Add the coalesce Spark SQL function available in PySpark
  Olivier Girardot, 2015-05-07 (1 file, +37/-0)
  This patch adds a proxy call from PySpark to the Spark SQL coalesce function, and comes out of a discussion on dev@spark with rxin. This contribution is my original work and I license the work to the project under the project's open source license. Olivier.
  Author: Olivier Girardot <o.girardot@lateral-thoughts.com>
  Closes #5698 from ogirardot/master and squashes the following commits: d9a4439 [Olivier Girardot] SPARK-7118 Add the coalesce Spark SQL function available in PySpark
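  For illustration, a minimal PySpark sketch of the new function (assuming an existing sqlContext; the column names are made up):
  ```python
  from pyspark.sql.functions import coalesce

  # coalesce returns the first non-null value among its column arguments.
  df = sqlContext.createDataFrame([("x", "y"), (None, "fallback")], ["a", "b"])
  df.select(coalesce(df["a"], df["b"]).alias("first_non_null")).show()
  # -> "x" for the first row, "fallback" for the second
  ```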
* [SPARK-7388] [SPARK-7383] wrapper for VectorAssembler in Python
  Burak Yavuz, 2015-05-07 (8 files, +105/-13)
  The wrapper required the implementation of `ArrayParam`, because `Array[T]` is hard to obtain from Python. `ArrayParam` has an extra function called `wCast`, an internal function to obtain `Array[T]` from `Seq[T]`.
  Author: Burak Yavuz <brkyvz@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5930 from brkyvz/ml-feat and squashes the following commits: 73e745f [Burak Yavuz] Merge pull request #3 from mengxr/SPARK-7388 c221db9 [Xiangrui Meng] overload StringArrayParam.w c81072d [Burak Yavuz] addressed comments 99c2ebf [Burak Yavuz] add to python_shared_params 39ecb07 [Burak Yavuz] fix scalastyle 7f7ea2a [Burak Yavuz] [SPARK-7388][SPARK-7383] wrapper for VectorAssembler in Python
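  A usage sketch of the wrapped transformer (column names are illustrative; sqlContext is assumed):
  ```python
  from pyspark.ml.feature import VectorAssembler

  # Merge several numeric columns into a single vector column for ML pipelines.
  df = sqlContext.createDataFrame([(1.65, 60.0), (1.80, 80.0)], ["height", "weight"])
  assembler = VectorAssembler(inputCols=["height", "weight"], outputCol="features")
  assembler.transform(df).select("features").show()
  ```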
* [SPARK-7330] [SQL] avoid NPE at jdbc rdd
  Daoyuan Wang, 2015-05-07 (2 files, +32/-1)
  Thanks to nadavoosh for pointing this out in #5590.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #5877 from adrian-wang/jdbcrdd and squashes the following commits: cc11900 [Daoyuan Wang] avoid NPE in jdbcrdd
* [SPARK-7429] [ML] Params cleanups
  Joseph K. Bradley, 2015-05-07 (3 files, +4/-4)
  Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does. CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel. CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #5960 from jkbradley/params-cleanups and squashes the following commits: 118b158 [Joseph K. Bradley] Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does. CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel
* [SPARK-7421] [MLLIB] OnlineLDA cleanups
  Joseph K. Bradley, 2015-05-07 (4 files, +34/-28)
  Small changes, primarily to allow us more flexibility in the future:
  - Rename "tau_0" to "tau0"
  - Mark the LDAOptimizer trait sealed and DeveloperApi.
  - Mark LDAOptimizer subclasses as final.
  - Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future
  CC: hhbyyh
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #5956 from jkbradley/onlinelda-cleanups and squashes the following commits: f4be508 [Joseph K. Bradley] added newline f4003e4 [Joseph K. Bradley] Changes: * Rename "tau_0" to "tau0" * Mark LDAOptimizer trait sealed and DeveloperApi. * Mark LDAOptimizer subclasses as final. * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future
* [SPARK-7035] Encourage __getitem__ over __getattr__ on column access in the Python DataFrame API
  ksonj, 2015-05-07 (1 file, +8/-3)
  Author: ksonj <kson@siberie.de>
  Closes #5971 from ksonj/doc and squashes the following commits: dadfebb [ksonj] __getitem__ is cleaner than __getattr__
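  The ambiguity the docs now warn about, sketched (sqlContext assumed):
  ```python
  df = sqlContext.createDataFrame([(1, 2)], ["id", "count"])

  df.count()   # attribute access hits the DataFrame.count() method...
  df["count"]  # ...while __getitem__ unambiguously returns the "count" column
  df.select(df["count"]).show()
  ```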
* [SPARK-7295] [SQL] bitwise operations for DataFrame DSL
  Shiti, 2015-05-07 (7 files, +97/-2)
  Author: Shiti <ssaxena.ece@gmail.com>
  Closes #5867 from Shiti/spark-7295 and squashes the following commits: 71a9913 [Shiti] implementation for bitwise and, or, not and xor on Column with tests and docs
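  The patch targets the Scala Column DSL; a rough PySpark equivalent, assuming the matching Python Column methods exist:
  ```python
  df = sqlContext.createDataFrame([(0b1010, 0b0110)], ["a", "b"])
  df.select(
      df["a"].bitwiseAND(df["b"]),  # 0b0010 == 2
      df["a"].bitwiseOR(df["b"]),   # 0b1110 == 14
      df["a"].bitwiseXOR(df["b"]),  # 0b1100 == 12
  ).show()
  ```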
* [SPARK-7217] [STREAMING] Add configuration to control the default behavior of StreamingContext.stop() implicitly calling SparkContext.stop()
  Tathagata Das, 2015-05-07 (2 files, +24/-5)
  In environments like notebooks, the SparkContext is managed by the underlying infrastructure and it is expected that the SparkContext will not be stopped. However, StreamingContext.stop() calls SparkContext.stop() as a non-intuitive side effect. This PR adds a configuration to SparkConf that sets the default StreamingContext stop behavior, chosen so that the existing behavior does not change for existing users.
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #5929 from tdas/SPARK-7217 and squashes the following commits: 869a763 [Tathagata Das] Changed implementation. 685fe00 [Tathagata Das] Added configuration
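  A sketch of the intended use from PySpark (the configuration key spelling is an assumption; the message above does not quote it):
  ```python
  from pyspark import SparkConf, SparkContext
  from pyspark.streaming import StreamingContext

  # Keep the externally managed SparkContext alive when streaming stops.
  conf = SparkConf().set("spark.streaming.stopSparkContextByDefault", "false")
  sc = SparkContext(conf=conf)
  ssc = StreamingContext(sc, batchDuration=1)
  # ... build DStreams, ssc.start(), and eventually:
  ssc.stop(stopSparkContext=False)  # the explicit per-call opt-out also works
  ```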
* [SPARK-7430] [STREAMING] [TEST] General improvements to streaming tests to increase debuggability
  Tathagata Das, 2015-05-07 (1 file, +21/-12)
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #5961 from tdas/SPARK-7430 and squashes the following commits: d654978 [Tathagata Das] Fix scala style fbf7174 [Tathagata Das] Added more verbose assert failure messages. 6aea07a [Tathagata Das] Ensure SynchronizedBuffer is used in every TestSuiteBase
* [SPARK-5938] [SPARK-5443] [SQL] Improve JsonRDD performance
  Nathan Howell, 2015-05-06 (13 files, +715/-128)
  This patch comprises a few related pieces of work:
  - Schema inference is performed directly on the JSON token stream
  - `String => Row` conversion populates Spark SQL structures without intermediate types
  - Projection pushdown is implemented via CatalystScan for DataFrame queries
  - Support for the legacy parser by setting `spark.sql.json.useJacksonStreamingAPI` to `false`
  Performance improvements depend on the schema and queries being executed, but it should be faster across the board. Below are benchmarks using the last.fm Million Song dataset:
  ```
  Command                                            | Baseline | Patched
  ---------------------------------------------------|----------|--------
  import sqlContext.implicits._                      |          |
  val df = sqlContext.jsonFile("/tmp/lastfm.json")   | 70.0s    | 14.6s
  df.count()                                         | 28.8s    | 6.2s
  df.rdd.count()                                     | 35.3s    | 21.5s
  df.where($"artist" === "Robert Hood").collect()    | 28.3s    | 16.9s
  ```
  To prepare this dataset for benchmarking, follow these steps:
  ```
  # Fetch the datasets from http://labrosa.ee.columbia.edu/millionsong/lastfm
  wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_test.zip \
       http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_train.zip
  # Decompress and combine, pipe through `jq -c` to ensure there is one record per line
  unzip -p lastfm_test.zip lastfm_train.zip | jq -c . > lastfm.json
  ```
  Author: Nathan Howell <nhowell@godaddy.com>
  Closes #5801 from NathanHowell/json-performance and squashes the following commits: 26fea31 [Nathan Howell] Recreate the baseRDD for each scan operation a7ebeb2 [Nathan Howell] Increase coverage of inserts into a JSONRelation e06a1dd [Nathan Howell] Add comments to the `useJacksonStreamingAPI` config flag 6822712 [Nathan Howell] Split up JsonRDD2 into multiple objects fa8234f [Nathan Howell] Wrap long lines b31917b [Nathan Howell] Rename `useJsonRDD2` to `useJacksonStreamingAPI` 15c5d1b [Nathan Howell] JSONRelation's baseRDD need not be lazy f8add6e [Nathan Howell] Add comments on lack of support for precision and scale DecimalTypes fa0be47 [Nathan Howell] Remove unused default case in the field parser 80dba17 [Nathan Howell] Add comments regarding null handling and empty strings 842846d [Nathan Howell] Point the empty schema inference test at JsonRDD2 ab6ee87 [Nathan Howell] Add projection pushdown support to JsonRDD/JsonRDD2 f636c14 [Nathan Howell] Enable JsonRDD2 by default, add a flag to switch back to JsonRDD 0bbc445 [Nathan Howell] Improve JSON parsing and type inference performance 7ca70c1 [Nathan Howell] Eliminate arrow pattern, replace with pattern matches
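  A sketch of opting back into the legacy parser from PySpark (flag name taken from the commit message; sqlContext assumed):
  ```python
  # Fall back to the pre-Jackson-streaming JSON code path if needed.
  sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "false")
  df = sqlContext.jsonFile("/tmp/lastfm.json")  # 1.x-era loader
  print(df.count())
  ```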
* [SPARK-6812] [SPARKR] filter() on DataFrame does not work as expected.
  Sun Rui, 2015-05-06 (1 file, +6/-4)
  According to the R manual (https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html): "if a function .First is found on the search path, it is executed as .First(). Finally, function .First.sys() in the base package is run. This calls require to attach the default packages specified by options("defaultPackages")." In .First() in profile/shell.R, we load the SparkR package. This means SparkR is loaded before the default packages; if a default package defines the same names, they overwrite those in SparkR. This is why filter() in SparkR is masked by filter() in stats, which is usually in the default package list. We need to make sure SparkR is loaded after the default packages. The solution is to append SparkR to the default packages, instead of loading SparkR in .First().
  BTW, I'd like to discuss our policy on how to resolve name conflicts. Previously, we renamed API names from the Scala API when they conflicted with base or other commonly used packages. From a long-term perspective, however, this is not good for API stability, because we can't predict name conflicts; for example, what if in the future a name added to the base package conflicts with an API in SparkR? So the better policy is to keep API names the same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as the last package, so that all API names are effective. Users can explicitly use :: to refer to hidden names from other packages. If we agree on this, I can submit a JIRA issue to change back some renamed API methods, for example, DataFrame.sortDF().
  Author: Sun Rui <rui.sun@intel.com>
  Closes #5938 from sun-rui/SPARK-6812 and squashes the following commits: b569145 [Sun Rui] [SPARK-6812][SparkR] filter() on DataFrame does not work as expected.
* [SPARK-7432] [MLLIB] disable cv doctest
  Xiangrui Meng, 2015-05-06 (1 file, +4/-4)
  Temporarily disable the flaky doctest for CrossValidator. jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5962 from mengxr/disable-pyspark-cv-test and squashes the following commits: 5db7e5b [Xiangrui Meng] disable cv doctest
* [SPARK-7405] [STREAMING] Fix the bug that ReceiverInputDStream doesn't report InputInfo
  zsxwing, 2015-05-06 (2 files, +12/-0)
  The bug exists because SPARK-7139 unintentionally removed some code from SPARK-7112 here: https://github.com/apache/spark/commit/1854ac326a9cc6014817d8df30ed0458eee5d7d1#diff-5c8651dd78abd20439b8eb938175075dL72 This PR adds it back, with assertions in the tests to verify the fix.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #5950 from zsxwing/SPARK-7405 and squashes the following commits: 675f5d9 [zsxwing] Fix the bug that ReceiverInputDStream doesn't report InputInfo
* [HOT FIX] For DAG visualization #5954
  Andrew Or, 2015-05-06 (1 file, +2/-2)
* [SPARK-7371] [SPARK-7377] [SPARK-7408] DAG visualization addendum (#5729)
  Andrew Or, 2015-05-06 (5 files, +310/-177)
  This is a follow-up patch for #5729.
  - [SPARK-7408] Move as much style code from JS to CSS as possible
  - [SPARK-7377] Fix JS error if a job / stage contains only one RDD
  - [SPARK-7371] Decrease emphasis on RDD on stage page as requested by mateiz pwendell
  This patch also includes general code clean up. Before/after screenshot: https://issues.apache.org/jira/secure/attachment/12730992/before-after.png
  Author: Andrew Or <andrew@databricks.com>
  Closes #5954 from andrewor14/viz-emphasize-rdd and squashes the following commits: 3c0d4f0 [Andrew Or] Guard against JS error by rendering arrows only if needed f23e15b [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz-emphasize-rdd 565801f [Andrew Or] Clean up code 9dab5f0 [Andrew Or] Move styling from JS to CSS + clean up code 107c0b6 [Andrew Or] Tweak background color, stroke width, font size etc. 1610c62 [Andrew Or] Implement cluster padding for stage page
* [SPARK-7396] [STREAMING] [EXAMPLE] Update KafkaWordCountProducer to use new Producer API
  jerryshao, 2015-05-06 (1 file, +13/-11)
  Otherwise it will throw an exception:
  ```
  Exception in thread "main" kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
      at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
      at kafka.producer.Producer.send(Producer.scala:77)
      at org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96)
      at org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623)
      at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
      at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
      at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
      at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
  ```
  Author: jerryshao <saisai.shao@intel.com>
  Closes #5936 from jerryshao/SPARK-7396 and squashes the following commits: 270bbe2 [jerryshao] Fix Kafka Produce throw Exception issue
* [SPARK-6799] [SPARKR] Remove SparkR RDD examples, add dataframe examples
  Shivaram Venkataraman, 2015-05-06 (8 files, +55/-355)
  This PR also makes some of the DataFrame to RDD methods private, as the RDD class is private in 1.4. cc rxin pwendell
  Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  Closes #5949 from shivaram/sparkr-examples and squashes the following commits: 6c42fdc [Shivaram Venkataraman] Remove SparkR RDD examples, add dataframe examples
* [HOT FIX] [SPARK-7418] Ignore flaky SparkSubmitUtilsSuite test
  Andrew Or, 2015-05-06 (1 file, +1/-1)
* [SPARK-5995] [ML] Make Prediction dev API public
  Joseph K. Bradley, 2015-05-06 (16 files, +206/-267)
  Changes:
  - Update protected prediction methods, following the design doc. (the most interesting change)
  - Changed abstract classes for Estimator and Model to be public. Added DeveloperApi tag. (I kept the traits for Estimator/Model Params private.)
  - Changed ProbabilisticClassificationModel method names to use probability instead of probabilities.
  CC: mengxr shivaram etrain
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #5913 from jkbradley/public-dev-api and squashes the following commits: e9aa0ea [Joseph K. Bradley] moved findMax to DenseVector and renamed to argmax. fixed bug for vector of length 0 15b9957 [Joseph K. Bradley] renamed probabilities to probability in method names 5cda84d [Joseph K. Bradley] regenerated sharedParams 7d1877a [Joseph K. Bradley] Made spark.ml prediction abstractions public. Organized their prediction methods for efficient computation of multiple output columns.
* [HOT-FIX] Move HiveWindowFunctionQuerySuite.scala to hive compatibility dir.
  Yin Huai, 2015-05-06 (1 file, +0/-0)
  Author: Yin Huai <yhuai@databricks.com>
  Closes #5951 from yhuai/fixBuildMaven and squashes the following commits: fdde183 [Yin Huai] Move HiveWindowFunctionQuerySuite.scala to hive compatibility dir.
* Add `Private` annotation.
  Josh Rosen, 2015-05-06 (1 file, +41/-0)
  This was originally added as part of #4435, which was reverted.
* [SPARK-7311] Introduce internal Serializer API for determining if serializers support object relocation
  Josh Rosen, 2015-05-06 (5 files, +166/-3)
  This patch extends the `Serializer` interface with a new `Private` API which allows serializers to indicate whether they support relocation of serialized objects in serializer stream output. This relocatability property is described in more detail in `Serializer.scala`, but in a nutshell a serializer supports relocation if reordering the bytes of serialized objects in serialization stream output is equivalent to having re-ordered those elements prior to serializing them. The optimized shuffle paths introduced in #4450 and #5868 both rely on serializers having this property; this patch just centralizes the logic for determining whether a serializer has it. I also added tests and comments clarifying when this works for KryoSerializer. This change allows the optimizations in #4450 to be applied for shuffles that use `SqlSerializer2`.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #5924 from JoshRosen/SPARK-7311 and squashes the following commits: 50a68ca [Josh Rosen] Address minor nits 0a7ebd7 [Josh Rosen] Clarify reason why SqlSerializer2 supports this serializer 123b992 [Josh Rosen] Cleanup for submitting as standalone patch. 4aa61b2 [Josh Rosen] Add missing newline 2c1233a [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use: 0ba75e6 [Josh Rosen] Add tests for serializer relocation property. 450fa21 [Josh Rosen] Back out accidental log4j.properties change 86d4dcd [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation b9624ee [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.
* [SPARK-1442] [SQL] Window Function Support for Spark SQL
  Yin Huai, 2015-05-06 (101 files, +34768/-34)
  Adding more information about the implementation... This PR adds support for window functions to Spark SQL (specifically the OVER and WINDOW clauses). For every expression having an OVER clause, we use a WindowExpression as the container of a WindowFunction and the corresponding WindowSpecDefinition (the definition of a window frame, i.e. the partition specification, order specification, and frame specification appearing in an OVER clause).
  # Implementation #
  The high-level workflow of the implementation is as follows.
  - Query parsing: during parsing, all WindowExpressions are originally placed in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. This keeps our changes simple and keeps all parsing rules for window functions in a single place (nodesToWindowSpecification). For the WINDOW clause in a query, we use a WithWindowDefinition as the container of the mapping from the name of a window specification to a WindowSpecDefinition. This change is similar to our common table expression support.
  - Analysis: the query analysis process has three steps for window functions.
    - Resolve all WindowSpecReferences by replacing them with WindowSpecDefinitions according to the mapping table stored in the WithWindowDefinition node.
    - Resolve WindowFunctions in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. For this PR, we use Hive's functions for window functions because we will have a major refactoring of our internal UDAFs, and it is better to switch our UDAFs after that refactoring work.
    - Once we have resolved all WindowFunctions, we use ResolveWindowFunction to extract WindowExpressions from projectList and aggregateExpressions and then create a Window operator for every distinct WindowSpecDefinition. With this choice, at execution time we can rely on the Exchange operator to do all the work of reorganizing the table, and we do not need to worry about it in the physical Window operator. An example analyzed plan is shown below:
  ```
  sql("""
  SELECT
    year, country, product, sales,
    avg(sales) over(partition by product) avg_product,
    sum(sales) over(partition by country) sum_country
  FROM sales
  ORDER BY year, country, product
  """).explain(true)

  == Analyzed Logical Plan ==
  Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
   Project [year#34,country#35,product#36,sales#37,avg_product#27,sum_country#28]
    Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
     Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      Project [year#34,country#35,product#36,sales#37]
       MetastoreRelation default, sales, None
  ```
  - Query planning: during query planning, we simply generate the physical Window operator based on the logical Window operator. Then, to prepare the executedPlan, the EnsureRequirements rule adds Exchange and Sort operators as necessary. EnsureRequirements analyzes the data properties and tries to avoid unnecessary shuffles and sorts. The physical plan for the above example query is shown below:
  ```
  == Physical Plan ==
  Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
   Exchange (RangePartitioning [year#34 ASC,country#35 ASC,product#36 ASC], 200), []
    Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
     Exchange (HashPartitioning [country#35], 200), [country#35 ASC]
      Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       Exchange (HashPartitioning [product#36], 200), [product#36 ASC]
        HiveTableScan [year#34,country#35,product#36,sales#37], (MetastoreRelation default, sales, None), None
  ```
  - Execution time: at execution time, a physical Window operator buffers all rows in a partition specified in the partition spec of an OVER clause. If necessary, it also maintains a sliding window frame. The current implementation tries to buffer the input parameters of a window function according to the window frame, to avoid evaluating a row multiple times.
  # Future work #
  Here are three improvements that are not hard to add:
  - Take advantage of the window frame specification to reduce the number of rows buffered in the physical Window operator. In some cases we only need to buffer the rows appearing in the sliding window; in others (e.g. ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) we will not be able to reduce the number of rows buffered.
  - When a RANGE frame is used, for <value> PRECEDING and <value> FOLLOWING, it would be great if the <value> part could be an expression (we can start with Literal). Then, when the data type of the ORDER BY expression is a FractionalType, we can support FractionalType as the type of <value> (<value> still needs to evaluate to a positive value).
  - When a RANGE frame is used, we need to support DateType and TimestampType as the data type of the expression appearing in the order specification. Then the <value> part of <value> PRECEDING and <value> FOLLOWING can support interval types (once we support them).
  This is joint work with guowei2 and yhuai. Thanks hbutani hvanhovell for their comments. Thanks scwf for his comments and unit tests.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #5604 from guowei2/windowImplement and squashes the following commits: 76fe1c8 [Yin Huai] Implementation. aa2b0ae [Yin Huai] Tests.
* [SPARK-6201] [SQL] promote string and do widen types for IN
  Daoyuan Wang, 2015-05-06 (3 files, +22/-2)
  huangjs Actually, Spark SQL first goes through the analysis phase, in which we widen types and promote strings, and then optimization, where a constant IN is converted into INSET. So it turns out that we only need to fix this for IN.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #4945 from adrian-wang/inset and squashes the following commits: 71e05cc [Daoyuan Wang] minor fix 581fa1c [Daoyuan Wang] mysql way f3f7baf [Daoyuan Wang] address comments 5eed4bc [Daoyuan Wang] promote string and do widen types for IN
* [SPARK-5456] [SQL] fix decimal compare for jdbc rdd
  Daoyuan Wang, 2015-05-06 (2 files, +11/-2)
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #5803 from adrian-wang/decimalcompare and squashes the following commits: aef0e96 [Daoyuan Wang] add null handle ec455b9 [Daoyuan Wang] fix decimal compare for jdbc rdd
* [SQL] JavaDoc update for various DataFrame functions.
  Reynold Xin, 2015-05-06 (4 files, +32/-21)
  Author: Reynold Xin <rxin@databricks.com>
  Closes #5935 from rxin/df-doc1 and squashes the following commits: aaeaadb [Reynold Xin] [SQL] JavaDoc update for various DataFrame functions.
* [SPARK-6940] [MLLIB] Add CrossValidator to Python ML pipeline API
  Xiangrui Meng, 2015-05-06 (4 files, +199/-8)
  Since CrossValidator is a meta-algorithm, we copy the implementation in Python. jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5926 from mengxr/SPARK-6940 and squashes the following commits: 6af181f [Xiangrui Meng] add TODOs 8285134 [Xiangrui Meng] update doc 060f7c3 [Xiangrui Meng] update doctest acac727 [Xiangrui Meng] add keyword args cdddecd [Xiangrui Meng] add CrossValidator in Python
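  A usage sketch of the new Python API (`training` is an assumed DataFrame with "label" and "features" columns):
  ```python
  from pyspark.ml.classification import LogisticRegression
  from pyspark.ml.evaluation import BinaryClassificationEvaluator
  from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

  lr = LogisticRegression()
  grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
  cv = CrossValidator(estimator=lr,
                      estimatorParamMaps=grid,
                      evaluator=BinaryClassificationEvaluator(),
                      numFolds=3)
  cvModel = cv.fit(training)  # picks the best params by cross-validated metric
  ```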
* [SPARK-7384] [Core] [Tests] Fix flaky tests for distributed mode in BroadcastSuite
  zsxwing, 2015-05-05 (2 files, +18/-4)
  Fixed the following failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/452/testReport/junit/org.apache.spark.broadcast/BroadcastSuite/Unpersisting_HttpBroadcast_on_executors_and_driver_in_distributed_mode/
  The tests should wait until all slaves are up. Otherwise, only some of the `BlockManager`s may be registered, failing the tests.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #5925 from zsxwing/SPARK-7384 and squashes the following commits: 783cb7b [zsxwing] Add comments for _jobProgressListener and remove postfixOps 1009ef1 [zsxwing] [SPARK-7384][Core][Tests] Fix flaky tests for distributed mode in BroadcastSuite
* [SPARK-6267] [MLLIB] Python API for IsotonicRegression
  Yanbo Liang, 2015-05-05 (3 files, +106/-4)
  https://issues.apache.org/jira/browse/SPARK-6267
  Author: Yanbo Liang <ybliang8@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5890 from yanboliang/spark-6267 and squashes the following commits: f20541d [Yanbo Liang] Merge pull request #3 from mengxr/SPARK-6267 7f202f9 [Xiangrui Meng] use Vector to have the best Python 2&3 compatibility 4bccfee [Yanbo Liang] fix doctest ec09412 [Yanbo Liang] fix typos 8214bbb [Yanbo Liang] fix code style 5c8ebe5 [Yanbo Liang] Python API for IsotonicRegression
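  A usage sketch of the new Python API (training tuples are (label, feature, weight); sc assumed):
  ```python
  from pyspark.mllib.regression import IsotonicRegression

  data = sc.parallelize([(1.0, 1.0, 1.0), (2.0, 2.0, 1.0), (1.5, 3.0, 1.0)])
  model = IsotonicRegression.train(data, isotonic=True)
  print(model.predict(2.5))  # interpolates between fitted boundaries
  ```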
* [SPARK-7358] [SQL] Move DataFrame mathfunctions into functions
  Burak Yavuz, 2015-05-05 (8 files, +543/-492)
  After a discussion on the user mailing list, it was decided to put all UDFs under `o.a.s.sql.functions`. cc rxin
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5923 from brkyvz/move-math-funcs and squashes the following commits: a8dc3f7 [Burak Yavuz] address comments cf7a7bb [Burak Yavuz] [SPARK-7358] Move DataFrame mathfunctions into functions
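  On the Python side the same consolidation would mean importing math helpers from the single functions module (a sketch, assuming the Python helpers moved alongside; sqlContext assumed):
  ```python
  from pyspark.sql.functions import pow, sqrt  # one module for all column functions

  df = sqlContext.createDataFrame([(4.0,)], ["x"])
  df.select(sqrt(df["x"]), pow(df["x"], 2.0)).show()
  ```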
* [SPARK-6841] [SPARKR] add support for mean, median, stdev etc.
  qhuang, 2015-05-05 (4 files, +53/-0)
  Moving here from https://github.com/amplab-extras/SparkR-pkg/pull/241. sum() has been implemented (https://github.com/amplab-extras/SparkR-pkg/pull/242). Now Phase 1: mean, sd, var have been implemented, but some things still need to be improved per the suggestions in https://issues.apache.org/jira/browse/SPARK-6841
  Author: qhuang <qian.huang@intel.com>
  Closes #5446 from hqzizania/R and squashes the following commits: f283572 [qhuang] add test unit for describe() 2e74d5a [qhuang] add describe() DataFrame API
* Revert "[SPARK-3454] separate json endpoints for data in the UI"
  Reynold Xin, 2015-05-05 (100 files, +172/-19946)
  This reverts commit d49735800db27239c11478aac4b0f2ec9df91a3f. The commit broke Spark on Windows.
* [SPARK-6231] [SQL/DF] Automatically resolve join condition ambiguity for self-joins.
  Reynold Xin, 2015-05-05 (6 files, +170/-43)
  See the comment in the join function for more information.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #5919 from rxin/self-join-resolve and squashes the following commits: e2fb0da [Reynold Xin] Updated SQLConf comment. 7233a86 [Reynold Xin] Updated comment. 6be2b4d [Reynold Xin] Removed println 9f6b72f [Reynold Xin] [SPARK-6231][SQL/DF] Automatically resolve ambiguity in join condition for self-joins.
* Some minor cleanup after SPARK-4550.
  Sandy Ryza, 2015-05-05 (2 files, +4/-9)
  JoshRosen, this PR addresses the comments you left on #4450 after it was merged.
  Author: Sandy Ryza <sandy@cloudera.com>
  Closes #5916 from sryza/sandy-spark-4550-cleanup and squashes the following commits: dee3d85 [Sandy Ryza] Some minor cleanup after SPARK-4550.
* [SPARK-7230] [SPARKR] Make RDD private in SparkR.
  Shivaram Venkataraman, 2015-05-05 (6 files, +26/-103)
  This change makes the RDD API private in SparkR; all internal uses of the SparkR API now use SparkR::: to access private functions.
  Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  Closes #5895 from shivaram/rrdd-private and squashes the following commits: bdb2f07 [Shivaram Venkataraman] Make RDD private in SparkR. This change also makes all internal uses of the SparkR API use SparkR::: to access private functions
* [SQL][Minor] make StringComparison extend ExpectsInputTypes
  wangfei, 2015-05-05 (1 file, +6/-7)
  Make StringComparison extend ExpectsInputTypes and add expectedChildTypes, so each subclass no longer needs to override expectedChildTypes.
  Author: wangfei <wangfei1@huawei.com>
  Closes #5905 from scwf/ExpectsInputTypes and squashes the following commits: b374ddf [wangfei] make stringcomparison extends ExpectsInputTypes
* [SPARK-7351] [STREAMING] [DOCS] Add spark.streaming.ui.retainedBatches to docs
  zsxwing, 2015-05-05 (1 file, +7/-0)
  The default value will be changed to `1000` in #5533, so here I just used `1000`.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #5899 from zsxwing/SPARK-7351 and squashes the following commits: e1ec515 [zsxwing] [SPARK-7351][Streaming][Docs] Add spark.streaming.ui.retainedBatches to docs
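  Setting the newly documented knob (a minimal sketch):
  ```python
  from pyspark import SparkConf

  # Cap how many completed batches the streaming UI retains in memory.
  conf = SparkConf().set("spark.streaming.ui.retainedBatches", "1000")
  ```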
* [SPARK-7294] [SQL] ADD BETWEEN
  云峤, 2015-05-05 (4 files, +38/-0)
  Author: 云峤 <chensong.cs@alibaba-inc.com>
  Author: kaka1992 <kaka_1992@163.com>
  Closes #5839 from kaka1992/master and squashes the following commits: b15360d [kaka1992] Fix python unit test in sql/test. =_= I forget to commit this file last time. f928816 [kaka1992] Fix python style in sql/test. d2e7f72 [kaka1992] Fix python style in sql/test. c54d904 [kaka1992] Fix empty map bug. 7e64d1e [云峤] Update 7b9b858 [云峤] undo f080f8d [云峤] update pep8 76f0c51 [云峤] Merge remote-tracking branch 'remotes/upstream/master' 7d62368 [云峤] [SPARK-7294] ADD BETWEEN baf839b [云峤] [SPARK-7294] ADD BETWEEN d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
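  A sketch of the new operator from PySpark (sqlContext assumed):
  ```python
  df = sqlContext.createDataFrame([(1,), (2,), (5,)], ["age"])
  df.filter(df["age"].between(1, 3)).show()  # keeps rows with 1 <= age <= 3
  ```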
* [SPARK-6939] [STREAMING] [WEBUI] Add timeline and histogram graphs for streaming statistics
  zsxwing, 2015-05-05 (17 files, +1228/-244)
  This is the initial work of SPARK-6939. Not yet ready for code review. Here are the screenshots:
  ![graph1](https://cloud.githubusercontent.com/assets/1000778/7165766/465942e0-e3dc-11e4-9b05-c184b09d75dc.png)
  ![graph2](https://cloud.githubusercontent.com/assets/1000778/7165779/53f13f34-e3dc-11e4-8714-a4a75b7e09ff.png)
  TODOs:
  - [x] Display more information on mouse hover
  - [x] Align the timeline and distribution graphs
  - [x] Clean up the codes
  Author: zsxwing <zsxwing@gmail.com>
  Closes #5533 from zsxwing/SPARK-6939 and squashes the following commits: 9f7cd19 [zsxwing] Merge branch 'master' into SPARK-6939 deacc3f [zsxwing] Remove unused import cd03424 [zsxwing] Fix .rat-excludes 70cc87d [zsxwing] Streaming Scheduling Delay => Scheduling Delay d457277 [zsxwing] Fix UIUtils in BatchPage b3f303e [zsxwing] Add comments for unclear classes and methods ff0bff8 [zsxwing] Make InputDStream.name private[streaming] cc392c5 [zsxwing] Merge branch 'master' into SPARK-6939 e275e23 [zsxwing] Move time related methods to Streaming's UIUtils d5d86f6 [zsxwing] Fix incorrect lastErrorTime 3be4b7a [zsxwing] Use InputInfo b50fa32 [zsxwing] Jump to the batch page when clicking a point in the timeline graphs 203605d [zsxwing] Merge branch 'master' into SPARK-6939 74307cf [zsxwing] Reuse the data for histogram graphs to reduce the page size 2586916 [zsxwing] Merge branch 'master' into SPARK-6939 70d8533 [zsxwing] Remove BatchInfo.numRecords and a few renames 7bbdc0a [zsxwing] Hide the receiver sub table if no receiver a2972e9 [zsxwing] Add some ui tests for StreamingPage fd03ad0 [zsxwing] Add a test to verify no memory leak 4a8f886 [zsxwing] Merge branch 'master' into SPARK-6939 18607a1 [zsxwing] Merge branch 'master' into SPARK-6939 d0b0aec [zsxwing] Clean up the codes a459f49 [zsxwing] Add a dash line to processing time graphs 8e4363c [zsxwing] Prepare for the demo c81a1ee [zsxwing] Change time unit in the graphs automatically 4c0b43f [zsxwing] Update Streaming UI 04c7500 [zsxwing] Make the server and client use the same timezone fed8219 [zsxwing] Move the x axis at the top and show a better tooltip c23ce10 [zsxwing] Make two graphs close d78672a [zsxwing] Make the X axis use the same range 881c907 [zsxwing] Use histogram for distribution 5688702 [zsxwing] Fix the unit test ddf741a [zsxwing] Fix the unit test ad93295 [zsxwing] Remove unnecessary codes a0458f9 [zsxwing] Clean the codes b82ed1e [zsxwing] Update the graphs as per comments dd653a1 [zsxwing] Add timeline and histogram graphs for streaming statistics
* [SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer
  Sandy Ryza, 2015-05-05 (2 files, +187/-0)
  This patch adds a one-hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach. A couple of choices made here:
  - There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn.
  - The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can easily be gotten from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName.
  Author: Sandy Ryza <sandy@cloudera.com>
  Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits: f383250 [Sandy Ryza] Infer label names automatically 6e257b9 [Sandy Ryza] Review comments 7c539cf [Sandy Ryza] Vector transformers 1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer
* [SPARK-7333] [MLLIB] Add BinaryClassificationEvaluator to PySpark
  Xiangrui Meng, 2015-05-05 (8 files, +193/-3)
  This PR adds `BinaryClassificationEvaluator` to the Python ML Pipelines API as a simple wrapper of the Scala implementation. oefirouz
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5885 from mengxr/SPARK-7333 and squashes the following commits: 25d7451 [Xiangrui Meng] fix tests in python 3 babdde7 [Xiangrui Meng] fix doc cb51e6a [Xiangrui Meng] add BinaryClassificationEvaluator in PySpark
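  A usage sketch (`predictions` is assumed to be the output DataFrame of a fitted binary classifier):
  ```python
  from pyspark.ml.evaluation import BinaryClassificationEvaluator

  evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                            metricName="areaUnderROC")
  auc = evaluator.evaluate(predictions)
  ```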
* [SPARK-7243] [SQL] Reduce size for Contingency Tables in DataFrames
  Burak Yavuz, 2015-05-05 (3 files, +13/-11)
  Reduced the take size from 1e8 to 1e6. cc rxin
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5900 from brkyvz/df-cont-followup and squashes the following commits: c11e762 [Burak Yavuz] fix grammar b30ace2 [Burak Yavuz] address comments a417ba5 [Burak Yavuz] [SPARK-7243][SQL] Reduce size for Contingency Tables in DataFrames
* [SPARK-7007] [CORE] Add a metric source for ExecutorAllocationManager
  jerryshao, 2015-05-05 (2 files, +32/-0)
  Add a metric source to expose the internal status of ExecutorAllocationManager, to better monitor the resource usage of executors when dynamic allocation is enabled. Please help to review, thanks a lot.
  Author: jerryshao <saisai.shao@intel.com>
  Closes #5589 from jerryshao/dynamic-allocation-source and squashes the following commits: 104d155 [jerryshao] rebase and address the comments c501a2c [jerryshao] Address the comments d237ba5 [jerryshao] Address the comments 2c3540f [jerryshao] Add a metric source for ExecutorAllocationManager
* [SPARK-7318] [STREAMING] DStream cleans objects that are not closures
  Andrew Or, 2015-05-05 (3 files, +9/-3)
  I added a check in `ClosureCleaner#clean` to fail fast if this is detected in the future. tdas
  Author: Andrew Or <andrew@databricks.com>
  Closes #5860 from andrewor14/streaming-closure-cleaner and squashes the following commits: 8e971d7 [Andrew Or] Do not throw exception if object to clean is not closure 5ee4e25 [Andrew Or] Fix tests eed3390 [Andrew Or] Merge branch 'master' of github.com:apache/spark into streaming-closure-cleaner 67eeff4 [Andrew Or] Add tests a4fa768 [Andrew Or] Clean the closure, not the RDD
* [SPARK-7237] Many user provided closures are not actually cleaned
  Andrew Or, 2015-05-05 (5 files, +174/-16)
  Note: ~140 lines are tests. In a nutshell, we never cleaned closures the user provided through the following operations (a sketch of the capture pitfall these operations share follows this entry):
  - sortBy
  - keyBy
  - mapPartitions
  - mapPartitionsWithIndex
  - aggregateByKey
  - foldByKey
  - foreachAsync
  - one of the aliases for runJob
  - runApproximateJob
  For more details on a reproduction and why they were not cleaned, please see [SPARK-7237](https://issues.apache.org/jira/browse/SPARK-7237).
  Author: Andrew Or <andrew@databricks.com>
  Closes #5787 from andrewor14/clean-more and squashes the following commits: 2f1f476 [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-more 7265865 [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-more df3caa3 [Andrew Or] Address comments 7a3cc80 [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-more 6498f44 [Andrew Or] Add missing test for groupBy e83699e [Andrew Or] Clean one more 8ac3074 [Andrew Or] Prevent NPE in tests when CC is used outside of an app 9ac5f9b [Andrew Or] Clean closures that are not currently cleaned 19e33b4 [Andrew Or] Add tests for all public RDD APIs that take in closures
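  For illustration only (the patch itself concerns the Scala ClosureCleaner): the capture pitfall in Python terms, with hypothetical names:
  ```python
  class Pipeline(object):  # hypothetical
      def __init__(self):
          self.huge_cache = list(range(10**6))  # never needed on executors
          self.offset = 1

      def shift(self, rdd):
          # Referencing self.offset inside the lambda would capture (and ship)
          # the entire object, huge_cache included. Copy the field to a local
          # first so only `offset` travels with the task closure.
          offset = self.offset
          return rdd.sortBy(lambda x: x + offset)
  ```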
* [MLLIB] [TREE] Verify size of input rdd > 0 when building meta data
  Alain, 2015-05-05 (1 file, +4/-1)
  Require a non-empty input RDD so that we can take the first LabeledPoint and get the feature size.
  Author: Alain <aihe@usc.edu>
  Author: aihe@usc.edu <aihe@usc.edu>
  Closes #5810 from AiHe/decisiontree-issue and squashes the following commits: 3b1d08a [aihe@usc.edu] [MLLIB][tree] merge the assertion into the evaluation of numFeatures cf2e567 [Alain] [MLLIB][tree] Use a rdd api to verify size of input rdd > 0 when building meta data b448f47 [Alain] [MLLIB][tree] Verify size of input rdd > 0 when building meta data
* Closes #5591
  Xiangrui Meng, 2015-05-05 (0 files, +0/-0)
  Closes #5878
* [SPARK-6612] [MLLIB] [PYSPARK] Python KMeans parity
  Hrishikesh Subramonian, 2015-05-05 (4 files, +46/-8)
  The following items are added to Python KMeans:
  - KMeans: setEpsilon, setInitializationSteps
  - KMeansModel: computeCost, k
  Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>
  Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits: b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test 5fd3ced [Hrishikesh Subramonian] doc test corrections 20b3c68 [Hrishikesh Subramonian] python 3 fixes 4d4e695 [Hrishikesh Subramonian] added arguments in python tests 21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
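  A sketch of the newly exposed pieces (in Python they surface as train() keyword arguments and model members; sc assumed):
  ```python
  from pyspark.mllib.clustering import KMeans
  from pyspark.mllib.linalg import Vectors

  data = sc.parallelize([Vectors.dense([0.0]), Vectors.dense([1.0]),
                         Vectors.dense([9.0]), Vectors.dense([10.0])])
  model = KMeans.train(data, k=2, maxIterations=10,
                       initializationSteps=5, epsilon=1e-4)
  print(model.k)                  # number of clusters
  print(model.computeCost(data))  # sum of squared distances to nearest center
  ```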
* [SPARK-7202] [MLLIB] [PYSPARK] Add SparseMatrixPickler to SerDe
  MechCoder, 2015-05-05 (4 files, +72/-3)
  Utilities for pickling and unpickling SparseMatrices using SerDe.
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #5775 from MechCoder/spark-7202 and squashes the following commits: 7e689dc [MechCoder] [SPARK-7202] Add SparseMatrixPickler to SerDe
* [SPARK-7350] [STREAMING] [WEBUI] Attach the Streaming tab when calling ssc.start()
  zsxwing, 2015-05-05 (2 files, +5/-1)
  It's meaningless to display the Streaming tab before `ssc.start()`, so we should attach it in the `ssc.start` method.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #5898 from zsxwing/SPARK-7350 and squashes the following commits: e676487 [zsxwing] Attach the Streaming tab when calling ssc.start()