...
* [SPARK-10144] [UI] Actually show peak execution memory by default
  Andrew Or, 2015-08-24 (2 files changed, -6/+8)
  The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. The result is that the metric is not displayed by default.
  Author: Andrew Or <andrew@databricks.com>
  Closes #8345 from andrewor14/show-memory-default.
* [SPARK-7710] [SPARK-7998] [DOCS] Docs for DataFrameStatFunctions
  Burak Yavuz, 2015-08-24 (2 files changed, -1/+102)
  This PR contains examples of how to use some of the stat functions available for DataFrames under `df.stat`. rxin
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #8378 from brkyvz/update-sql-docs.
* [SPARK-9791] [PACKAGE] Change private class to private[package] class to prevent unnecessary classes from showing up in the docs
  Tathagata Das, 2015-08-24 (13 files changed, -54/+28)
  In addition, some random cleanup of import ordering.
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #8387 from tdas/SPARK-9791 and squashes the following commits: 67f3ee9 [Tathagata Das] Change private class to private[package] class to prevent them from showing up in the docs
* [SPARK-10168] [STREAMING] Fix the issue that maven publishes wrong artifact jars
  zsxwing, 2015-08-24 (5 files changed, -25/+26)
  This PR removed the `outputFile` configuration from pom.xml and updated `tests.py` to search jars for both the sbt build and the maven build. I ran `mvn -Pkinesis-asl -DskipTests clean install` locally and verified the jars in my local repository were correct. I also checked Python tests for the maven build, and it passed all tests.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits: e0b5818 [zsxwing] Fix the sbt build c697627 [zsxwing] Add the jar pathes to the exception message be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars
* [SPARK-10142] [STREAMING] Made python checkpoint recovery handle non-local checkpoint paths and existing SparkContexts
  Tathagata Das, 2015-08-23 (3 files changed, -16/+58)
  The current code only checks checkpoint files in the local filesystem, and always tries to create a new Python SparkContext (even if one already exists). The solution is to do the following:
  1. Use the same code path as Java to check whether a valid checkpoint exists.
  2. Create a new Python SparkContext only if there is no active one.
  There is no automated test for the non-local path, as it is hard to test with distributed filesystem paths in a local unit test. I am going to test it with a distributed file system manually to verify that this patch works.
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #8366 from tdas/SPARK-10142 and squashes the following commits: 3afa666 [Tathagata Das] Added tests 2dd4ae5 [Tathagata Das] Added the check to not create a context if one already exists 9bf151b [Tathagata Das] Made python checkpoint recovery use java to find the checkpoint files
* [SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bug
  Joseph K. Bradley, 2015-08-23 (2 files changed, -9/+35)
  GaussianMixture now distributes matrix decompositions for certain problem sizes, but the distributed computation actually failed and was not covered by unit tests. This PR adds a unit test which checks this; it failed previously but works with this fix. CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8370 from jkbradley/gmm-fix.
* [SPARK-10148] [STREAMING] Display active and inactive receiver numbers in Streaming page
  zsxwing, 2015-08-23 (2 files changed, -0/+14)
  Added the active and inactive receiver numbers to the summary section of the Streaming page. Screenshot: https://cloud.githubusercontent.com/assets/1000778/9402437/ff2806a2-480f-11e5-8f8e-efdf8e5d514d.png
  Author: zsxwing <zsxwing@gmail.com>
  Closes #8351 from zsxwing/receiver-number.
* Update streaming-programming-guide.md
  Keiji Yoshida, 2015-08-23 (1 file changed, -1/+1)
  Update `See the Scala example` to `See the Java example`.
  Author: Keiji Yoshida <yoshida.keiji.84@gmail.com>
  Closes #8376 from yosssi/patch-1.
* [SPARK-9401] [SQL] Fully implement code generation for ConcatWs
  Yijie Shen, 2015-08-22 (1 file changed, -3/+39)
  This PR adds full codegen support for ConcatWs, as a substitute for #7782. JIRA: https://issues.apache.org/jira/browse/SPARK-9401 cc davies
  Author: Yijie Shen <henry.yijieshen@gmail.com>
  Closes #8353 from yjshen/concatws.
* Update programming-guide.md
  Keiji Yoshida, 2015-08-22 (1 file changed, -1/+1)
  Update `lineLengths.persist();` to `lineLengths.persist(StorageLevel.MEMORY_ONLY());` because `JavaRDD#persist` needs a `StorageLevel` parameter.
  Author: Keiji Yoshida <yoshida.keiji.84@gmail.com>
  Closes #8372 from yosssi/patch-1.
* [SPARK-9893] User guide with Java test suite for VectorSlicer
  Xusen Yin, 2015-08-21 (2 files changed, -0/+218)
  Add a user guide for `VectorSlicer`, with a Java test suite and the Python version of VectorSlicer. Note that the Python version does not yet support selecting by name.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #8267 from yinxusen/SPARK-9893.
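  A condensed Scala sketch in the spirit of the new guide (reconstructed, not quoted from the patch; `sc` and `sqlContext` are assumed to be in scope):

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Build a one-row DataFrame whose vector column carries named attributes.
val attrs = Array("f1", "f2", "f3")
  .map(n => NumericAttribute.defaultAttr.withName(n).asInstanceOf[Attribute])
val group = new AttributeGroup("userFeatures", attrs)
val data = sc.parallelize(Seq(Row(Vectors.dense(-2.0, 2.3, 0.0))))
val df = sqlContext.createDataFrame(data, StructType(Array(group.toStructField())))

// Select the second feature by index and the third by name.
val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
slicer.setIndices(Array(1)).setNames(Array("f3"))
slicer.transform(df).show()
```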
* [SPARK-10163] [ML] Allow single-category features for GBT models
  Joseph K. Bradley, 2015-08-21 (1 file changed, -5/+0)
  Removed the categorical feature info validation, since it is no longer needed. This is needed to make the ML user guide examples work (in another current PR). CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8367 from jkbradley/gbt-single-cat.
* [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary
  Yin Huai, 2015-08-21 (1 file changed, -2/+39)
  https://issues.apache.org/jira/browse/SPARK-10143
  With this PR, we set the min split size to Parquet's block size (row group size) from the conf whenever the configured min split size is smaller. This avoids creating too many tasks, including useless tasks, for reading Parquet data.
  I tested it locally. My table is 343MB and sits in my local FS. Because I did not set any min/max split size, the default split size was 32MB and the map stage had 11 tasks, but only three of those tasks actually read data. With this PR, the map stage has only three tasks. Screenshots, without this PR: https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png and with this PR: https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png
  Even if the block size setting does not match the actual block size of the Parquet file, I think it is still generally good to use Parquet's block size setting when the min split size is smaller than that block size.
  Tested it on a cluster using
  ```
  val count = sqlContext.table("""store_sales""").groupBy().count().queryExecution.executedPlan(3).execute().count
  ```
  Basically, it reads 0 columns of table `store_sales`. My table has 1824 Parquet files with sizes from 80MB to 280MB (1 to 3 row groups). Without this patch, on a 16-worker cluster, the job had 5023 tasks and took 102s. With this patch, the job had 2893 tasks and took 64s. It is still not as good as using one mapper per file (1824 tasks and 42s), but it is much better than our master.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #8346 from yhuai/parquetMinSplit.
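  A minimal sketch of the clamping idea (the Hadoop conf keys shown are my assumption of the relevant ones, not quoted from the patch):

```scala
import org.apache.hadoop.conf.Configuration

// If the configured minimum split size is below Parquet's row group size,
// raise it so splits align with row groups and we don't schedule map tasks
// that end up reading no row group at all.
def clampMinSplitSize(conf: Configuration): Unit = {
  val parquetBlockSize = conf.getLong("parquet.block.size", 128L * 1024 * 1024)
  val minSplitSize = conf.getLong("mapred.min.split.size", 0L)
  if (minSplitSize < parquetBlockSize) {
    conf.setLong("mapred.min.split.size", parquetBlockSize)
  }
}
```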
* [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc with Since annotation
  MechCoder, 2015-08-21 (68 files changed, -862/+692)
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #8352 from MechCoder/since.
* [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in PySpark-Streaming transform function
  jerryshao, 2015-08-21 (2 files changed, -2/+7)
  Details of the bug and explanations can be seen in [SPARK-10122](https://issues.apache.org/jira/browse/SPARK-10122). tdas, please help to review.
  Author: jerryshao <sshao@hortonworks.com>
  Closes #8347 from jerryshao/SPARK-10122 and squashes the following commits: 4039b16 [jerryshao] Fix getOffsetRanges in transform() bug
* [SPARK-10130] [SQL] type coercion for IF should have children resolved first
  Daoyuan Wang, 2015-08-21 (2 files changed, -0/+8)
  Type coercion for IF should have its children resolved first; otherwise we can hit an unresolved-expression exception.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #8331 from adrian-wang/spark10130.
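  A sketch of the guard pattern the fix relies on (simplified; `widerOf` is a hypothetical stand-in for the analyzer's real common-type lookup):

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, If}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.DataType

object IfCoercionSketch extends Rule[LogicalPlan] {
  private def widerOf(a: DataType, b: DataType): DataType = a // placeholder

  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // The fix in miniature: bail out while children are unresolved, so we
    // never ask an unresolved child for its dataType.
    case e: If if !e.childrenResolved => e
    case If(pred, left, right) if left.dataType != right.dataType =>
      val t = widerOf(left.dataType, right.dataType)
      If(pred, Cast(left, t), Cast(right, t))
  }
}
```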
* [SPARK-9439] [YARN] External shuffle service robust to NM restarts using leveldb
  Imran Rashid, 2015-08-21 (21 files changed, -215/+1031)
  https://issues.apache.org/jira/browse/SPARK-9439
  In general, Yarn apps should be robust to NodeManager restarts. However, if you run Spark with the external shuffle service on, all shuffles fail after a NM restart, because the shuffle service has lost the state describing each executor. (Note the shuffle data is perfectly fine on disk across a NM restart; the problem is we've lost the small bit of state that lets us *find* those files.)
  The solution proposed here is that the external shuffle service can write out its state to leveldb (backed by a local file) every time an executor is added. When running with yarn, that file is in the NM's local dir. Whenever the service is started, it looks for that file, and if it exists, it reads the file and re-registers all executors there.
  Nothing is changed in non-yarn modes with this patch. The service is not given a place to save the state to, so it operates the same as before. This should make it easy to update other cluster managers as well, by just supplying the right file and the equivalent of yarn's `initializeApplication` -- I'm not familiar enough with those modes to know how to do that.
  Author: Imran Rashid <irashid@cloudera.com>
  Closes #7943 from squito/leveldb_external_shuffle_service_NM_restart and squashes the following commits: 0d285d3 [Imran Rashid] review feedback 70951d6 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart 5c71c8c [Imran Rashid] save executor to db before registering; style 2499c8c [Imran Rashid] explicit dependency on jackson-annotations 795d28f [Imran Rashid] review feedback 81f80e2 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart 594d520 [Imran Rashid] use json to serialize application executor info 1a7980b [Imran Rashid] version 8267d2a [Imran Rashid] style e9f99e8 [Imran Rashid] cleanup the handling of bad dbs a little 9378ba3 [Imran Rashid] fail gracefully on corrupt leveldb files acedb62 [Imran Rashid] switch to writing out one record per executor 79922b7 [Imran Rashid] rely on yarn to call stopApplication; assorted cleanup 12b6a35 [Imran Rashid] save registered executors when apps are removed; add tests c878fbe [Imran Rashid] better explanation of shuffle service port handling 694934c [Imran Rashid] only open leveldb connection once per service d596410 [Imran Rashid] store executor data in leveldb 59800b7 [Imran Rashid] Files.move in case renaming is unsupported 32fe5ae [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart d7450f0 [Imran Rashid] style f729e2b [Imran Rashid] debugging 4492835 [Imran Rashid] lol, dont use a PrintWriter b/c of scalastyle checks 0a39b98 [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart 55f49fc [Imran Rashid] make sure the service doesnt die if the registered executor file is corrupt; add tests 245db19 [Imran Rashid] style 62586a6 [Imran Rashid] just serialize the whole executors map bdbbf0d [Imran Rashid] comments, remove some unnecessary changes 857331a [Imran Rashid] better tests & comments bb9d1e6 [Imran Rashid] formatting bdc4b32 [Imran Rashid] rename 86e0cb9 [Imran Rashid] for tests, shuffle service finds an open port 23994ff [Imran Rashid] style 7504de8 [Imran Rashid] style a36729c [Imran Rashid] cleanup efb6195 [Imran Rashid] proper unit test, and no longer leak if apps stop during NM restart dd93dc0 [Imran Rashid] test for shuffle service w/ NM restarts d596969 [Imran Rashid] cleanup imports 0e9d69b [Imran Rashid] better names 9eae119 [Imran Rashid] cleanup lots of duplication 1136f44 [Imran Rashid] test needs to have an actual shuffle 0b588bd [Imran Rashid] more fixes ... ad122ef [Imran Rashid] more fixes 5e5a7c3 [Imran Rashid] fix build c69f46b [Imran Rashid] maybe working version, needs tests & cleanup ... bb3ba49 [Imran Rashid] minor cleanup 36127d3 [Imran Rashid] wip b9d2ced [Imran Rashid] incomplete setup for external shuffle service tests
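  A heavily hedged sketch of the persistence idea, using the `org.iq80.leveldb` API that the patch's leveldbjni dependency exposes (the key/value layout and class name are illustrative, not from the patch):

```scala
import java.io.File
import java.nio.charset.StandardCharsets.UTF_8

import org.fusesource.leveldbjni.JniDBFactory
import org.iq80.leveldb.{DB, Options}

// One record per registered executor, keyed by (appId, execId); on service
// start, replay every record so shuffle files can be located again.
class RegisteredExecutorsDb(file: File) {
  private val db: DB =
    JniDBFactory.factory.open(file, new Options().createIfMissing(true))

  def saveExecutor(appId: String, execId: String, infoJson: String): Unit =
    db.put(s"$appId;$execId".getBytes(UTF_8), infoJson.getBytes(UTF_8))

  def reloadAll(register: (String, String) => Unit): Unit = {
    val it = db.iterator()
    it.seekToFirst()
    while (it.hasNext) {
      val entry = it.next()
      register(new String(entry.getKey, UTF_8), new String(entry.getValue, UTF_8))
    }
  }

  def close(): Unit = db.close()
}
```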
* [SPARK-10040] [SQL] Use batch insert for JDBC writing
  Liang-Chi Hsieh, 2015-08-21 (1 file changed, -3/+14)
  JIRA: https://issues.apache.org/jira/browse/SPARK-10040
  We should use batch inserts instead of single-row inserts in JDBC.
  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #8273 from viirya/jdbc-insert-batch.
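  The change boils down to the standard JDBC batching pattern; a minimal sketch (URL, table, and data are placeholders):

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://localhost/test")
val stmt = conn.prepareStatement("INSERT INTO people (name, age) VALUES (?, ?)")
try {
  Seq(("alice", 30), ("bob", 25)).foreach { case (name, age) =>
    stmt.setString(1, name)
    stmt.setInt(2, age)
    stmt.addBatch()          // queue the row client-side
  }
  stmt.executeBatch()        // flush all queued rows in one round trip
} finally {
  stmt.close()
  conn.close()
}
```

  One statement execution per batch instead of one per row is what makes JDBC writes tolerable over high-latency links.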
* [SPARK-9846] [DOCS] User guide for Multilayer Perceptron Classifier
  Alexander Ulanov, 2015-08-20 (2 files changed, -0/+124)
  Added a user guide for the multilayer perceptron classifier:
  - Simplified description of the multilayer perceptron classifier
  - Example code for Scala and Java
  Author: Alexander Ulanov <nashb@yandex.ru>
  Closes #8262 from avulanov/SPARK-9846-mlpc-docs.
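  A condensed Scala sketch in the spirit of the new guide (layer sizes and parameters are illustrative; `train` and `test` DataFrames are assumed to be in scope):

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Layers: 4 input features, two hidden layers of 5 and 4 units, 3 classes.
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(Array[Int](4, 5, 4, 3))
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = trainer.fit(train)
val predictions = model.transform(test)
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision: " + evaluator.evaluate(predictions.select("prediction", "label")))
```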
* [SPARK-10140] [DOC] add target fields to @Since
  Xiangrui Meng, 2015-08-20 (1 file changed, -0/+2)
  So constructor parameters and public fields can be annotated. rxin MechCoder
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8344 from mengxr/SPARK-10140.2.
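  The mechanism, as I understand it, is Scala's meta-annotations; a sketch (simplified relative to the real `org.apache.spark.annotation.Since`, and an assumption about this commit rather than a quote from it):

```scala
import scala.annotation.StaticAnnotation
import scala.annotation.meta.{field, getter, param}

// Meta-annotations on the annotation class widen where the compiler will
// attach it, so constructor parameters, fields, and getters can all carry it.
@param @field @getter
class Since(version: String) extends StaticAnnotation
```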
* [SPARK-9400] [SQL] codegen for StringLocate
  Tarek Auel, 2015-08-20 (1 file changed, -1/+27)
  This is based on #7779, thanks to tarekauel. Fixes the conflict and nullability. Closes #7779 and #8274.
  Author: Tarek Auel <tarek.auel@googlemail.com>
  Author: Davies Liu <davies@databricks.com>
  Closes #8330 from davies/stringLocate.
* [SPARK-9245] [MLLIB] LDA topic assignments
  Joseph K. Bradley, 2015-08-20 (4 files changed, -7/+74)
  For each (document, term) pair, return the top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term rather than per token. CC: rotationsymmetry mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8329 from jkbradley/lda-topic-assignments.
* [SPARK-10108] Add since tags to mllib.feature
  MechCoder, 2015-08-20 (9 files changed, -11/+76)
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #8309 from MechCoder/tags_feature.
* [SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add Java test suite
  Xiangrui Meng, 2015-08-20 (2 files changed, -27/+101)
  Otherwise, the setters do not return the self type. jkbradley avulanov
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8342 from mengxr/SPARK-10138.
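  The self-type-returning setter pattern in miniature (class and member names are hypothetical):

```scala
abstract class ClassifierBase {
  protected var maxIter: Int = 100
  def setMaxIter(value: Int): this.type = { maxIter = value; this }
}

class PerceptronLike extends ClassifierBase {
  private var layers: Array[Int] = Array.empty
  def setLayers(value: Array[Int]): this.type = { layers = value; this }
}

// Chaining works in either order because each setter returns the concrete type:
val p = new PerceptronLike().setMaxIter(50).setLayers(Array(4, 3))
```

  Defining the setters directly on `MultilayerPerceptronClassifier` also gives Java callers a concrete return type to chain on, which `this.type` alone does not provide across the Java boundary.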
* [SQL] [MINOR] remove unnecessary class
  Wenchen Fan, 2015-08-20 (1 file changed, -64/+0)
  This class is identical to `org.apache.spark.sql.execution.datasources.jdbc.DefaultSource` and is not needed.
  Author: Wenchen Fan <cloud0fan@outlook.com>
  Closes #8334 from cloud-fan/minor.
* [SPARK-10126] [PROJECT INFRA] Fix typo in release-build.sh which broke snapshot publishing for Scala 2.11
  Josh Rosen, 2015-08-20 (1 file changed, -2/+2)
  The current `release-build.sh` has a typo which breaks snapshot publication for Scala 2.11. We should change the Scala version to 2.11 and clean before building a 2.11 snapshot.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #8325 from JoshRosen/fix-2.11-snapshots.
* [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array
  Cheng Lian, 2015-08-20 (13 files changed, -844/+1718)
  I caught SPARK-10136 while adding more test cases to `ParquetAvroCompatibilitySuite`. The actual bug fix lies in `CatalystRowConverter.scala`.
  Author: Cheng Lian <lian@databricks.com>
  Closes #8341 from liancheng/spark-10136/parquet-avro-nested-primitive-array.
* [SPARK-9982] [SPARKR] SparkR DataFrame fail to return data of Decimal type
  Alex Shkurenko, 2015-08-20 (1 file changed, -0/+5)
  Author: Alex Shkurenko <ashkurenko@enova.com>
  Closes #8239 from ashkurenko/master.
* [MINOR] [SQL] Fix sphinx warnings in PySpark SQL
  MechCoder, 2015-08-20 (2 files changed, -5/+7)
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #8171 from MechCoder/sql_sphinx.
* [SPARK-10100] [SQL] Eliminate hash table lookup if there is no grouping key in aggregation
  Reynold Xin, 2015-08-20 (2 files changed, -10/+22)
  This improves performance by roughly 20-30% in one of my local tests and should fix the performance regression from 1.4 to 1.5 on ss_max.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #8332 from rxin/SPARK-10100.
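  A toy model of why the lookup disappears (types simplified, counting standing in for the aggregate):

```scala
import scala.collection.mutable

def count(rows: Iterator[Seq[Any]], groupingKey: Option[Seq[Any] => Any]): Map[Any, Long] =
  groupingKey match {
    case None =>
      // No grouping key: every row belongs to the single global group, so one
      // shared buffer suffices and no hash table is touched per row.
      var buffer = 0L
      rows.foreach(_ => buffer += 1)
      Map[Any, Long]("all" -> buffer)
    case Some(key) =>
      // General path: probe/insert into a hash map for every input row.
      val groups = mutable.Map.empty[Any, Long].withDefaultValue(0L)
      rows.foreach(r => groups(key(r)) += 1)
      groups.toMap
  }
```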
* [SPARK-10092] [SQL] Multi-DB support follow up
  Yin Huai, 2015-08-20 (16 files changed, -94/+398)
  https://issues.apache.org/jira/browse/SPARK-10092
  This PR is a follow-up for multi-DB support. It has the following changes:
  * `HiveContext.refreshTable` now accepts `dbName.tableName`.
  * `HiveContext.analyze` now accepts `dbName.tableName`.
  * `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateTempTableUsing`, `CreateTempTableUsingAsSelect`, `CreateMetastoreDataSource`, and `CreateMetastoreDataSourceAsSelect` all take `TableIdentifier` instead of the string representation of the table name.
  * When you call `saveAsTable` with a specified database, the data will be saved to the correct location.
  * Explicitly disallow creating a temporary table with a specified database name (users could not do this before either).
  * When we save a table to the metastore, we also check whether the db name and table name are acceptable to Hive (using `MetaStoreUtils.validateName`).
  Author: Yin Huai <yhuai@databricks.com>
  Closes #8324 from yhuai/saveAsTableDB.
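  A hedged usage sketch of the entry points named above (`hiveContext` and `df` assumed in scope; database and table names are placeholders):

```scala
// Qualified db.table names are now accepted:
hiveContext.refreshTable("mydb.mytable")
hiveContext.analyze("mydb.mytable")

// Writing with a qualified name saves the data under mydb's location:
df.write.saveAsTable("mydb.mytable")
```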
* [SPARK-10128] [STREAMING] Used correct classloader to deserialize WAL data
  Tathagata Das, 2015-08-19 (1 file changed, -2/+3)
  Recovering Kinesis sequence numbers from the WAL leads to a ClassNotFoundException because the ObjectInputStream does not use the correct classloader, so the SequenceNumberRanges class (in the streaming-kinesis-asl package, added through spark-submit) cannot be found while deserializing. The solution is to use `Thread.currentThread().getContextClassLoader` while deserializing.
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #8328 from tdas/SPARK-10128 and squashes the following commits: f19b1c2 [Tathagata Das] Used correct classloader to deserialize WAL data
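  The fix amounts to the standard context-classloader override for `ObjectInputStream`; a sketch:

```scala
import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}

def deserializingStream(in: InputStream): ObjectInputStream = {
  // The context classloader sees jars added via spark-submit; the default
  // loader that ObjectInputStream falls back to does not.
  val loader = Thread.currentThread().getContextClassLoader
  new ObjectInputStream(in) {
    override def resolveClass(desc: ObjectStreamClass): Class[_] =
      Class.forName(desc.getName, false, loader)
  }
}
```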
* [SPARK-10124] [MESOS] Fix removing queued driver in mesos cluster mode
  Timothy Chen, 2015-08-19 (1 file changed, -8/+11)
  Spark applications can be queued to the Mesos cluster dispatcher, but when multiple jobs are in the queue we don't handle removing jobs from the buffer correctly while iterating over it, which causes a NullPointerException. This patch copies the buffer before iterating over it, so exceptions aren't thrown when jobs are removed.
  Author: Timothy Chen <tnachen@gmail.com>
  Closes #8322 from tnachen/fix_cluster_mode.
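  A minimal model of the copy-before-iterate fix (buffer contents are placeholders):

```scala
import scala.collection.mutable.ArrayBuffer

val queuedDrivers = ArrayBuffer("driver-1", "driver-2", "driver-3")

def removeMatching(predicate: String => Boolean): Unit = {
  // Iterate over a snapshot so removals from the live buffer cannot
  // invalidate the iteration in progress.
  val snapshot = queuedDrivers.toList
  snapshot.foreach { d =>
    if (predicate(d)) queuedDrivers -= d
  }
}
```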
* [SPARK-10125] [STREAMING] Fix a potential deadlock in JobGenerator.stop
  zsxwing, 2015-08-19 (1 file changed, -0/+4)
  Because a `lazy val` uses the `this` lock, if JobGenerator.stop and JobGenerator.doCheckpoint (when JobGenerator.shouldCheckpoint has not yet been initialized) run at the same time, it may hang. Here are the stack traces for the deadlock:
  ```Java
  "pool-1-thread-1-ScalaTest-running-StreamingListenerSuite" #11 prio=5 os_prio=31 tid=0x00007fd35d094800 nid=0x5703 in Object.wait() [0x000000012ecaf000]
     java.lang.Thread.State: WAITING (on object monitor)
      at java.lang.Object.wait(Native Method)
      at java.lang.Thread.join(Thread.java:1245)
      - locked <0x00000007b5d8d7f8> (a org.apache.spark.util.EventLoop$$anon$1)
      at java.lang.Thread.join(Thread.java:1319)
      at org.apache.spark.util.EventLoop.stop(EventLoop.scala:81)
      at org.apache.spark.streaming.scheduler.JobGenerator.stop(JobGenerator.scala:155)
      - locked <0x00000007b5d8cea0> (a org.apache.spark.streaming.scheduler.JobGenerator)
      at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:95)
      - locked <0x00000007b5d8ced8> (a org.apache.spark.streaming.scheduler.JobScheduler)
      at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:687)

  "JobGenerator" #67 daemon prio=5 os_prio=31 tid=0x00007fd35c3b9800 nid=0x9f03 waiting for monitor entry [0x0000000139e4a000]
     java.lang.Thread.State: BLOCKED (on object monitor)
      at org.apache.spark.streaming.scheduler.JobGenerator.shouldCheckpoint$lzycompute(JobGenerator.scala:63)
      - waiting to lock <0x00000007b5d8cea0> (a org.apache.spark.streaming.scheduler.JobGenerator)
      at org.apache.spark.streaming.scheduler.JobGenerator.shouldCheckpoint(JobGenerator.scala:63)
      at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:290)
      at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:182)
      at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
      at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  ```
  I can use this patch to produce this deadlock: https://github.com/zsxwing/spark/commit/8a88f28d1331003a65fabef48ae3d22a7c21f05f And a timeout build in Jenkins due to this deadlock: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1654/
  This PR initializes `checkpointWriter` before `eventLoop` uses it to avoid this deadlock.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #8326 from zsxwing/SPARK-10125.
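  The hazard in miniature (Scala 2.10/2.11 lazy val semantics; class and member names are illustrative):

```scala
class GeneratorLike {
  // First access to a lazy val synchronizes on `this` while the initializer runs.
  lazy val shouldCheckpoint: Boolean = {
    Thread.sleep(100) // window during which another thread can call stop()
    true
  }

  // Also locks `this`. If stop() additionally waits for the thread that is
  // mid-initialization of the lazy val, neither side can proceed.
  def stop(): Unit = synchronized {
    // e.g. join the event-loop thread here...
  }
}
```

  Forcing the initialization to happen before the event loop starts removes the window in which the two locks can be taken in opposite orders.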
* [SPARK-9812] [STREAMING] Fix Python 3 compatibility issue in PySpark Streaming and some docs
  zsxwing, 2015-08-19 (8 files changed, -14/+23)
  This PR includes the following fixes:
  1. Use `range` instead of `xrange` in `queue_stream.py` to support Python 3.
  2. Fix the issue that `utf8_decoder` will return `bytes` rather than `str` when receiving an empty `bytes` in Python 3.
  3. Fix the commands in the docs so that users can copy them directly to the command line. The previous commands were broken in the middle of a path, so when copying to the command line the path would be split into two parts by the extra spaces, forcing the user to fix it manually.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #8315 from zsxwing/SPARK-9812.
* [SPARK-9242] [SQL] Audit UDAF interface
  Reynold Xin, 2015-08-19 (18 files changed, -349/+386)
  A few minor changes:
  1. Improved documentation.
  2. Renamed apply(distinct....) to distinct.
  3. Changed MutableAggregationBuffer from a trait to an abstract class.
  4. Renamed returnDataType to dataType to be more consistent with other expressions.
  And unrelated to UDAFs:
  1. Renamed file names in expressions to use the suffix "Expressions" to be more consistent.
  2. Moved regexp-related expressions out to their own file.
  3. Renamed StringComparison => StringPredicate.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #8321 from rxin/SPARK-9242.
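  For orientation, the audited interface as used from Scala: a minimal sum aggregate against the `UserDefinedAggregateFunction` API (the aggregate itself is an illustration; note `dataType`, the name this audit settled on):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class LongSum extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType // renamed from returnDataType in this audit
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0L }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  }
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}
```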
* [SPARK-10035] [SQL] Parquet filters do not process EqualNullSafe filter
  hyukjinkwon, 2015-08-20 (2 files changed, -139/+37)
  As I discussed with Lian:
  1. Added EqualNullSafe to ParquetFilters. It uses the same equality comparison filter as EqualTo, since the Parquet filter actually performs null-safe equality comparison.
  2. Updated the test code (ParquetFilterSuite): convert catalyst.Expression to sources.Filter; removed Cast, since only Literal is picked up as a proper Filter in DataSourceStrategy; added an EqualNullSafe comparison.
  3. Removed the deprecated createFilter for catalyst.Expression.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Author: 권혁진 <gurwls223@gmail.com>
  Closes #8275 from HyukjinKwon/master.
* [SPARK-9895] User Guide for RFormula Feature Transformer
  Eric Liang, 2015-08-19 (2 files changed, -2/+110)
  mengxr
  Author: Eric Liang <ekl@databricks.com>
  Closes #8293 from ericl/docs-2.
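  The guide's core idea, condensed into a hedged Scala sketch (a `dataset` with `clicked`, `country`, and `hour` columns is assumed in scope):

```scala
import org.apache.spark.ml.feature.RFormula

// The R-style formula picks the label column and assembles the feature vector,
// encoding categorical columns along the way.
val formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

val output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
```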
* [SPARK-6489] [SQL] add column pruning for Generate
  Wenchen Fan, 2015-08-19 (3 files changed, -2/+100)
  This PR takes over https://github.com/apache/spark/pull/5358
  Author: Wenchen Fan <cloud0fan@outlook.com>
  Closes #8268 from cloud-fan/6489.
* [SPARK-10119] [CORE] Fix isDynamicAllocationEnabled when config is explicitly disabled
  Marcelo Vanzin, 2015-08-19 (2 files changed, -1/+15)
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #8316 from vanzin/SPARK-10119.
* [SPARK-10083] [SQL] CaseWhen should support type coercion of DecimalType and FractionalType
  Daoyuan Wang, 2015-08-19 (2 files changed, -2/+13)
  Examples:
  create t1 (a decimal(7, 2), b long);
  select case when 1=1 then a else 1.0 end from t1;
  select case when 1=1 then a else b end from t1;
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #8270 from adrian-wang/casewhenfractional.
* [SPARK-9899] [SQL] Disables customized output committer when speculation is on
  Cheng Lian, 2015-08-19 (2 files changed, -1/+49)
  Speculation hates direct output committers, as there are multiple corner cases that may cause data corruption and/or data loss. Please see this [PR comment][1] for more details.
  [1]: https://github.com/apache/spark/pull/8191#issuecomment-131598385
  Author: Cheng Lian <lian@databricks.com>
  Closes #8317 from liancheng/spark-9899/speculation-hates-direct-output-committer.
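  A sketch of the guard (the `spark.speculation` key is real; `sparkConf` and `userCommitterClass` are placeholders, not names from the patch):

```scala
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// When speculation is on, ignore any user-configured direct committer and use
// the default FileOutputCommitter, whose two-phase commit tolerates duplicate
// speculative attempts.
val speculationEnabled = sparkConf.getBoolean("spark.speculation", false)
val committerClass: Class[_] =
  if (speculationEnabled) classOf[FileOutputCommitter] else userCommitterClass
```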
* [SPARK-10090] [SQL] fix decimal scale of division
  Davies Liu, 2015-08-19 (6 files changed, -31/+157)
  We should round the result of a decimal multiply/division to the expected precision/scale, and also check for overflow.
  Author: Davies Liu <davies@databricks.com>
  Closes #8287 from davies/decimal_division.
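  For reference, the Hive-style result type for decimal division, as I read the rule Spark follows (my summary, not quoted from the patch):

```scala
// For decimal(p1, s1) / decimal(p2, s2):
//   scale     = max(6, s1 + p2 + 1)
//   precision = p1 - s1 + s2 + scale
def divisionResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val scale = math.max(6, s1 + p2 + 1)
  val precision = p1 - s1 + s2 + scale
  (precision, scale)
}

// e.g. decimal(7, 2) / decimal(5, 0) yields decimal(13, 8):
assert(divisionResultType(7, 2, 5, 0) == (13, 8))
```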
* [SPARK-9627] [SQL] Stops using Scala runtime reflection in DictionaryEncoding
  Cheng Lian, 2015-08-19 (2 files changed, -12/+4)
  `DictionaryEncoding` uses Scala runtime reflection to avoid boxing costs while building the dictionary array. However, this code path may hit [SI-6240][1] and throw an exception.
  [1]: https://issues.scala-lang.org/browse/SI-6240
  Author: Cheng Lian <lian@databricks.com>
  Closes #8306 from liancheng/spark-9627/in-memory-cache-scala-reflection.
* [SPARK-10073] [SQL] Python withColumn should replace the old column
  Davies Liu, 2015-08-19 (3 files changed, -7/+12)
  `DataFrame.withColumn` in Python should be consistent with the Scala one (replacing an existing column that has the same name). cc marmbrus
  Author: Davies Liu <davies@databricks.com>
  Closes #8300 from davies/with_column.
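  The behavior after the fix, shown from Scala (a `df` with an `age` column is assumed; the Python API now matches):

```scala
// Adding a column whose name already exists replaces it instead of appending
// a duplicate; df2 has exactly one "age" column, holding the new values.
val df2 = df.withColumn("age", df("age") + 1)
```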
* [SPARK-10107] [SQL] fix NPE in format_number
  Davies Liu, 2015-08-19 (2 files changed, -3/+3)
  Author: Davies Liu <davies@databricks.com>
  Closes #8305 from davies/format_number.
* [SPARK-8889] [CORE] Fix for OOM for graph creation
  Joshi, 2015-08-19 (2 files changed, -11/+51)
  Author: Joshi <rekhajoshm@gmail.com>
  Author: Rekha Joshi <rekhajoshm@gmail.com>
  Closes #7602 from rekhajoshm/SPARK-8889.
* [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering
  Xiangrui Meng, 2015-08-19 (9 files changed, -52/+338)
  This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659). MechCoder
  Closes #8256
  Author: Xiangrui Meng <meng@databricks.com>
  Author: Xiaoqing Wang <spark445@126.com>
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #8288 from mengxr/SPARK-8918.
* [SPARK-10106] [SPARKR] Add `ifelse` Column function to SparkR
  Yu ISHIKAWA, 2015-08-19 (3 files changed, -1/+22)
  JIRA: [SPARK-10106 Add `ifelse` Column function to SparkR](https://issues.apache.org/jira/browse/SPARK-10106)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8303 from yu-iskw/SPARK-10106.
* [SPARK-10097] Adds `shouldMaximize` flag to `ml.evaluation.Evaluator`
  Feynman Liang, 2015-08-19 (10 files changed, -22/+52)
  Previously, users of evaluators (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in the evaluator, leading to a hacky solution in which metrics to be minimized were negated, causing erroneous negative values to be reported to the user. This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, telling users of `Evaluator` whether the chosen metric should be maximized or minimized. CC jkbradley
  Author: Feynman Liang <fliang@databricks.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8290 from feynmanliang/SPARK-10097.
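  A sketch of the flag on the base class (simplified relative to the real `org.apache.spark.ml.evaluation.Evaluator`):

```scala
import org.apache.spark.sql.DataFrame

abstract class Evaluator {
  def evaluate(dataset: DataFrame): Double

  // true when a larger metric is better (e.g. areaUnderROC); false when a
  // smaller one is (e.g. RMSE). Consumers flip their comparison instead of
  // negating metric values.
  def isLargerBetter: Boolean = true
}
```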