spark - Mirror of Apache Spark

	Commit message	Author	Age	Files	Lines
*	Preparing development version 1.4.1-SNAPSHOT	Patrick Wendell	2015-05-19	30	-30/+30
\|
*	Preparing Spark release v1.4.0-rc1	Patrick Wendell	2015-05-19	30	-30/+30
\|
*	Updating CHANGES.txt for Spark 1.4	Patrick Wendell	2015-05-19	1	-0/+70
\|
*	Revert "Preparing Spark release v1.4.0-rc1"	Patrick Wendell	2015-05-19	30	-30/+30
\| \| \| \|	This reverts commit 38ccef36c1551dc36d9444f47df11ae34c1e139e.
*	Revert "Preparing development version 1.4.1-SNAPSHOT"	Patrick Wendell	2015-05-19	30	-30/+30
\| \| \| \|	This reverts commit 40190ce22622cadd41f740a763fba061281c2966.
*	[SPARK-7581] [ML] [DOC] User guide for spark.ml PolynomialExpansion	Xusen Yin	2015-05-19	2	-0/+174
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581). CC jkbradley Author: Xusen Yin <yinxusen@gmail.com> Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits: 1a7d80d [Xusen Yin] merge with master 892a8e9 [Xusen Yin] fix python 3 compatibility ec935bf [Xusen Yin] small fix 3e9fa1d [Xusen Yin] delete note 69fcf85 [Xusen Yin] simplify and add python example 81d21dc [Xusen Yin] add programming guide for Polynomial Expansion 40babfb [Xusen Yin] add java test suite for PolynomialExpansion (cherry picked from commit 6008ec14ed6491d0a854bb50548c46f2f9709269) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
*	[HOTFIX] Fixing style failures in Kinesis source	Patrick Wendell	2015-05-19	2	-4/+6
\|
*	Preparing development version 1.4.1-SNAPSHOT	Patrick Wendell	2015-05-19	30	-30/+30
\|
*	Preparing Spark release v1.4.0-rc1	Patrick Wendell	2015-05-19	30	-30/+30
\|
*	Revert "Preparing Spark release v1.4.0-rc1"	Patrick Wendell	2015-05-18	30	-30/+30
\| \| \| \|	This reverts commit e8e97e3a630dea3c68702e26bc56f61044b2db71.
*	Revert "Preparing development version 1.4.1-SNAPSHOT"	Patrick Wendell	2015-05-18	30	-30/+30
\| \| \| \|	This reverts commit 758ca74bab7c342f94442f69476c6b9543ac1228.
*	[HOTFIX]: Java 6 Build Breaks	Patrick Wendell	2015-05-19	2	-15/+2
\| \| \| \|	These were blocking RC1 so I fixed them manually.
*	Preparing development version 1.4.1-SNAPSHOT	Patrick Wendell	2015-05-19	30	-30/+30
\|
*	Preparing Spark release v1.4.0-rc1	Patrick Wendell	2015-05-19	30	-30/+30
\|
*	[SPARK-7687] [SQL] DataFrame.describe() should cast all aggregates to String	Josh Rosen	2015-05-18	3	-14/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In `DataFrame.describe()`, the `count` aggregate produces an integer, the `avg` and `stdev` aggregates produce doubles, and `min` and `max` aggregates can produce varying types depending on what type of column they're applied to. As a result, we should cast all aggregate results to String so that `describe()`'s output types match its declared output schema. Author: Josh Rosen <joshrosen@databricks.com> Closes #6218 from JoshRosen/SPARK-7687 and squashes the following commits: 146b615 [Josh Rosen] Fix R test. 2974bd5 [Josh Rosen] Cast to string type instead f206580 [Josh Rosen] Cast to double to fix SPARK-7687 307ecbf [Josh Rosen] Add failing regression test for SPARK-7687 (cherry picked from commit c9fa870a6de3f7d0903fa7a75ea5ffb6a2fcd174) Signed-off-by: Reynold Xin <rxin@databricks.com>
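
The type mismatch described in this commit can be illustrated outside Spark. The following is a minimal sketch in plain Scala with no Spark dependency; `DescribeSketch` and `describe` are hypothetical names and this is not Spark's implementation. Each statistic has a different natural type (`Int` for the count, `Double` for the rest), so every value is rendered as a `String` to give the summary one uniform column type.

```scala
// Hypothetical illustration only; this is not Spark's DataFrame.describe().
object DescribeSketch {
  /** Returns (statistic, value) rows in which every value is a String. */
  def describe(xs: Seq[Double]): Seq[(String, String)] = {
    require(xs.nonEmpty, "need at least one value")
    val n = xs.size          // Int
    val mean = xs.sum / n    // Double
    // Population standard deviation; a real implementation may use the sample form.
    val stddev = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / n)
    Seq(
      "count"  -> n.toString,
      "mean"   -> mean.toString,
      "stddev" -> stddev.toString,
      "min"    -> xs.min.toString,
      "max"    -> xs.max.toString
    )
  }
}
```

Casting at the point where the rows are built keeps the declared schema (all strings) and the actual values in agreement, which is the property the commit restores.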
*	CHANGES.txt and changelist updates for Spark 1.4.	Patrick Wendell	2015-05-18	2	-2/+14608
\|
*	[SPARK-7150] SparkContext.range() and SQLContext.range()	Daoyuan Wang	2015-05-18	7	-0/+189
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR is based on #6081, thanks adrian-wang. Closes #6081 Author: Daoyuan Wang <daoyuan.wang@intel.com> Author: Davies Liu <davies@databricks.com> Closes #6230 from davies/range and squashes the following commits: d3ce5fe [Davies Liu] add tests 789eda5 [Davies Liu] add range() in Python 4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range cbf5200 [Daoyuan Wang] let's add python support in a separate PR f45e3b2 [Daoyuan Wang] remove redundant toLong 617da76 [Daoyuan Wang] fix safe marge for corner cases 867c417 [Daoyuan Wang] fix 13dbe84 [Daoyuan Wang] update bd998ba [Daoyuan Wang] update comments d3a0c1b [Daoyuan Wang] add range api() (cherry picked from commit c2437de1899e09894df4ec27adfaa7fac158fd3a) Signed-off-by: Reynold Xin <rxin@databricks.com>
*	Version updates for Spark 1.4.0	Patrick Wendell	2015-05-18	3	-3/+4
\|
*	[SPARK-7681] [MLLIB] Add SparseVector support for gemv	Liang-Chi Hsieh	2015-05-18	4	-33/+240
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	JIRA: https://issues.apache.org/jira/browse/SPARK-7681 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6209 from viirya/sparsevector_gemv and squashes the following commits: ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y. b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector. 57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4. 458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too. 054f05d [Liang-Chi Hsieh] Fix scala style. 410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized. 4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix. 5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix. (cherry picked from commit d03638cc2d414cee9ac7481084672e454495dfc1) Signed-off-by: Xiangrui Meng <meng@databricks.com>
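
For readers unfamiliar with gemv, the operation is `y := alpha * A * x + beta * y`. The sketch below shows it for a dense column-major matrix and a sparse vector. It is an assumption-laden illustration, not the MLlib code: `SparseGemvSketch` and its parameter layout are hypothetical. It also reflects the note in the squashed commits that `y` must still be handled when `beta` is 0.0, because that step is what clears the old contents of `y`.

```scala
// Hypothetical sketch; not the MLlib BLAS implementation.
object SparseGemvSketch {
  /**
   * Computes y := alpha * A * x + beta * y in place.
   * A is m x n and stored column-major in `a`; x is sparse, given by
   * `xIndices` (positions of non-zeros) and the matching `xValues`.
   */
  def gemv(alpha: Double, m: Int, n: Int, a: Array[Double],
           xIndices: Array[Int], xValues: Array[Double],
           beta: Double, y: Array[Double]): Unit = {
    require(a.length == m * n && y.length == m)
    require(xIndices.length == xValues.length)
    // Scale y first. With beta == 0.0 the old contents are cleared outright.
    if (beta == 0.0) java.util.Arrays.fill(y, 0.0)
    else {
      var i = 0
      while (i < m) { y(i) *= beta; i += 1 }
    }
    // Only the non-zero entries of x contribute: add alpha * x_j * A(:, j).
    var k = 0
    while (k < xIndices.length) {
      val j = xIndices(k)
      val scale = alpha * xValues(k)
      var r = 0
      while (r < m) { y(r) += scale * a(j * m + r); r += 1 }
      k += 1
    }
  }
}
```

Iterating over the stored entries of `x` rather than over all `n` columns is the point of sparse support: the cost scales with the number of non-zeros.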
*	[SPARK-7692] Updated Kinesis examples	Tathagata Das	2015-05-18	2	-237/+268
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Updated Kinesis examples to use stable API - Cleaned up comments, etc. - Renamed KinesisWordCountProducerASL to KinesisWordProducerASL Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #6249 from tdas/kinesis-examples and squashes the following commits: 7cc307b [Tathagata Das] More tweaks f080872 [Tathagata Das] More cleanup 841987f [Tathagata Das] Small update 011cbe2 [Tathagata Das] More fixes b0d74f9 [Tathagata Das] Updated examples. (cherry picked from commit 3a6003866ade45974b43a9e785ec35fb76a32b99) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
*	[SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners	jerluc	2015-05-18	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	PR per [SPARK-7621](https://issues.apache.org/jira/browse/SPARK-7621), which makes both `KafkaReceiver` and `ReliableKafkaReceiver` report their errors to the `ReceiverTracker`, which in turn adds the events to the bus to fire any registered `StreamingListener`s. Author: jerluc <jeremyalucas@gmail.com> Closes #6204 from jerluc/master and squashes the following commits: 82439a5 [jerluc] [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners (cherry picked from commit 0a7a94eab5fba3d2f2ef14a70c2c1bf4ee21b626) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
*	[SPARK-7624] Revert #4147	Davies Liu	2015-05-18	1	-21/+2
\| \| \| \| \| \| \| \| \| \| \|	Author: Davies Liu <davies@databricks.com> Closes #6172 from davies/revert_4147 and squashes the following commits: 3bfbbde [Davies Liu] Revert #4147 (cherry picked from commit 4fb52f9545ae338fae2d3aeea4bfc35d5df44853) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
*	[SQL] Fix serializability of ORC table scan	Michael Armbrust	2015-05-18	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	A follow-up to #6244. Author: Michael Armbrust <michael@databricks.com> Closes #6247 from marmbrus/fixOrcTests and squashes the following commits: e39ee1b [Michael Armbrust] [SQL] Fix serializability of ORC table scan (cherry picked from commit eb4632f282d070e1dfd5ffed968fa212896137da) Signed-off-by: Yin Huai <yhuai@databricks.com>
*	[SPARK-7501] [STREAMING] DAG visualization: show DStream operations	Andrew Or	2015-05-18	14	-145/+484
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is similar to #5999, but for streaming. Roughly 200 lines are tests. One thing to note here is that we already do some kind of scoping thing for call sites, so this patch adds the new RDD operation scoping logic in the same place. Also, this patch adds a `try finally` block to set the relevant variables in a safer way. tdas zsxwing ------------------------ Before <img src="https://cloud.githubusercontent.com/assets/2133137/7625996/d88211b8-f9b4-11e4-90b9-e11baa52d6d7.png" width="450px"/> -------------------------- After <img src="https://cloud.githubusercontent.com/assets/2133137/7625997/e0878f8c-f9b4-11e4-8df3-7dd611b13c87.png" width="650px"/> Author: Andrew Or <andrew@databricks.com> Closes #6034 from andrewor14/dag-viz-streaming and squashes the following commits: 932a64a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming e685df9 [Andrew Or] Rename createRDDWith 84d0656 [Andrew Or] Review feedback 697c086 [Andrew Or] Fix tests 53b9936 [Andrew Or] Set scopes for foreachRDD properly 1881802 [Andrew Or] Refactor DStream scope names again af4ba8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming fd07d22 [Andrew Or] Make MQTT lower case f6de871 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming 0ca1801 [Andrew Or] Remove a few unnecessary withScopes on aliases fa4e5fb [Andrew Or] Pass in input stream name rather than defining it from within 1af0b0e [Andrew Or] Fix style 074c00b [Andrew Or] Review comments d25a324 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming e4a93ac [Andrew Or] Fix tests? 
25416dc [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming 9113183 [Andrew Or] Add tests for DStream scopes b3806ab [Andrew Or] Fix test bb80bbb [Andrew Or] Fix MIMA? 5c30360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming 5703939 [Andrew Or] Rename operations that create InputDStreams 7c4513d [Andrew Or] Group RDDs by DStream operations and batches bf0ab6e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming 05c2676 [Andrew Or] Wrap many more methods in withScope c121047 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming 65ef3e9 [Andrew Or] Fix NPE a0d3263 [Andrew Or] Scope streaming operations instead of RDD operations (cherry picked from commit b93c97d79b42a06b48d2a8d98beccc636442541e) Signed-off-by: Andrew Or <andrew@databricks.com>
*	[HOTFIX] Fix ORC build break	Michael Armbrust	2015-05-18	1	-5/+6
\| \| \| \| \| \| \| \| \| \| \| \| \|	Fix break caused by merging #6225 and #6194. Author: Michael Armbrust <michael@databricks.com> Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits: b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break (cherry picked from commit fcf90b75ccf222bd2f1939addc3f8f052d2bd3ff) Signed-off-by: Andrew Or <andrew@databricks.com>
*	[SPARK-7658] [STREAMING] [WEBUI] Update the mouse behaviors for the timeline graphs	zsxwing	2015-05-18	3	-2/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. If the user clicks a point of a batch, scroll down to the corresponding batch row and highlight it, then restore the batch row after 3 seconds if necessary. 2. Add "#batches" in the histogram graphs. ![screen shot 2015-05-14 at 7 36 19 pm](https://cloud.githubusercontent.com/assets/1000778/7646108/84f4a014-fa73-11e4-8c13-1903d267e60f.png) ![screen shot 2015-05-14 at 7 36 53 pm](https://cloud.githubusercontent.com/assets/1000778/7646109/8b11154a-fa73-11e4-820b-8ece9fa6ee3e.png) ![screen shot 2015-05-14 at 7 36 34 pm](https://cloud.githubusercontent.com/assets/1000778/7646111/93828272-fa73-11e4-89f8-580670144d3c.png) Author: zsxwing <zsxwing@gmail.com> Closes #6168 from zsxwing/SPARK-7658 and squashes the following commits: c242b00 [zsxwing] Change 5 seconds to 3 seconds 31fd0aa [zsxwing] Remove the mouseover highlight feature 06c6f6f [zsxwing] Merge branch 'master' into SPARK-7658 2eaff06 [zsxwing] Merge branch 'master' into SPARK-7658 108d56c [zsxwing] Update the mouse behaviors for the timeline graphs (cherry picked from commit 0b6f503d5337a8387c37cc2c8e544f67c68f7dad) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
*	[SPARK-6216] [PYSPARK] check python version of worker with driver	Davies Liu	2015-05-18	10	-14/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR reverts #5404 and instead passes the Python version of the driver into the JVM and checks it in the worker before deserializing the closure, so that the check works across different major versions of Python. Author: Davies Liu <davies@databricks.com> Closes #6203 from davies/py_version and squashes the following commits: b8fb76e [Davies Liu] fix test 6ce5096 [Davies Liu] use string for version 47c6278 [Davies Liu] check python version of worker with driver (cherry picked from commit 32fbd297dd651ba3ce4ce52aeb0488233149cdf9) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
*	[SPARK-7673] [SQL] WIP: HadoopFsRelation and ParquetRelation2 performance optimizations	Cheng Lian	2015-05-18	4	-91/+117
	This PR introduces several performance optimizations to `HadoopFsRelation` and `ParquetRelation2`:

	1. Moving `FileStatus` listing from `DataSourceStrategy` into a cache within `HadoopFsRelation`. This new cache generalizes and replaces the one used in `ParquetRelation2`. This also introduces an interface change: to reuse cached `FileStatus` objects, `HadoopFsRelation.buildScan` methods now receive `Array[FileStatus]` instead of `Array[String]`.
	2. When Parquet task side metadata reading is enabled, skip reading row group information when reading Parquet footers. This is basically what PR #5334 does. Also, we now use `ParquetFileReader.readAllFootersInParallel` to read footers in parallel.

	Another optimization in question is, instead of asking `HadoopFsRelation.buildScan` to return an `RDD[Row]` for a single selected partition and then union them all, we ask it to return an `RDD[Row]` for all selected partitions. This optimization is based on the fact that Hadoop configuration broadcasting used in `NewHadoopRDD` takes 34% of the time in the following microbenchmark. However, this complicates data source user code because user code must merge partition values manually.

	To check the cost of broadcasting in `NewHadoopRDD`, I also ran the microbenchmark after removing the `broadcast` call in `NewHadoopRDD`. All results are shown below.

	### Microbenchmark

	#### Preparation code

	Generating a partitioned table with 5k partitions (500 batches of 10), 1k rows per partition:

	```scala
	import sqlContext._
	import sqlContext.implicits._

	for (n <- 0 until 500) {
	  val data = for {
	    p <- (n * 10) until ((n + 1) * 10)
	    i <- 0 until 1000
	  } yield (i, f"val_$i%04d", f"$p%04d")

	  data.
	    toDF("a", "b", "p").
	    write.
	    partitionBy("p").
	    mode("append").
	    parquet(path)
	}
	```

	#### Benchmarking code

	```scala
	import sqlContext._
	import sqlContext.implicits._
	import org.apache.spark.sql.types._
	import com.google.common.base.Stopwatch

	val path = "hdfs://localhost:9000/user/lian/5k"

	def benchmark(n: Int)(f: => Unit) {
	  val stopwatch = new Stopwatch()

	  def run() = {
	    stopwatch.reset()
	    stopwatch.start()
	    f
	    stopwatch.stop()
	    stopwatch.elapsedMillis()
	  }

	  val records = (0 until n).map(_ => run())

	  (0 until n).foreach(i => println(s"Round $i: ${records(i)} ms"))
	  println(s"Average: ${records.sum / n.toDouble} ms")
	}

	benchmark(3) { read.parquet(path).explain(extended = true) }
	```

	#### Results

	Before:

	```
	Round 0: 72528 ms
	Round 1: 68938 ms
	Round 2: 65372 ms
	Average: 68946.0 ms
	```

	After:

	```
	Round 0: 59499 ms
	Round 1: 53645 ms
	Round 2: 53844 ms
	Round 3: 49093 ms
	Round 4: 50555 ms
	Average: 53327.2 ms
	```

	Also removing Hadoop configuration broadcasting (note that I was testing on a local laptop, thus network cost is pretty low):

	```
	Round 0: 15806 ms
	Round 1: 14394 ms
	Round 2: 14699 ms
	Round 3: 15334 ms
	Round 4: 14123 ms
	Average: 14871.2 ms
	```

	Author: Cheng Lian <lian@databricks.com>

	Closes #6225 from liancheng/spark-7673 and squashes the following commits:

	2d58a2b [Cheng Lian] Skips reading row group information when using task side metadata reading
	7aa3748 [Cheng Lian] Optimizes FileStatusCache by introducing a map from parent directories to child files
	ba41250 [Cheng Lian] Reuses HadoopFsRelation FileStatusCache in ParquetRelation2
	3d278f7 [Cheng Lian] Fixes a bug when reading a single Parquet data file
	b84612a [Cheng Lian] Fixes Scala style issue
	6a08b02 [Cheng Lian] WIP: Moves file status cache into HadoopFSRelation

	(cherry picked from commit 9dadf019b93038e1e18336ccd06c5eecb4bae32f)
	Signed-off-by: Yin Huai <yhuai@databricks.com>
*	[SPARK-7567] [SQL] [follow-up] Use a new flag to set output committer based on mapreduce apis	Yin Huai	2015-05-18	4	-9/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cc liancheng marmbrus Author: Yin Huai <yhuai@databricks.com> Closes #6130 from yhuai/directOutput and squashes the following commits: 312b07d [Yin Huai] A data source can use spark.sql.sources.outputCommitterClass to override the output committer. (cherry picked from commit 530397ba2f5c0fcabb86ba73048c95177ed0b9fc) Signed-off-by: Michael Armbrust <michael@databricks.com>
*	[SPARK-7269] [SQL] Incorrect analysis for aggregation(use semanticEquals)	Wenchen Fan	2015-05-18	6	-26/+48
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A modified version of https://github.com/apache/spark/pull/6110, use `semanticEquals` to make it more efficient. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #6173 from cloud-fan/7269 and squashes the following commits: e4a3cc7 [Wenchen Fan] address comments cc02045 [Wenchen Fan] consider elements length equal d7ff8f4 [Wenchen Fan] fix 7269 (cherry picked from commit 103c863c2ef3d9e6186cfc7d95251a9515e9f180) Signed-off-by: Michael Armbrust <michael@databricks.com>
*	[SPARK-7631] [SQL] treenode argString should not print children	scwf	2015-05-18	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	spark-sql> > explain extended > select * from ( > select key from src union all > select key from src) t; now the spark plan will print children in argString ``` == Physical Plan == Union[ HiveTableScan key#1, (MetastoreRelation default, src, None), None, HiveTableScan key#3, (MetastoreRelation default, src, None), None] HiveTableScan key#1, (MetastoreRelation default, src, None), None HiveTableScan key#3, (MetastoreRelation default, src, None), None ``` after this patch: ``` == Physical Plan == Union HiveTableScan [key#1], (MetastoreRelation default, src, None), None HiveTableScan [key#3], (MetastoreRelation default, src, None), None ``` I have tested this locally Author: scwf <wangfei1@huawei.com> Closes #6144 from scwf/fix-argString and squashes the following commits: 1a642e0 [scwf] fix treenode argString (cherry picked from commit fc2480ed13742a99470b5012ca3a75ab91e5a5e5) Signed-off-by: Michael Armbrust <michael@databricks.com>
*	[SPARK-2883] [SQL] ORC data source for Spark SQL	Zhan Zhang	2015-05-18	14	-76/+1477
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR updates PR #6135 authored by zhzhan from Hortonworks. ---- This PR implements a Spark SQL data source for accessing ORC files. > NOTE > > Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`. However, it doesn't require existing Hive installation to access ORC files. 1. Saving/loading ORC files without contacting Hive metastore 1. Support for complex data types (i.e. array, map, and struct) 1. Aware of common optimizations provided by Spark SQL: - Column pruning - Partitioning pruning - Filter push-down 1. Schema evolution support 1. Hive metastore table conversion This PR also include initial work done by scwf from Huawei (PR #3753). Author: Zhan Zhang <zhazhan@gmail.com> Author: Cheng Lian <lian@databricks.com> Closes #6194 from liancheng/polishing-orc and squashes the following commits: 55ecd96 [Cheng Lian] Reorganizes ORC test suites d4afeed [Cheng Lian] Addresses comments 21ada22 [Cheng Lian] Adds @since and @Experimental annotations 128bd3b [Cheng Lian] ORC filter bug fix d734496 [Cheng Lian] Polishes the ORC data source 2650a42 [Zhan Zhang] resolve review comments 3c9038e [Zhan Zhang] resolve review comments 7b3c7c5 [Zhan Zhang] save mode fix f95abfd [Zhan Zhang] reuse test suite 7cc2c64 [Zhan Zhang] predicate fix 4e61c16 [Zhan Zhang] minor change 305418c [Zhan Zhang] orc data source support (cherry picked from commit aa31e431fc09f0477f1c2351c6275769a31aca90) Signed-off-by: Michael Armbrust <michael@databricks.com>
*	[SPARK-7380] [MLLIB] pipeline stages should be copyable in Python	Xiangrui Meng	2015-05-18	16	-261/+498
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes: 1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively. 2. Accept a list of param maps in `fit`. 3. Use parent uid and name to identify param. jkbradley Author: Xiangrui Meng <meng@databricks.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #6088 from mengxr/SPARK-7380 and squashes the following commits: 413c463 [Xiangrui Meng] remove unnecessary doc 4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380 611c719 [Xiangrui Meng] fix python style 68862b8 [Xiangrui Meng] update _java_obj initialization 927ad19 [Xiangrui Meng] fix ml/tests.py 0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer 9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params 7e0d27f [Xiangrui Meng] merge master 46840fb [Xiangrui Meng] update wrappers b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap 46cb6ed [Xiangrui Meng] merge master a163413 [Xiangrui Meng] fix style 1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380 9630eae [Xiangrui Meng] fix Identifiable._randomUID 13bd70a [Xiangrui Meng] update ml/tests.py 64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl 02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python 66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui 7431272 [Joseph K. Bradley] Rebased with master (cherry picked from commit 9c7e802a5a2b8cd3eb77642f84c54a8e976fc996) Signed-off-by: Xiangrui Meng <meng@databricks.com>
*	[SQL] [MINOR] [THIS] use private for internal field in ScalaUdf	Wenchen Fan	2015-05-18	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	Author: Wenchen Fan <cloud0fan@outlook.com> Closes #6235 from cloud-fan/tmp and squashes the following commits: 8f16367 [Wenchen Fan] use private[this] (cherry picked from commit 56ede88485cfca90974425fcb603b257be47229b) Signed-off-by: Michael Armbrust <michael@databricks.com>
*	[SPARK-7570] [SQL] Ignores _temporary during partition discovery	Cheng Lian	2015-05-18	2	-19/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	<!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6091) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #6091 from liancheng/spark-7570 and squashes the following commits: 8ff07e8 [Cheng Lian] Ignores _temporary during partition discovery (cherry picked from commit 010a1c278037130a69dcc79427d2b0380a2c82d8) Signed-off-by: Michael Armbrust <michael@databricks.com>
*	[SPARK-6888] [SQL] Make the jdbc driver handling user-definable	Rene Treffer	2015-05-18	6	-126/+295
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect) and allow developers to change the dialects on the fly (for new JDBCRDDs only). Some types (like an unsigned 64-bit number) cannot be trivially mapped to Java. The status quo is that the RDD will fail to load. This patch makes it possible to override the type mapping to read e.g. 64-bit numbers as strings and handle them afterwards in software. JDBCSuite has an example that maps all types to String, which should always work (at the cost of extra code afterwards). As a side effect it should now be possible to develop simple dialects out-of-tree and even with spark-shell. Author: Rene Treffer <treffer@measite.de> Closes #5555 from rtreffer/jdbc-dialects and squashes the following commits: 3cbafd7 [Rene Treffer] [SPARK-6888] ignore classes belonging to changed API in MIMA report fe7e2e8 [Rene Treffer] [SPARK-6888] Make the jdbc driver handling user-definable
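
The idea of a user-replaceable dialect can be sketched without Spark. Everything below is hypothetical and simplified: the `Dialect` trait, the `ColumnType` values, and the registry are stand-ins for illustration, and Spark's real `JdbcDialect` API differs. The assumption that the driver reports the type name `BIGINT UNSIGNED` is also illustrative.

```scala
object DialectSketch {
  // Hypothetical, simplified stand-ins; Spark's real JdbcDialect API differs.
  sealed trait ColumnType
  case object StringColumn extends ColumnType
  case object LongColumn extends ColumnType

  trait Dialect {
    def canHandle(url: String): Boolean
    /** Returns Some(type) to override the default mapping, None to keep it. */
    def columnType(sqlType: Int, typeName: String): Option[ColumnType]
  }

  /** Reads unsigned 64-bit columns as strings so that loading does not fail. */
  object UnsignedAsString extends Dialect {
    def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
    def columnType(sqlType: Int, typeName: String): Option[ColumnType] =
      if (sqlType == java.sql.Types.BIGINT && typeName.toUpperCase.contains("UNSIGNED"))
        Some(StringColumn)
      else None
  }

  // Mutable registry: dialects can be swapped at run time.
  private var dialects: List[Dialect] = List(UnsignedAsString)

  def register(d: Dialect): Unit = dialects = d :: dialects.filterNot(_ == d)

  def resolve(url: String, sqlType: Int, typeName: String): Option[ColumnType] =
    dialects.find(_.canHandle(url)).flatMap(_.columnType(sqlType, typeName))
}
```

Returning `None` for "no override" lets a small dialect customize one type and fall back to the default mapping for the rest.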
*	[SPARK-7627] [SPARK-7472] DAG visualization: style skipped stages	Andrew Or	2015-05-18	6	-108/+352
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes two things: SPARK-7627. Cached RDDs no longer light up on the job page. This is a simple fix. SPARK-7472. Display skipped stages differently from normal stages. The latter is a major UX issue. Because we link the job viz to the stage viz even for skipped stages, the user may inadvertently click into the stage page of a skipped stage, which is empty. ------------------- <img src="https://cloud.githubusercontent.com/assets/2133137/7675241/de1a3da6-fcea-11e4-8101-88055cef78c5.png" width="300px" /> Author: Andrew Or <andrew@databricks.com> Closes #6171 from andrewor14/dag-viz-skipped and squashes the following commits: f261797 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped 0eda358 [Andrew Or] Tweak skipped stage border color c604150 [Andrew Or] Tweak grayscale colors 7010676 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped 762b541 [Andrew Or] Use special prefix for stage clusters to avoid collisions 51c95b9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped b928cd4 [Andrew Or] Fix potential leak + write tests for it 7c4c364 [Andrew Or] Show skipped stages differently 7cc34ce [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped c121fa2 [Andrew Or] Fix cache color (cherry picked from commit 563bfcc1ab1b1c79b1845230c8c600db85a08fe3) Signed-off-by: Andrew Or <andrew@databricks.com>
*	[SPARK-7272] [MLLIB] User guide for PMML model export	Vincenzo Selvaggio	2015-05-18	2	-0/+87
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-7272 Author: Vincenzo Selvaggio <vselvaggio@hotmail.it> Closes #6219 from selvinsource/mllib_pmml_model_export_SPARK-7272 and squashes the following commits: c866fb8 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md 1beda98 [Vincenzo Selvaggio] [SPARK-7272] Initial user guide for pmml export d670662 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md 2731375 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md 680dc33 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md 2e298b5 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md a932f51 [Vincenzo Selvaggio] Create mllib-pmml-model-export.md (cherry picked from commit 814b3dabdf01abc7a2f25aa32284caccadeb7798) Signed-off-by: Xiangrui Meng <meng@databricks.com>
*	[SPARK-6657] [PYSPARK] Fix doc warnings	Xiangrui Meng	2015-05-18	4	-10/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fixed the following warnings in `make clean html` under `python/docs`: ~~~ /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:3: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:4: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:3: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:4: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.replace:16: WARNING: Field list ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:8: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:9: WARNING: Block quote ends without a blank line; unexpected unindent. ~~~ davies Author: Xiangrui Meng <meng@databricks.com> Closes #6221 from mengxr/SPARK-6657 and squashes the following commits: e3f83fe [Xiangrui Meng] fix sql and streaming doc warnings 2b4371e [Xiangrui Meng] fix mllib python doc warnings (cherry picked from commit 1ecfac6e387b0934bfb5a9bbb4ad74b81ec210a4) Signed-off-by: Xiangrui Meng <meng@databricks.com>
*	[SPARK-7299][SQL] Set precision and scale for Decimal according to JDBC metadata instead of returned BigDecimal	Liang-Chi Hsieh	2015-05-18	1	-4/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	JIRA: https://issues.apache.org/jira/browse/SPARK-7299 When connecting to an Oracle database through JDBC, the precision and scale of the `BigDecimal` object returned by `ResultSet.getBigDecimal` do not correctly match the table schema reported by `ResultSetMetaData.getPrecision` and `ResultSetMetaData.getScale`. So if you insert a value like `19999` into a column of type `NUMBER(12, 2)`, you get back a `BigDecimal` object with a scale of 0, while the DataFrame schema has the correct type `DecimalType(12, 2)`. Thus, after you save the DataFrame to a Parquet file and then read it back, you get the wrong result `199.99`. Because the problem is reported only on JDBC connections to Oracle, it might be difficult to add a test case for it. But according to the user's test on JIRA, this patch solves the problem. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #5833 from viirya/jdbc_decimal_precision and squashes the following commits: 69bc2b5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into jdbc_decimal_precision 928f864 [Liang-Chi Hsieh] Add comments. 5f9da94 [Liang-Chi Hsieh] Set up Decimal's precision and scale according to table schema instead of returned BigDecimal. (cherry picked from commit e32c0f69f38ad729e25c2d5f90eb73b4453f8279) Signed-off-by: Reynold Xin <rxin@databricks.com>
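
The `19999` to `199.99` corruption follows from how a decimal is stored as unscaled digits plus a scale. The sketch below reproduces the arithmetic with `java.math.BigDecimal` only; `DecimalScaleSketch`, `misread`, and `withSchemaScale` are hypothetical names, and this is an illustration of the failure mode rather than the code in the patch.

```scala
import java.math.{BigDecimal => JBigDecimal}

object DecimalScaleSketch {
  /** Reinterprets the unscaled digits of `d` at `scale`: the failure mode. */
  def misread(d: JBigDecimal, scale: Int): JBigDecimal =
    new JBigDecimal(d.unscaledValue, scale)

  /**
   * Rescales `d` to the schema's scale before it is stored. setScale throws
   * ArithmeticException if digits would be lost, so no value changes silently.
   */
  def withSchemaScale(d: JBigDecimal, scale: Int): JBigDecimal =
    d.setScale(scale)
}
```

A value returned with scale 0 has unscaled digits `19999`; writing those digits under a schema that assumes scale 2 reads back as `199.99`. Rescaling first yields unscaled digits `1999900`, which round-trips as `19999.00`.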
*	[SPARK-7694] [MLLIB] Use getOrElse for getting the threshold of LR model	Shuo Xiang	2015-05-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The `toString` method of `LogisticRegressionModel` calls the `get` method on an Option (threshold) without a safeguard. In spark-shell, the following code `val model = algorithm.run(data).clearThreshold()` in the lbfgs code will fail, as the `toString` method is called right after `clearThreshold()` to show the result in the REPL. Author: Shuo Xiang <shuoxiangpub@gmail.com> Closes #6224 from coderxiang/getorelse and squashes the following commits: d5f53c9 [Shuo Xiang] use getOrElse for getting the threshold of LR model 5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master' 98804c9 [Shuo Xiang] fix bug in topBykey and update test (cherry picked from commit 775e6f9909d4495cbc11c377508b43482d782742) Signed-off-by: Xiangrui Meng <meng@databricks.com>
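
The failure is easy to reproduce in miniature. In the sketch below, `Model` is a hypothetical stand-in for a model with an optional threshold, not `LogisticRegressionModel` itself, and the fallback text `"None"` is an assumption chosen for illustration.

```scala
object ThresholdSketch {
  // Hypothetical stand-in for a model with an optional threshold.
  final case class Model(threshold: Option[Double]) {
    def clearThreshold(): Model = copy(threshold = None)

    // threshold.get would throw NoSuchElementException once the threshold
    // has been cleared; getOrElse supplies a printable fallback instead.
    override def toString: String = s"threshold = ${threshold.getOrElse("None")}"
  }
}
```

Because the REPL calls `toString` on every result it echoes, an unguarded `get` there turns a valid call such as `clearThreshold()` into an exception at the prompt.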
*	[SPARK-7693][Core] Remove "import ↵	zsxwing	2015-05-17	6	-26/+58
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	scala.concurrent.ExecutionContext.Implicits.global" Learnt a lesson from SPARK-7655: Spark should avoid using `scala.concurrent.ExecutionContext.Implicits.global` because the user may submit blocking actions to `scala.concurrent.ExecutionContext.Implicits.global` and exhaust all threads in it. This could crash Spark. So Spark should always use its own thread pools for safety. This PR removes all usages of `scala.concurrent.ExecutionContext.Implicits.global` and replaces them with proper thread pools. Author: zsxwing <zsxwing@gmail.com> Closes #6223 from zsxwing/SPARK-7693 and squashes the following commits: a33ff06 [zsxwing] Decrease the max thread number from 1024 to 128 cf4b3fc [zsxwing] Remove "import scala.concurrent.ExecutionContext.Implicits.global" (cherry picked from commit ff71d34e00b64d70f671f9bf3e63aec39cd525e5) Signed-off-by: Reynold Xin <rxin@databricks.com>
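The general idea, running a component's work on a pool it owns rather than on a process-wide shared one, can be sketched with `java.util.concurrent`. This is only an illustration under stated assumptions: `DedicatedPoolDemo` and `runOnOwnPool` are hypothetical names, the 5-second timeout is arbitrary, and the code does not reflect how Spark sizes or names its pools.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class; not part of Spark.
class DedicatedPoolDemo {
    // Runs `task` on a private, bounded pool instead of a shared global one,
    // so blocking work submitted elsewhere cannot starve this task of threads.
    static <T> T runOnOwnPool(int maxThreads, Callable<T> task) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            // Wait for the result with a timeout so a stuck task cannot hang the caller forever.
            return pool.submit(task).get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new RuntimeException("task failed or timed out", e);
        } finally {
            pool.shutdown(); // release the worker threads
        }
    }
}
```

For brevity the pool is created and shut down per call. A long-lived component would more likely keep the pool in a field and shut it down when the component stops.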
*	[SQL] [MINOR] use catalyst type converter in ScalaUdf	Wenchen Fan	2015-05-17	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	This is a follow-up to https://github.com/apache/spark/pull/5154. We can speed up Scala UDF evaluation by creating the type converter in advance. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #6182 from cloud-fan/tmp and squashes the following commits: 241cfe9 [Wenchen Fan] use converter in ScalaUdf (cherry picked from commit 2f22424e9f6624097b292cb70e00787b69d80718) Signed-off-by: Yin Huai <yhuai@databricks.com>
*	[SPARK-6514] [SPARK-5960] [SPARK-6656] [SPARK-7679] [STREAMING] [KINESIS] ↵	Tathagata Das	2015-05-17	6	-120/+348
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Updates to the Kinesis API SPARK-6514 - Use correct region; SPARK-5960 - Allow AWS Credentials to be directly passed; SPARK-6656 - Specify Kinesis application name explicitly; SPARK-7679 - Upgrade to latest KCL and AWS SDK. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #6147 from tdas/kinesis-api-update and squashes the following commits: f23ea77 [Tathagata Das] Updated versions and updated APIs 373b201 [Tathagata Das] Updated Kinesis API (cherry picked from commit ca4257aec658aaa87f4f097dd7534033d5f13ddc) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
*	[SPARK-7491] [SQL] Allow configuration of classloader isolation for hive	Michael Armbrust	2015-05-17	3	-10/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Author: Michael Armbrust <michael@databricks.com> Closes #6167 from marmbrus/configureIsolation and squashes the following commits: 6147cbe [Michael Armbrust] filter other conf 22cc3bc7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into configureIsolation 07476ee [Michael Armbrust] filter empty prefixes dfdf19c [Michael Armbrust] [SPARK-6906][SQL] Allow configuration of classloader isolation for hive (cherry picked from commit 2ca60ace8f42cf0bd4569d86c86c37a8a2b6a37c) Signed-off-by: Michael Armbrust <michael@databricks.com>
*	[SPARK-7686] [SQL] DescribeCommand is assigned wrong output attributes in ↵	Josh Rosen	2015-05-17	2	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	SparkStrategies In `SparkStrategies`, `RunnableDescribeCommand` is called with the output attributes of the table being described rather than the attributes for the `describe` command's output. I discovered this issue because it caused type conversion errors in some UnsafeRow conversion code that I'm writing. Author: Josh Rosen <joshrosen@databricks.com> Closes #6217 from JoshRosen/SPARK-7686 and squashes the following commits: 953a344 [Josh Rosen] Fix SPARK-7686 with a simple change in SparkStrategies. a4eec9f [Josh Rosen] Add failing regression test for SPARK-7686 (cherry picked from commit 564562874f589c4c8bcabcd9d6eb9a6b0eada938) Signed-off-by: Reynold Xin <rxin@databricks.com>
*	[SPARK-7660] Wrap SnappyOutputStream to work around snappy-java bug	Josh Rosen	2015-05-17	2	-10/+47
\| \| \| \| \| \| \| \| \| \| \| \| \|	This patch wraps `SnappyOutputStream` to ensure that `close()` is idempotent and to guard against write-after-`close()` bugs. This is a workaround for https://github.com/xerial/snappy-java/issues/107, a bug where a non-idempotent `close()` method can lead to stream corruption. We can remove this workaround if we upgrade to a snappy-java version that contains my fix for this bug, but in the meantime this patch offers a backportable Spark fix. Author: Josh Rosen <joshrosen@databricks.com> Closes #6176 from JoshRosen/SPARK-7660-wrap-snappy and squashes the following commits: 8b77aae [Josh Rosen] Wrap SnappyOutputStream to fix SPARK-7660 (cherry picked from commit f2cc6b5bccc3a70fd7d69183b1a068800831fe19) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
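The wrapping technique described here, making `close()` idempotent and rejecting writes after close, can be sketched for any `java.io.OutputStream`. This is a minimal sketch rather than the wrapper added by the patch: the class name `CloseOnceOutputStream` is hypothetical, and throwing `IOException` on write-after-close is an assumption about what the guard does.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical wrapper; illustrates the technique, not the actual Spark class.
class CloseOnceOutputStream extends OutputStream {
    private final OutputStream delegate;
    private boolean closed = false;

    CloseOnceOutputStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    // Fails fast instead of letting a write reach an already-closed delegate.
    private void ensureOpen() throws IOException {
        if (closed) {
            throw new IOException("Stream is closed");
        }
    }

    @Override
    public void write(int b) throws IOException {
        ensureOpen();
        delegate.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        ensureOpen();
        delegate.write(b, off, len);
    }

    @Override
    public void flush() throws IOException {
        ensureOpen();
        delegate.flush();
    }

    // Forwards close() to the delegate at most once; later calls are no-ops.
    @Override
    public void close() throws IOException {
        if (!closed) {
            closed = true;
            delegate.close();
        }
    }
}
```

Because the flag is set before delegating, a second `close()` never reaches the underlying stream even if the first call threw. The sketch is not thread-safe. Callers that share a stream across threads would need synchronization.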
*	[SPARK-7669] Builds against Hadoop 2.6+ get inconsistent curator depend…	Steve Loughran	2015-05-17	1	-2/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds a new profile, `hadoop-2.6`, copying over the hadoop-2.4 properties, updating ZooKeeper to 3.4.6, and making the Curator version a configurable option. That keeps the curator-recipes JAR in sync with the one used in Hadoop. There's one more option to consider: making the full curator-client version explicit with its own dependency version. This will pin down the version brought in by the Hadoop and Hive imports. Author: Steve Loughran <stevel@hortonworks.com> Closes #6191 from steveloughran/stevel/SPARK-7669-hadoop-2.6 and squashes the following commits: e3e281a [Steve Loughran] SPARK-7669 declare the version of curator-client and curator-framework JARs 2901ea9 [Steve Loughran] SPARK-7669 Builds against Hadoop 2.6+ get inconsistent curator dependencies (cherry picked from commit 50217667cc1239ed3b15f4d10907b727ed85d7fa) Signed-off-by: Sean Owen <sowen@cloudera.com>
*	[SPARK-7447] [SQL] Don't re-merge Parquet schema when the relation is ↵	Liang-Chi Hsieh	2015-05-17	1	-14/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	deserialized JIRA: https://issues.apache.org/jira/browse/SPARK-7447 `MetadataCache` in `ParquetRelation2` is annotated as `transient`. When `ParquetRelation2` is deserialized, we ask `MetadataCache` to refresh and perform schema merging again. This is time-consuming, especially when there are many Parquet files. With the new `FSBasedParquetRelation`, although `MetadataCache` is no longer `transient`, `MetadataCache.refresh()` still performs schema merging again when the relation is deserialized. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6012 from viirya/without_remerge_schema and squashes the following commits: 2663957 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into without_remerge_schema 6ac7d93 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into without_remerge_schema b0fc09b [Liang-Chi Hsieh] Don't generate and merge parquetSchema multiple times. (cherry picked from commit 339905578790fa37fcad9684b859b443313a5aa2) Signed-off-by: Cheng Lian <lian@databricks.com>
*	[MINOR] Add 1.3, 1.3.1 to master branch EC2 scripts	Shivaram Venkataraman	2015-05-17	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cc pwendell P.S.: I can't believe this was outdated all along? Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #6215 from shivaram/update-ec2-map and squashes the following commits: ae3937a [Shivaram Venkataraman] Add 1.3, 1.3.1 to master branch EC2 scripts (cherry picked from commit 1a7b9ce80bb5649796dda48d6a6d662a2809d0ef) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>