Commit message | Author | Age | Files | Lines
* [SPARK-7579] [ML] [DOC] User guide update for OneHotEncoder | Sandy Ryza | 2015-05-20 | 1 | -0/+95
|
| Author: Sandy Ryza <sandy@cloudera.com>
|
| Closes #6126 from sryza/sandy-spark-7579 and squashes the following commits:
|
| 5af803d [Sandy Ryza] SPARK-7579 [MLLIB] User guide update for OneHotEncoder
* [SPARK-7537] [MLLIB] spark.mllib API updates | Xiangrui Meng | 2015-05-20 | 2 | -0/+13
|
| Minor updates to the spark.mllib APIs:
|
| 1. Add `DeveloperApi` to `PMMLExportable` and add `Experimental` to `toPMML` methods.
| 2. Mention `RankingMetrics.of` in the `RankingMetrics` constructor.
|
| Author: Xiangrui Meng <meng@databricks.com>
|
| Closes #6280 from mengxr/SPARK-7537 and squashes the following commits:
|
| 1bd2583 [Xiangrui Meng] organize imports
| 94afa7a [Xiangrui Meng] mark all toPMML methods experimental
| 4c40da1 [Xiangrui Meng] mention the factory method for RankingMetrics for Java users
| 88c62d0 [Xiangrui Meng] add DeveloperApi to PMMLExportable
* [SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan. | Yin Huai | 2015-05-20 | 4 | -48/+387
|
| https://issues.apache.org/jira/browse/SPARK-7713
|
| I tested the performance with the following code:
| ```scala
| import sqlContext._
| import sqlContext.implicits._
|
| (1 to 5000).foreach { i =>
|   val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
| }
|
| sqlContext.sql("""
| CREATE TEMPORARY TABLE partitionedParquet
| USING org.apache.spark.sql.parquet
| OPTIONS (
|   path '/tmp/partitioned'
| )""")
|
| table("partitionedParquet").explain(true)
| ```
|
| On master, `explain` takes 40s on my laptop. With this PR, `explain` takes 14s.
|
| Author: Yin Huai <yhuai@databricks.com>
|
| Closes #6252 from yhuai/broadcastHadoopConf and squashes the following commits:
|
| 6fa73df [Yin Huai] Address comments of Josh and Andrew.
| 807fbf9 [Yin Huai] Make the new buildScan and SqlNewHadoopRDD private sql.
| e393555 [Yin Huai] Cheng's comments.
| 2eb53bb [Yin Huai] Use a shared broadcast Hadoop Configuration for partitioned HadoopFsRelations.
* [SPARK-6094] [MLLIB] Add MultilabelMetrics in PySpark/MLlib | Yanbo Liang | 2015-05-20 | 2 | -0/+125
|
| Add MultilabelMetrics in PySpark/MLlib
|
| Author: Yanbo Liang <ybliang8@gmail.com>
|
| Closes #6276 from yanboliang/spark-6094 and squashes the following commits:
|
| b8e3343 [Yanbo Liang] Add MultilabelMetrics in PySpark/MLlib
* [SPARK-7654] [MLLIB] Migrate MLlib to the DataFrame reader/writer API | Xiangrui Meng | 2015-05-20 | 10 | -12/+12
|
| parquetFile -> read.parquet
|
| rxin
|
| Author: Xiangrui Meng <meng@databricks.com>
|
| Closes #6281 from mengxr/SPARK-7654 and squashes the following commits:
|
| a79b612 [Xiangrui Meng] parquetFile -> read.parquet
* [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats. | ehnalis | 2015-05-20 | 2 | -10/+39
|
| Added faster RM heartbeats while container allocations are pending, with multiplicative back-off.
| Also updated the related documentation.
|
| Author: ehnalis <zoltan.zvara@gmail.com>
|
| Closes #6082 from ehnalis/yarn and squashes the following commits:
|
| a1d2101 [ehnalis] MIss-spell fixed.
| 90f8ba4 [ehnalis] Changed default HB values.
| 6120295 [ehnalis] Removed the bug, when allocation heartbeat would not start from initial value.
| 08bac63 [ehnalis] Refined style, grammar, removed duplicated code.
| 073d283 [ehnalis] [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats.
| d4408c9 [ehnalis] [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats.
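The multiplicative back-off mentioned above can be sketched in plain Scala. This is only an illustration: the object name, the 200 ms starting interval, the 3000 ms cap, and the doubling factor are assumptions chosen for the example, not values taken from the patch.

```scala
object HeartbeatBackoff {
  // Hypothetical values for illustration; the patch's real defaults may differ.
  val initialIntervalMs: Long = 200L
  val maxIntervalMs: Long = 3000L

  /** Waits between successive heartbeats while container requests stay pending. */
  def schedule(heartbeats: Int): List[Long] =
    Iterator
      .iterate(initialIntervalMs)(i => math.min(i * 2, maxIntervalMs)) // double, then cap
      .take(heartbeats)
      .toList
}
```

Starting from a short interval and doubling it keeps allocation latency low right after a request, while the cap stops the ResourceManager from being polled at the fast rate indefinitely.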
* [SPARK-7320] [SQL] Add Cube / Rollup for dataframe | Cheng Hao | 2015-05-20 | 3 | -28/+230
|
| Add `cube` & `rollup` for DataFrame
|
| For example:
| ```scala
| testData.rollup($"a" + $"b", $"b").agg(sum($"a" - $"b"))
| testData.cube($"a" + $"b", $"b").agg(sum($"a" - $"b"))
| ```
|
| Author: Cheng Hao <hao.cheng@intel.com>
|
| Closes #6257 from chenghao-intel/rollup and squashes the following commits:
|
| 7302319 [Cheng Hao] cancel the implicit keyword
| a66e38f [Cheng Hao] remove the unnecessary code changes
| a2869d4 [Cheng Hao] update the code as comments
| c441777 [Cheng Hao] update the code as suggested
| 84c9564 [Cheng Hao] Remove the CubedData & RollupedData
| 279584c [Cheng Hao] hiden the CubedData & RollupedData
| ef357e1 [Cheng Hao] Add Cube / Rollup for dataframe
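For readers unfamiliar with the two operators, the difference lies in which grouping sets they aggregate over. The sketch below is plain Scala with no Spark dependency; `GroupingSets` and its methods are hypothetical names used only to illustrate the standard SQL semantics, not code from this patch.

```scala
object GroupingSets {
  /** ROLLUP(c1, ..., cn): every prefix of the column list, down to the grand total. */
  def rollup[A](cols: List[A]): List[List[A]] = cols.inits.toList

  /** CUBE(c1, ..., cn): every subset of the column list. */
  def cube[A](cols: List[A]): List[List[A]] =
    cols.foldRight(List(List.empty[A]))((c, acc) => acc.map(c :: _) ++ acc)
}
```

With two grouping columns, `rollup` yields three grouping sets and `cube` yields four; the extra one groups by the second column alone.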
* [SPARK-7663] [MLLIB] Add requirement for word2vec model | Xusen Yin | 2015-05-20 | 1 | -0/+3
|
| JIRA issue [link](https://issues.apache.org/jira/browse/SPARK-7663).
|
| We should check the model size of word2vec to guard against an unexpectedly empty model.
|
| CC srowen.
|
| Author: Xusen Yin <yinxusen@gmail.com>
|
| Closes #6228 from yinxusen/SPARK-7663 and squashes the following commits:
|
| 21770c5 [Xusen Yin] check the vocab size
| 54ae63e [Xusen Yin] add requirement for word2vec model
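A fail-fast check of this kind might look like the sketch below. It assumes the vocabulary can end up empty because a minimum-count filter removes every word; `VocabCheck`, `buildVocab`, and the `minCount` parameter are hypothetical names for illustration and are not taken from the patch.

```scala
object VocabCheck {
  /** Keeps words seen at least `minCount` times and fails fast if none survive. */
  def buildVocab(words: Seq[String], minCount: Int): Map[String, Int] = {
    val vocab = words
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
      .filter { case (_, count) => count >= minCount }
    // Reject an empty vocabulary up front instead of training an empty model.
    require(vocab.nonEmpty, s"The vocabulary is empty; minCount=$minCount may be too large.")
    vocab
  }
}
```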
* [SPARK-7656] [SQL] use CatalystConf in FunctionRegistry | scwf | 2015-05-19 | 3 | -7/+9
|
| follow up for #5806
|
| Author: scwf <wangfei1@huawei.com>
|
| Closes #6164 from scwf/FunctionRegistry and squashes the following commits:
|
| 15e6697 [scwf] use catalogconf in FunctionRegistry
* [SPARK-7744] [DOCS] [MLLIB] "Distributed matrix" section in MLlib "Data Types" documentation should be reordered. | Mike Dusenberry | 2015-05-19 | 1 | -64/+64
|
| The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the later three types, and RowMatrix is considered the "basic" distributed matrix. This will improve comprehensibility of the "Distributed matrix" section, especially for the new reader.
|
| Author: Mike Dusenberry <dusenberrymw@gmail.com>
|
| Closes #6270 from dusenberrymw/Reorder_MLlib_Data_Types_Distributed_matrix_docs and squashes the following commits:
|
| 6313bab [Mike Dusenberry] The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the later three types, and RowMatrix is considered the "basic" distributed matrix. This will improve comprehensibility of the "Distributed matrix" section, especially for the new reader.
* [SPARK-6246] [EC2] fixed support for more than 100 nodes | alyaxey | 2015-05-19 | 1 | -1/+5
|
| This is a small fix, but it is important for Amazon users because, as the ticket states, "spark-ec2 can't handle clusters with > 100 nodes" now.
|
| Author: alyaxey <oleksii.sliusarenko@grammarly.com>
|
| Closes #6267 from alyaxey/ec2_100_nodes_fix and squashes the following commits:
|
| 1e0d747 [alyaxey] [SPARK-6246] fixed support for more than 100 nodes
* [SPARK-7662] [SQL] Resolve correct names for generator in projection | Cheng Hao | 2015-05-19 | 3 | -4/+42
|
| ```
| select explode(map(value, key)) from src;
| ```
|
| Throws exception
| ```
| org.apache.spark.sql.AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got _c0 ;
|   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
|   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
|   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
|   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
|   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
|   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
|   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
|   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
|   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
|   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
|   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
|   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
|   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
|   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
| ```
|
| Author: Cheng Hao <hao.cheng@intel.com>
|
| Closes #6178 from chenghao-intel/explode and squashes the following commits:
|
| 916fbe9 [Cheng Hao] add more strict rules for TGF alias
| 5c3f2c5 [Cheng Hao] fix bug in unit test
| e1d93ab [Cheng Hao] Add more unit test
| 19db09e [Cheng Hao] resolve names for generator in projection
* [SPARK-7738] [SQL] [PySpark] add reader and writer API in Python | Davies Liu | 2015-05-19 | 6 | -92/+430
|
| cc rxin, please take a quick look, I'm working on tests.
|
| Author: Davies Liu <davies@databricks.com>
|
| Closes #6238 from davies/readwrite and squashes the following commits:
|
| c7200eb [Davies Liu] update tests
| 9cbf01b [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
| f0c5a04 [Davies Liu] use sqlContext.read.load
| 5f68bc8 [Davies Liu] update tests
| 6437e9a [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
| bcc6668 [Davies Liu] add reader amd writer API in Python
* [SPARK-7652] [MLLIB] Update the implementation of naive Bayes prediction with BLAS | Liang-Chi Hsieh | 2015-05-19 | 1 | -17/+24
|
| JIRA: https://issues.apache.org/jira/browse/SPARK-7652
|
| Author: Liang-Chi Hsieh <viirya@gmail.com>
|
| Closes #6189 from viirya/naive_bayes_blas_prediction and squashes the following commits:
|
| ab611fd [Liang-Chi Hsieh] Remove unnecessary space.
| ddc48b9 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into naive_bayes_blas_prediction
| b5772b4 [Liang-Chi Hsieh] Fix binary compatibility.
| 2f65186 [Liang-Chi Hsieh] Remove toDense.
| 1b6cdfe [Liang-Chi Hsieh] Update the implementation of naive Bayes prediction with BLAS.
* [SPARK-7586] [ML] [DOC] Add docs of Word2Vec in ml package | Xusen Yin | 2015-05-19 | 2 | -0/+165
|
| CC jkbradley.
|
| JIRA [issue](https://issues.apache.org/jira/browse/SPARK-7586).
|
| Author: Xusen Yin <yinxusen@gmail.com>
|
| Closes #6181 from yinxusen/SPARK-7586 and squashes the following commits:
|
| 77014c5 [Xusen Yin] comment fix
| 57a4c07 [Xusen Yin] small fix for docs
| 1178c8f [Xusen Yin] remove the correctness check in java suite
| 1c3f389 [Xusen Yin] delete sbt commit
| 1af152b [Xusen Yin] check python example code
| 1b5369e [Xusen Yin] add docs of word2vec
* [SPARK-7726] Fix Scaladoc false errors | Iulian Dragos | 2015-05-19 | 6 | -3/+15
|
| Visibility rules for static members are different in Scala and Java, and this case requires an explicit static import. Even though these are Java files, they are run through scaladoc, which enforces Scala rules.
|
| Also reverted the commit that reverts the upgrade to 2.11.6
|
| Author: Iulian Dragos <jaguarul@gmail.com>
|
| Closes #6260 from dragos/issue/scaladoc-false-error and squashes the following commits:
|
| f2e998e [Iulian Dragos] Revert "[HOTFIX] Revert "[SPARK-7092] Update spark scala version to 2.11.6""
| 0bad052 [Iulian Dragos] Fix scaladoc faux-error.
* [SPARK-7678] [ML] Fix default random seed in HasSeed | Joseph K. Bradley | 2015-05-19 | 6 | -12/+14
|
| Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.
| Also, removed fixed random seeds from Word2Vec and ALS.
|
| CC: mengxr
|
| Author: Joseph K. Bradley <joseph@databricks.com>
|
| Closes #6251 from jkbradley/scala-fixed-seed and squashes the following commits:
|
| 0e37184 [Joseph K. Bradley] Fixed Word2VecSuite, ALSSuite in spark.ml to use original fixed random seeds
| 678ec3a [Joseph K. Bradley] Removed fixed random seeds from Word2Vec and ALS. Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.
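A default seed derived from the class name can be sketched as follows. The object and helper names are hypothetical, and this is not necessarily the exact expression used in `HasSeed`; the point is only that the value is stable across runs yet differs between classes.

```scala
object SeedDefaults {
  /** Deterministic per-class default: the hash code of the fully qualified class name. */
  def defaultSeed(instance: AnyRef): Long =
    instance.getClass.getName.hashCode.toLong
}
```

Because `String.hashCode` is specified by the Java platform, two instances of the same class get the same default on every JVM, which keeps results reproducible without hard-coding one seed for all algorithms.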
* [SPARK-7047] [ML] ml.Model optional parent support | Joseph K. Bradley | 2015-05-19 | 3 | -1/+7
|
| Made Model.parent transient. Added Model.hasParent to test for null parent
|
| CC: mengxr
|
| Author: Joseph K. Bradley <joseph@databricks.com>
|
| Closes #5914 from jkbradley/parent-optional and squashes the following commits:
|
| d501774 [Joseph K. Bradley] Made Model.parent transient. Added Model.hasParent to test for null parent
* [SPARK-7704] Updating Programming Guides per SPARK-4397 | Dice | 2015-05-19 | 1 | -6/+5
|
| The change per SPARK-4397 lets the compiler find the implicit objects in SparkContext automatically, so we no longer need to import o.a.s.SparkContext._ explicitly and can remove some statements about "implicit conversions" from the latest Programming Guides (1.3.0 and higher).
|
| Author: Dice <poleon.kd@gmail.com>
|
| Closes #6234 from daisukebe/patch-1 and squashes the following commits:
|
| b77ecd9 [Dice] fix a typo
| 45dfcd3 [Dice] rewording per Sean's advice
| a094bcf [Dice] Adding a note for users on any previous releases
| a29be5f [Dice] Updating Programming Guides per SPARK-4397
* [SPARK-7681] [MLLIB] remove mima excludes for 1.3 | Xiangrui Meng | 2015-05-19 | 1 | -8/+1
|
| These excludes are unnecessary for 1.3 because the changes were made in 1.4.x.
|
| Author: Xiangrui Meng <meng@databricks.com>
|
| Closes #6254 from mengxr/SPARK-7681-mima and squashes the following commits:
|
| 7f0cea0 [Xiangrui Meng] remove mima excludes for 1.3
* [SPARK-7723] Fix string interpolation in pipeline examples | Saleem Ansari | 2015-05-19 | 1 | -2/+2
|
| https://issues.apache.org/jira/browse/SPARK-7723
|
| Author: Saleem Ansari <tuxdna@gmail.com>
|
| Closes #6258 from tuxdna/master and squashes the following commits:
|
| 2bb5a42 [Saleem Ansari] Merge branch 'master' into mllib-pipeline
| e39db9c [Saleem Ansari] Fix string interpolation in pipeline examples
* [HOTFIX] Revert "[SPARK-7092] Update spark scala version to 2.11.6" | Patrick Wendell | 2015-05-19 | 2 | -3/+3
|
| This reverts commit a11c8683c76c67f45749a1b50a0912a731fd2487.
|
| For more information see:
| https://issues.apache.org/jira/browse/SPARK-7726
* Fixing a few basic typos in the Programming Guide. | Mike Dusenberry | 2015-05-19 | 1 | -3/+3
|
| Just a few minor fixes in the guide, so a new JIRA issue was not created per the guidelines.
|
| Author: Mike Dusenberry <dusenberrymw@gmail.com>
|
| Closes #6240 from dusenberrymw/Fix_Programming_Guide_Typos and squashes the following commits:
|
| ffa76eb [Mike Dusenberry] Fixing a few basic typos in the Programming Guide.
* [SPARK-7581] [ML] [DOC] User guide for spark.ml PolynomialExpansion | Xusen Yin | 2015-05-19 | 2 | -0/+174
|
| JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581).
|
| CC jkbradley
|
| Author: Xusen Yin <yinxusen@gmail.com>
|
| Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits:
|
| 1a7d80d [Xusen Yin] merge with master
| 892a8e9 [Xusen Yin] fix python 3 compatibility
| ec935bf [Xusen Yin] small fix
| 3e9fa1d [Xusen Yin] delete note
| 69fcf85 [Xusen Yin] simplify and add python example
| 81d21dc [Xusen Yin] add programming guide for Polynomial Expansion
| 40babfb [Xusen Yin] add java test suite for PolynomialExpansion
* [HOTFIX] Fixing style failures in Kinesis source | Patrick Wendell | 2015-05-19 | 2 | -4/+6
|
* [HOTFIX]: Java 6 Build Breaks | Patrick Wendell | 2015-05-19 | 2 | -15/+2
|
| These were blocking RC1 so I fixed them manually.
* [SPARK-7687] [SQL] DataFrame.describe() should cast all aggregates to String | Josh Rosen | 2015-05-18 | 3 | -14/+19
|
| In `DataFrame.describe()`, the `count` aggregate produces an integer, the `avg` and `stdev` aggregates produce doubles, and `min` and `max` aggregates can produce varying types depending on what type of column they're applied to. As a result, we should cast all aggregate results to String so that `describe()`'s output types match its declared output schema.
|
| Author: Josh Rosen <joshrosen@databricks.com>
|
| Closes #6218 from JoshRosen/SPARK-7687 and squashes the following commits:
|
| 146b615 [Josh Rosen] Fix R test.
| 2974bd5 [Josh Rosen] Cast to string type instead
| f206580 [Josh Rosen] Cast to double to fix SPARK-7687
| 307ecbf [Josh Rosen] Add failing regression test for SPARK-7687
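The type mismatch described above, and the cast that resolves it, can be illustrated without Spark. In this sketch `DescribeSketch.describe` is a hypothetical helper over a plain `Seq[Double]`, and the standard deviation is left out to keep it short; it is not the `DataFrame.describe()` implementation.

```scala
object DescribeSketch {
  /** Summary rows as (statistic, value) pairs with every value rendered as a String. */
  def describe(values: Seq[Double]): Seq[(String, String)] = {
    require(values.nonEmpty, "need at least one value")
    val count: Long = values.size.toLong   // integral result
    val mean: Double = values.sum / count  // floating-point result
    // Casting each aggregate to String gives all rows one uniform column type.
    Seq(
      "count" -> count.toString,
      "mean" -> mean.toString,
      "min" -> values.min.toString,
      "max" -> values.max.toString
    )
  }
}
```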
* [SPARK-7150] SparkContext.range() and SQLContext.range() | Daoyuan Wang | 2015-05-18 | 7 | -0/+189
|
| This PR is based on #6081, thanks adrian-wang.
|
| Closes #6081
|
| Author: Daoyuan Wang <daoyuan.wang@intel.com>
| Author: Davies Liu <davies@databricks.com>
|
| Closes #6230 from davies/range and squashes the following commits:
|
| d3ce5fe [Davies Liu] add tests
| 789eda5 [Davies Liu] add range() in Python
| 4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
| cbf5200 [Daoyuan Wang] let's add python support in a separate PR
| f45e3b2 [Daoyuan Wang] remove redundant toLong
| 617da76 [Daoyuan Wang] fix safe marge for corner cases
| 867c417 [Daoyuan Wang] fix
| 13dbe84 [Daoyuan Wang] update
| bd998ba [Daoyuan Wang] update comments
| d3a0c1b [Daoyuan Wang] add range api()
* [SPARK-7681] [MLLIB] Add SparseVector support for gemv | Liang-Chi Hsieh | 2015-05-18 | 4 | -33/+240
|
| JIRA: https://issues.apache.org/jira/browse/SPARK-7681
|
| Author: Liang-Chi Hsieh <viirya@gmail.com>
|
| Closes #6209 from viirya/sparsevector_gemv and squashes the following commits:
|
| ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y.
| b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector.
| 57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4.
| 458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too.
| 054f05d [Liang-Chi Hsieh] Fix scala style.
| 410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized.
| 4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix.
| 5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv
| c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix.
* [SPARK-7692] Updated Kinesis examples | Tathagata Das | 2015-05-18 | 2 | -237/+268
|
| - Updated Kinesis examples to use stable API
| - Cleaned up comments, etc.
| - Renamed KinesisWordCountProducerASL to KinesisWordProducerASL
|
| Author: Tathagata Das <tathagata.das1565@gmail.com>
|
| Closes #6249 from tdas/kinesis-examples and squashes the following commits:
|
| 7cc307b [Tathagata Das] More tweaks
| f080872 [Tathagata Das] More cleanup
| 841987f [Tathagata Das] Small update
| 011cbe2 [Tathagata Das] More fixes
| b0d74f9 [Tathagata Das] Updated examples.
* [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners | jerluc | 2015-05-18 | 2 | -2/+2
|
| PR per [SPARK-7621](https://issues.apache.org/jira/browse/SPARK-7621), which makes both `KafkaReceiver` and `ReliableKafkaReceiver` report their errors to the `ReceiverTracker`, which in turn will add the events to the bus to fire off any registered `StreamingListener`s.
|
| Author: jerluc <jeremyalucas@gmail.com>
|
| Closes #6204 from jerluc/master and squashes the following commits:
|
| 82439a5 [jerluc] [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners
* [SPARK-7624] Revert #4147 | Davies Liu | 2015-05-18 | 1 | -21/+2
|
| Author: Davies Liu <davies@databricks.com>
|
| Closes #6172 from davies/revert_4147 and squashes the following commits:
|
| 3bfbbde [Davies Liu] Revert #4147
* [SQL] Fix serializability of ORC table scan | Michael Armbrust | 2015-05-18 | 1 | -1/+1
|
| A follow-up to #6244.
|
| Author: Michael Armbrust <michael@databricks.com>
|
| Closes #6247 from marmbrus/fixOrcTests and squashes the following commits:
|
| e39ee1b [Michael Armbrust] [SQL] Fix serializability of ORC table scan
* [SPARK-7063] when lz4 compression is used, it causes core dump | Jihong MA | 2015-05-18 | 1 | -1/+1
|
| This fix addresses an issue found in lz4 1.2.0 that caused a core dump in Spark Core with the IBM JDK. The issue is fixed in lz4 1.3.0.
|
| Author: Jihong MA <linlin200605@gmail.com>
|
| Closes #6226 from JihongMA/SPARK-7063-1 and squashes the following commits:
|
| 0cca781 [Jihong MA] SPARK-7063
| 4559ed5 [Jihong MA] SPARK-7063
| daa520f [Jihong MA] SPARK-7063 upgrade lz4 jars
| 71738ee [Jihong MA] Merge remote-tracking branch 'upstream/master'
| dfaa971 [Jihong MA] SPARK-7265 minor fix of the content
| ace454d [Jihong MA] SPARK-7265 take out PySpark on YARN limitation
| 9ea0832 [Jihong MA] Merge remote-tracking branch 'upstream/master'
| d5bf3f5 [Jihong MA] Merge remote-tracking branch 'upstream/master'
| 7b842e6 [Jihong MA] Merge remote-tracking branch 'upstream/master'
| 9c84695 [Jihong MA] SPARK-7265 address review comment
| a399aa6 [Jihong MA] SPARK-7265 Improving documentation for Spark SQL Hive support
* [SPARK-7501] [STREAMING] DAG visualization: show DStream operations | Andrew Or | 2015-05-18 | 14 | -145/+484
|
| This is similar to #5999, but for streaming. Roughly 200 lines are tests.
|
| One thing to note here is that we already do some kind of scoping thing for call sites, so this patch adds the new RDD operation scoping logic in the same place. Also, this patch adds a `try finally` block to set the relevant variables in a safer way.
|
| tdas zsxwing
|
| ------------------------
| **Before**
| <img src="https://cloud.githubusercontent.com/assets/2133137/7625996/d88211b8-f9b4-11e4-90b9-e11baa52d6d7.png" width="450px"/>
|
| --------------------------
| **After**
| <img src="https://cloud.githubusercontent.com/assets/2133137/7625997/e0878f8c-f9b4-11e4-8df3-7dd611b13c87.png" width="650px"/>
|
| Author: Andrew Or <andrew@databricks.com>
|
| Closes #6034 from andrewor14/dag-viz-streaming and squashes the following commits:
|
| 932a64a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
| e685df9 [Andrew Or] Rename createRDDWith
| 84d0656 [Andrew Or] Review feedback
| 697c086 [Andrew Or] Fix tests
| 53b9936 [Andrew Or] Set scopes for foreachRDD properly
| 1881802 [Andrew Or] Refactor DStream scope names again
| af4ba8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
| fd07d22 [Andrew Or] Make MQTT lower case
| f6de871 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
| 0ca1801 [Andrew Or] Remove a few unnecessary withScopes on aliases
| fa4e5fb [Andrew Or] Pass in input stream name rather than defining it from within
| 1af0b0e [Andrew Or] Fix style
| 074c00b [Andrew Or] Review comments
| d25a324 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
| e4a93ac [Andrew Or] Fix tests?
| 25416dc [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
| 9113183 [Andrew Or] Add tests for DStream scopes
| b3806ab [Andrew Or] Fix test
| bb80bbb [Andrew Or] Fix MIMA?
| 5c30360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
| 5703939 [Andrew Or] Rename operations that create InputDStreams
| 7c4513d [Andrew Or] Group RDDs by DStream operations and batches
| bf0ab6e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
| 05c2676 [Andrew Or] Wrap many more methods in withScope
| c121047 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
| 65ef3e9 [Andrew Or] Fix NPE
| a0d3263 [Andrew Or] Scope streaming operations instead of RDD operations
* [HOTFIX] Fix ORC build break | Michael Armbrust | 2015-05-18 | 1 | -5/+6
|
| Fix break caused by merging #6225 and #6194.
|
| Author: Michael Armbrust <michael@databricks.com>
|
| Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:
|
| b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break
* [SPARK-7658] [STREAMING] [WEBUI] Update the mouse behaviors for the timeline graphs | zsxwing | 2015-05-18 | 3 | -2/+47
|
| 1. If the user clicks a point of a batch, scroll down to the corresponding batch row and highlight it, then restore the batch row after 3 seconds if necessary.
|
| 2. Add "#batches" to the histogram graphs.
|
| ![screen shot 2015-05-14 at 7 36 19 pm](https://cloud.githubusercontent.com/assets/1000778/7646108/84f4a014-fa73-11e4-8c13-1903d267e60f.png)
|
| ![screen shot 2015-05-14 at 7 36 53 pm](https://cloud.githubusercontent.com/assets/1000778/7646109/8b11154a-fa73-11e4-820b-8ece9fa6ee3e.png)
|
| ![screen shot 2015-05-14 at 7 36 34 pm](https://cloud.githubusercontent.com/assets/1000778/7646111/93828272-fa73-11e4-89f8-580670144d3c.png)
|
| Author: zsxwing <zsxwing@gmail.com>
|
| Closes #6168 from zsxwing/SPARK-7658 and squashes the following commits:
|
| c242b00 [zsxwing] Change 5 seconds to 3 seconds
| 31fd0aa [zsxwing] Remove the mouseover highlight feature
| 06c6f6f [zsxwing] Merge branch 'master' into SPARK-7658
| 2eaff06 [zsxwing] Merge branch 'master' into SPARK-7658
| 108d56c [zsxwing] Update the mouse behaviors for the timeline graphs
* [SPARK-6216] [PYSPARK] check python version of worker with driver | Davies Liu | 2015-05-18 | 10 | -14/+26
|
| This PR reverts #5404 and instead passes the driver's Python version into the JVM and checks it in the worker before deserializing the closure, so the check works across different major versions of Python.
|
| Author: Davies Liu <davies@databricks.com>
|
| Closes #6203 from davies/py_version and squashes the following commits:
|
| b8fb76e [Davies Liu] fix test
| 6ce5096 [Davies Liu] use string for version
| 47c6278 [Davies Liu] check python version of worker with driver
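The idea of comparing version strings before any closure is deserialized can be sketched as below. The real check lives in the Python worker; this Scala version, the names `PythonVersionCheck`, `majorMinor`, and `check`, and the choice to compare only the major.minor part are all assumptions made for illustration.

```scala
object PythonVersionCheck {
  /** Reduces a version such as "2.7.9" to its "major.minor" prefix, "2.7". */
  def majorMinor(version: String): String =
    version.split('.').take(2).mkString(".")

  /** Right(()) when the versions are compatible, otherwise Left with an error message. */
  def check(driverVersion: String, workerVersion: String): Either[String, Unit] = {
    val driver = majorMinor(driverVersion)
    val worker = majorMinor(workerVersion)
    if (driver == worker) Right(())
    else Left(s"Python in worker has version $worker, but the driver has version $driver")
  }
}
```

Comparing plain strings means the check itself never depends on anything that differs between Python major versions, so it can run before the closure is touched.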
* [SPARK-7673] [SQL] WIP: HadoopFsRelation and ParquetRelation2 performance optimizations | Cheng Lian | 2015-05-18 | 4 | -91/+117
|
| This PR introduces several performance optimizations to `HadoopFsRelation` and `ParquetRelation2`:
|
| 1. Moving `FileStatus` listing from `DataSourceStrategy` into a cache within `HadoopFsRelation`.
|
|    This new cache generalizes and replaces the one used in `ParquetRelation2`.
|
|    This also introduces an interface change: to reuse cached `FileStatus` objects, `HadoopFsRelation.buildScan` methods now receive `Array[FileStatus]` instead of `Array[String]`.
|
| 2. When Parquet task side metadata reading is enabled, skip reading row group information when reading Parquet footers.
|
|    This is basically what PR #5334 does. Also, we now use `ParquetFileReader.readAllFootersInParallel` to read footers in parallel.
|
| Another optimization in question is, instead of asking `HadoopFsRelation.buildScan` to return an `RDD[Row]` for a single selected partition and then union them all, we ask it to return an `RDD[Row]` for all selected partitions. This optimization is based on the fact that Hadoop configuration broadcasting used in `NewHadoopRDD` takes 34% of the time in the following microbenchmark. However, this complicates data source user code because user code must merge partition values manually.
|
| To check the cost of broadcasting in `NewHadoopRDD`, I also ran the microbenchmark after removing the `broadcast` call in `NewHadoopRDD`. All results are shown below.
|
| ### Microbenchmark
|
| #### Preparation code
|
| Generating a partitioned table with 50k partitions, 1k rows per partition:
|
| ```scala
| import sqlContext._
| import sqlContext.implicits._
|
| for (n <- 0 until 500) {
|   val data = for {
|     p <- (n * 10) until ((n + 1) * 10)
|     i <- 0 until 1000
|   } yield (i, f"val_$i%04d", f"$p%04d")
|
|   data.
|     toDF("a", "b", "p").
|     write.
|     partitionBy("p").
|     mode("append").
|     parquet(path)
| }
| ```
|
| #### Benchmarking code
|
| ```scala
| import sqlContext._
| import sqlContext.implicits._
| import org.apache.spark.sql.types._
| import com.google.common.base.Stopwatch
|
| val path = "hdfs://localhost:9000/user/lian/5k"
|
| def benchmark(n: Int)(f: => Unit) {
|   val stopwatch = new Stopwatch()
|
|   def run() = {
|     stopwatch.reset()
|     stopwatch.start()
|     f
|     stopwatch.stop()
|     stopwatch.elapsedMillis()
|   }
|
|   val records = (0 until n).map(_ => run())
|
|   (0 until n).foreach(i => println(s"Round $i: ${records(i)} ms"))
|   println(s"Average: ${records.sum / n.toDouble} ms")
| }
|
| benchmark(3) { read.parquet(path).explain(extended = true) }
| ```
|
| #### Results
|
| Before:
|
| ```
| Round 0: 72528 ms
| Round 1: 68938 ms
| Round 2: 65372 ms
| Average: 68946.0 ms
| ```
|
| After:
|
| ```
| Round 0: 59499 ms
| Round 1: 53645 ms
| Round 2: 53844 ms
| Round 3: 49093 ms
| Round 4: 50555 ms
| Average: 53327.2 ms
| ```
|
| Also removing Hadoop configuration broadcasting:
|
| (Note that I was testing on a local laptop, thus network cost is pretty low.)
|
| ```
| Round 0: 15806 ms
| Round 1: 14394 ms
| Round 2: 14699 ms
| Round 3: 15334 ms
| Round 4: 14123 ms
| Average: 14871.2 ms
| ```
|
| Author: Cheng Lian <lian@databricks.com>
|
| Closes #6225 from liancheng/spark-7673 and squashes the following commits:
|
| 2d58a2b [Cheng Lian] Skips reading row group information when using task side metadata reading
| 7aa3748 [Cheng Lian] Optimizes FileStatusCache by introducing a map from parent directories to child files
| ba41250 [Cheng Lian] Reuses HadoopFsRelation FileStatusCache in ParquetRelation2
| 3d278f7 [Cheng Lian] Fixes a bug when reading a single Parquet data file
| b84612a [Cheng Lian] Fixes Scala style issue
| 6a08b02 [Cheng Lian] WIP: Moves file status cache into HadoopFSRelation
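For anyone rerunning a measurement like this without Guava on the classpath, the timing harness can be approximated with `System.nanoTime`. This is a self-contained variant written for illustration, not the author's code; the object name `Bench` and the choice to return the recorded timings are assumptions.

```scala
object Bench {
  /** Runs `f` `n` times, prints each round, and returns the wall-clock times in milliseconds. */
  def benchmark(n: Int)(f: => Unit): Seq[Long] = {
    val records = (0 until n).map { _ =>
      val start = System.nanoTime()
      f // evaluate the by-name block once per round
      (System.nanoTime() - start) / 1000000L
    }
    records.zipWithIndex.foreach { case (ms, i) => println(s"Round $i: $ms ms") }
    println(s"Average: ${records.sum / n.toDouble} ms")
    records
  }
}
```

Returning the timings, rather than only printing them, makes it easy to compare runs programmatically.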
* [SPARK-7567] [SQL] [follow-up] Use a new flag to set output committer based on mapreduce apis | Yin Huai | 2015-05-18 | 4 | -9/+29
|
| cc liancheng marmbrus
|
| Author: Yin Huai <yhuai@databricks.com>
|
| Closes #6130 from yhuai/directOutput and squashes the following commits:
|
| 312b07d [Yin Huai] A data source can use spark.sql.sources.outputCommitterClass to override the output committer.
* [SPARK-7269] [SQL] Incorrect analysis for aggregation(use semanticEquals) | Wenchen Fan | 2015-05-18 | 6 | -26/+48
|
| A modified version of https://github.com/apache/spark/pull/6110, use `semanticEquals` to make it more efficient.
|
| Author: Wenchen Fan <cloud0fan@outlook.com>
|
| Closes #6173 from cloud-fan/7269 and squashes the following commits:
|
| e4a3cc7 [Wenchen Fan] address comments
| cc02045 [Wenchen Fan] consider elements length equal
| d7ff8f4 [Wenchen Fan] fix 7269
* [SPARK-7631] [SQL] treenode argString should not print children | scwf | 2015-05-18 | 1 | -0/+1
|
| spark-sql>
|          > explain extended
|          > select * from (
|          > select key from src union all
|          > select key from src) t;
|
| Currently the Spark plan prints children in argString:
| ```
| == Physical Plan ==
| Union[ HiveTableScan key#1, (MetastoreRelation default, src, None), None, HiveTableScan key#3, (MetastoreRelation default, src, None), None]
|  HiveTableScan key#1, (MetastoreRelation default, src, None), None
|  HiveTableScan key#3, (MetastoreRelation default, src, None), None
| ```
|
| After this patch:
| ```
| == Physical Plan ==
| Union
|  HiveTableScan [key#1], (MetastoreRelation default, src, None), None
|  HiveTableScan [key#3], (MetastoreRelation default, src, None), None
| ```
|
| I have tested this locally
|
| Author: scwf <wangfei1@huawei.com>
|
| Closes #6144 from scwf/fix-argString and squashes the following commits:
|
| 1a642e0 [scwf] fix treenode argString
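The behaviour shown in the two plans above can be modelled with a tiny tree type. This is a simplified stand-in for Catalyst's `TreeNode`, with hypothetical names (`ArgStringSketch`, `Plan`, `Scan`, `Union`); it only illustrates the idea of leaving constructor arguments that are children out of `argString`.

```scala
object ArgStringSketch {
  sealed trait Plan extends Product {
    def children: Seq[Plan]

    /** Constructor arguments, skipping any that are children of this node. */
    def argString: String = productIterator.flatMap {
      case seq: Seq[_] if seq.nonEmpty && seq.forall(e => children.contains(e)) => Nil
      case p: Plan if children.contains(p) => Nil
      case other => List(other)
    }.mkString(", ")

    /** One-line description: node name followed by its non-child arguments. */
    def simpleString: String = s"$productPrefix $argString".trim
  }

  case class Scan(table: String) extends Plan {
    def children: Seq[Plan] = Nil
  }

  case class Union(inputs: Seq[Plan]) extends Plan {
    def children: Seq[Plan] = inputs
  }
}
```

Since children are already printed on their own indented lines of the tree, dropping them from the argument string removes the duplication seen in the "before" plan.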
* [SPARK-2883] [SQL] ORC data source for Spark SQL | Zhan Zhang | 2015-05-18 | 14 | -76/+1477
|
| This PR updates PR #6135 authored by zhzhan from Hortonworks.
|
| ----
|
| This PR implements a Spark SQL data source for accessing ORC files.
|
| > **NOTE**
| >
| > Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under the `org.apache.spark.sql.hive` package, and must be used with `HiveContext`. However, it doesn't require an existing Hive installation to access ORC files.
|
| 1. Saving/loading ORC files without contacting Hive metastore
| 2. Support for complex data types (i.e. array, map, and struct)
| 3. Aware of common optimizations provided by Spark SQL:
|    - Column pruning
|    - Partitioning pruning
|    - Filter push-down
| 4. Schema evolution support
| 5. Hive metastore table conversion
|
| This PR also includes initial work done by scwf from Huawei (PR #3753).
|
| Author: Zhan Zhang <zhazhan@gmail.com>
| Author: Cheng Lian <lian@databricks.com>
|
| Closes #6194 from liancheng/polishing-orc and squashes the following commits:
|
| 55ecd96 [Cheng Lian] Reorganizes ORC test suites
| d4afeed [Cheng Lian] Addresses comments
| 21ada22 [Cheng Lian] Adds @since and @Experimental annotations
| 128bd3b [Cheng Lian] ORC filter bug fix
| d734496 [Cheng Lian] Polishes the ORC data source
| 2650a42 [Zhan Zhang] resolve review comments
| 3c9038e [Zhan Zhang] resolve review comments
| 7b3c7c5 [Zhan Zhang] save mode fix
| f95abfd [Zhan Zhang] reuse test suite
| 7cc2c64 [Zhan Zhang] predicate fix
| 4e61c16 [Zhan Zhang] minor change
| 305418c [Zhan Zhang] orc data source support
* [SPARK-7380] [MLLIB] pipeline stages should be copyable in PythonXiangrui Meng2015-05-1816-261/+498
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:

1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:

413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master
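To make the idea of copyable stages concrete, here is a minimal sketch. It is not the PySpark implementation: `Stage`, `MeanEstimator`, `extract_param_map`, and the `scale` param are hypothetical names chosen for illustration; only `_paramMap`, `_defaultParamMap`, `_fit`, and `fit` come from the commit message. It shows how a `copy` method lets `fit` apply per-call overrides without mutating the original stage, and how a list of param maps yields one result per map.

```python
import copy


class Stage:
    """Hypothetical pipeline stage with default and user-set params."""

    def __init__(self, **defaults):
        self._defaultParamMap = dict(defaults)
        self._paramMap = {}

    def extract_param_map(self, extra=None):
        # Precedence: defaults < user-set values < extra overrides.
        merged = dict(self._defaultParamMap)
        merged.update(self._paramMap)
        merged.update(extra or {})
        return merged

    def copy(self, extra=None):
        # Shallow-copy the stage, then give the copy its own param maps so
        # changes to the copy never leak back into the original.
        that = copy.copy(self)
        that._defaultParamMap = dict(self._defaultParamMap)
        that._paramMap = dict(self._paramMap)
        that._paramMap.update(extra or {})
        return that


class MeanEstimator(Stage):
    """Toy estimator: "fits" by scaling the mean of the data."""

    def _fit(self, data):
        scale = self.extract_param_map()["scale"]
        return scale * sum(data) / len(data)

    def fit(self, data, params=None):
        # A list of param maps yields one fitted result per map.
        if isinstance(params, (list, tuple)):
            return [self.fit(data, p) for p in params]
        # Otherwise fit a copy that carries the overrides, if any.
        return self.copy(params)._fit(data)


est = MeanEstimator(scale=1.0)
models = est.fit([1.0, 2.0, 3.0], [{"scale": 1.0}, {"scale": 10.0}])
```

Because every call fits a copy, `est` keeps its original params after the call, which is what makes fitting the same estimator under several param maps safe.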
* [SQL] [MINOR] use private[this] for internal field in ScalaUdfWenchen Fan2015-05-181-4/+4
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6235 from cloud-fan/tmp and squashes the following commits:

8f16367 [Wenchen Fan] use private[this]
* [SPARK-7570] [SQL] Ignores _temporary during partition discoveryCheng Lian2015-05-182-19/+27
Author: Cheng Lian <lian@databricks.com>

Closes #6091 from liancheng/spark-7570 and squashes the following commits:

8ff07e8 [Cheng Lian] Ignores _temporary during partition discovery
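As a rough illustration of what ignoring `_temporary` during partition discovery means, consider the sketch below. It is an assumption-laden toy, not Spark's discovery code: the function `discover_partitions`, the example paths, and the rule of skipping any file under a `_temporary` directory are all chosen for illustration.

```python
from pathlib import PurePosixPath


def discover_partitions(base, paths):
    """Map each data file under base to its {column: value} partition spec."""
    found = {}
    for raw in paths:
        rel = PurePosixPath(raw).relative_to(base)
        dirs = rel.parts[:-1]  # directory components between base and the file
        if "_temporary" in dirs:
            continue  # leftover of an in-progress or failed write job
        # Directories named key=value contribute one partition column each.
        found[raw] = dict(d.split("=", 1) for d in dirs if "=" in d)
    return found


paths = [
    "/tmp/t/i=1/part-0.parquet",
    "/tmp/t/i=2/part-0.parquet",
    "/tmp/t/_temporary/0/_temporary/attempt_0/part-0.parquet",
]
result = discover_partitions("/tmp/t", paths)
```

Without the `_temporary` check, the third path would be treated as an unpartitioned data file sitting next to partitioned ones, which is the kind of inconsistency a discovery step has to avoid.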
* [SPARK-6888] [SQL] Make the jdbc driver handling user-definableRene Treffer2015-05-186-126/+295
Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect) and allow developers to change the dialects on the fly (for new JDBCRDDs only).

Some types (like an unsigned 64-bit number) cannot be trivially mapped to Java. The status quo is that the RDD will fail to load. This patch makes it possible to override the type mapping to read e.g. 64-bit numbers as strings and handle them afterwards in software.

JDBCSuite has an example that maps all types to String, which should always work (at the cost of extra code afterwards).

As a side effect it should now be possible to develop simple dialects out-of-tree and even with spark-shell.

Author: Rene Treffer <treffer@measite.de>

Closes #5555 from rtreffer/jdbc-dialects and squashes the following commits:

3cbafd7 [Rene Treffer] [SPARK-6888] ignore classes belonging to changed API in MIMA report
fe7e2e8 [Rene Treffer] [SPARK-6888] Make the jdbc driver handling user-definable
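The dialect mechanism described in this commit can be sketched in a few lines. This is a Python analogue of the idea, not Spark's Scala API: every name here (`Dialect`, `UnsignedAsStringDialect`, `register_dialect`, `resolve_type`, `DEFAULT_TYPES`) and the type names in the table are hypothetical. It shows a registry of user-definable dialects that can override the default type mapping for a given JDBC URL, for example to read unsigned 64-bit columns as strings.

```python
# Hypothetical default mapping from database type names to engine types.
DEFAULT_TYPES = {"INT": "int", "BIGINT": "long", "VARCHAR": "string"}


class Dialect:
    """Hypothetical base dialect: defers every decision to the defaults."""

    def can_handle(self, url):
        return False

    def map_type(self, type_name):
        return None  # None means "no opinion, use the default mapping"


class UnsignedAsStringDialect(Dialect):
    """Reads unsigned 64-bit columns as strings, which always fit."""

    def can_handle(self, url):
        return url.startswith("jdbc:mysql:")

    def map_type(self, type_name):
        return "string" if type_name == "BIGINT UNSIGNED" else None


_dialects = []


def register_dialect(dialect):
    _dialects.insert(0, dialect)  # the most recently registered dialect wins


def resolve_type(url, type_name):
    # Ask each matching dialect first, then fall back to the defaults.
    for dialect in _dialects:
        if dialect.can_handle(url):
            mapped = dialect.map_type(type_name)
            if mapped is not None:
                return mapped
    if type_name not in DEFAULT_TYPES:
        raise ValueError(f"unsupported type: {type_name}")
    return DEFAULT_TYPES[type_name]


register_dialect(UnsignedAsStringDialect())
mapped = resolve_type("jdbc:mysql://host/db", "BIGINT UNSIGNED")
```

Returning `None` from `map_type` keeps a dialect small: it only needs to name the types it wants to change, and everything else falls through to the default mapping. Without a registered dialect, the unsigned type would raise, mirroring the "fail to load" status quo.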
* [SPARK-7627] [SPARK-7472] DAG visualization: style skipped stagesAndrew Or2015-05-186-108/+352
This patch fixes two things:

**SPARK-7627.** Cached RDDs no longer light up on the job page. This is a simple fix.
**SPARK-7472.** Display skipped stages differently from normal stages.

The latter is a major UX issue. Because we link the job viz to the stage viz even for skipped stages, the user may inadvertently click into the stage page of a skipped stage, which is empty.

-------------------
<img src="https://cloud.githubusercontent.com/assets/2133137/7675241/de1a3da6-fcea-11e4-8101-88055cef78c5.png" width="300px" />

Author: Andrew Or <andrew@databricks.com>

Closes #6171 from andrewor14/dag-viz-skipped and squashes the following commits:

f261797 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
0eda358 [Andrew Or] Tweak skipped stage border color
c604150 [Andrew Or] Tweak grayscale colors
7010676 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
762b541 [Andrew Or] Use special prefix for stage clusters to avoid collisions
51c95b9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
b928cd4 [Andrew Or] Fix potential leak + write tests for it
7c4c364 [Andrew Or] Show skipped stages differently
7cc34ce [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
c121fa2 [Andrew Or] Fix cache color
* [SPARK-7272] [MLLIB] User guide for PMML model exportVincenzo Selvaggio2015-05-182-0/+87
https://issues.apache.org/jira/browse/SPARK-7272

Author: Vincenzo Selvaggio <vselvaggio@hotmail.it>

Closes #6219 from selvinsource/mllib_pmml_model_export_SPARK-7272 and squashes the following commits:

c866fb8 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
1beda98 [Vincenzo Selvaggio] [SPARK-7272] Initial user guide for pmml export
d670662 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
2731375 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
680dc33 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
2e298b5 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
a932f51 [Vincenzo Selvaggio] Create mllib-pmml-model-export.md
* [SPARK-6657] [PYSPARK] Fix doc warningsXiangrui Meng2015-05-184-10/+11
Fixed the following warnings in `make clean html` under `python/docs`:

~~~
/Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:3: ERROR: Unexpected indentation.
/Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:4: WARNING: Block quote ends without a blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:3: ERROR: Unexpected indentation.
/Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:4: WARNING: Block quote ends without a blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.replace:16: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:8: ERROR: Unexpected indentation.
/Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:9: WARNING: Block quote ends without a blank line; unexpected unindent.
~~~

davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #6221 from mengxr/SPARK-6657 and squashes the following commits:

e3f83fe [Xiangrui Meng] fix sql and streaming doc warnings
2b4371e [Xiangrui Meng] fix mllib python doc warnings