Commit message | Author | Age | Files | Lines
* [SPARK-7744] [DOCS] [MLLIB] "Distributed matrix" section in MLlib "Data Types" documentation should be reordered (Mike Dusenberry, 2015-05-19; 1 file, -64/+64)

  The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the latter three types, and RowMatrix is considered the "basic" distributed matrix. This will improve comprehensibility of the "Distributed matrix" section, especially for the new reader.

  Author: Mike Dusenberry <dusenberrymw@gmail.com>

  Closes #6270 from dusenberrymw/Reorder_MLlib_Data_Types_Distributed_matrix_docs and squashes the following commits:

  6313bab [Mike Dusenberry] The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the latter three types, and RowMatrix is considered the "basic" distributed matrix. This will improve comprehensibility of the "Distributed matrix" section, especially for the new reader.

  (cherry picked from commit 3860520633770cc5719b2cdebe6dc3608798386d)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7662] [SQL] Resolve correct names for generator in projection (Cheng Hao, 2015-05-19; 3 files, -4/+42)

  ```
  select explode(map(value, key)) from src;
  ```

  Throws exception:

  ```
  org.apache.spark.sql.AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got _c0 ;
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  ```

  Author: Cheng Hao <hao.cheng@intel.com>

  Closes #6178 from chenghao-intel/explode and squashes the following commits:

  916fbe9 [Cheng Hao] add more strict rules for TGF alias
  5c3f2c5 [Cheng Hao] fix bug in unit test
  e1d93ab [Cheng Hao] Add more unit test
  19db09e [Cheng Hao] resolve names for generator in projection

  (cherry picked from commit bcb1ff81468eb4afc7c03b2bca18e99cc1ccf6b8)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
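A hedged sketch of what the fix enables: with generator output names resolved in projection, the failing query above runs without an explicit AS clause (assuming a Hive `src` table with `key`/`value` columns; the Scala wrapper is illustrative, not from the patch):

```scala
// Sketch only: after the fix the generator's output columns resolve on their own.
val df = hiveContext.sql("SELECT explode(map(value, key)) FROM src")
df.printSchema()  // the map entry's key and value columns, no aliases required
```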
* [SPARK-7738] [SQL] [PySpark] add reader and writer API in Python (Davies Liu, 2015-05-19; 6 files, -92/+430)

  cc rxin, please take a quick look, I'm working on tests.

  Author: Davies Liu <davies@databricks.com>

  Closes #6238 from davies/readwrite and squashes the following commits:

  c7200eb [Davies Liu] update tests
  9cbf01b [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
  f0c5a04 [Davies Liu] use sqlContext.read.load
  5f68bc8 [Davies Liu] update tests
  6437e9a [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
  bcc6668 [Davies Liu] add reader amd writer API in Python

  (cherry picked from commit 4de74d2602f6577c3c8458aa85377e89c19724ca)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
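For context, a minimal sketch of the unified read/write surface that the new Python API mirrors, written here in Scala (paths are placeholders, not from the commit):

```scala
// Load with the generic reader, save with the generic writer (Spark 1.4 style).
val df = sqlContext.read.format("json").load("examples/people.json")
df.write.mode("overwrite").parquet("people.parquet")
```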
* [SPARK-7652] [MLLIB] Update the implementation of naive Bayes prediction with BLAS (Liang-Chi Hsieh, 2015-05-19; 1 file, -17/+24)

  JIRA: https://issues.apache.org/jira/browse/SPARK-7652

  Author: Liang-Chi Hsieh <viirya@gmail.com>

  Closes #6189 from viirya/naive_bayes_blas_prediction and squashes the following commits:

  ab611fd [Liang-Chi Hsieh] Remove unnecessary space.
  ddc48b9 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into naive_bayes_blas_prediction
  b5772b4 [Liang-Chi Hsieh] Fix binary compatibility.
  2f65186 [Liang-Chi Hsieh] Remove toDense.
  1b6cdfe [Liang-Chi Hsieh] Update the implementation of naive Bayes prediction with BLAS.

  (cherry picked from commit c12dff9b82e4869f866a9b96ce0bf05503dd7dda)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
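For intuition, prediction in multinomial naive Bayes is one matrix-vector product per example, y = pi + Theta * x, which is exactly what a BLAS gemv computes. A plain-Scala illustration (not MLlib's private BLAS wrapper):

```scala
// Illustrative only: class scores as prior + weighted feature sum, argmax wins.
def predict(pi: Array[Double], theta: Array[Array[Double]], x: Array[Double]): Int = {
  val scores = theta.zip(pi).map { case (row, prior) =>
    prior + row.zip(x).map { case (w, xi) => w * xi }.sum
  }
  scores.indexOf(scores.max)
}
```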
* [SPARK-7586] [ML] [DOC] Add docs of Word2Vec in ml package (Xusen Yin, 2015-05-19; 2 files, -0/+165)

  CC jkbradley. JIRA [issue](https://issues.apache.org/jira/browse/SPARK-7586).

  Author: Xusen Yin <yinxusen@gmail.com>

  Closes #6181 from yinxusen/SPARK-7586 and squashes the following commits:

  77014c5 [Xusen Yin] comment fix
  57a4c07 [Xusen Yin] small fix for docs
  1178c8f [Xusen Yin] remove the correctness check in java suite
  1c3f389 [Xusen Yin] delete sbt commit
  1af152b [Xusen Yin] check python example code
  1b5369e [Xusen Yin] add docs of word2vec

  (cherry picked from commit 68fb2a46edc95f867d4b28597d20da2597f008c1)
  Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
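A condensed Scala example in the spirit of the new guide (the toy corpus is an assumption):

```scala
import org.apache.spark.ml.feature.Word2Vec

// Fit Word2Vec on a tiny tokenized corpus, then map each document to a vector.
val documentDF = sqlContext.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)
model.transform(documentDF).select("result").show()
```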
* [SPARK-7726] Fix Scaladoc false errors (Iulian Dragos, 2015-05-19; 6 files, -3/+15)

  Visibility rules for static members are different in Scala and Java, and this case requires an explicit static import. Even though these are Java files, they are run through scaladoc, which enforces Scala rules. Also reverted the commit that reverts the upgrade to 2.11.6.

  Author: Iulian Dragos <jaguarul@gmail.com>

  Closes #6260 from dragos/issue/scaladoc-false-error and squashes the following commits:

  f2e998e [Iulian Dragos] Revert "[HOTFIX] Revert "[SPARK-7092] Update spark scala version to 2.11.6""
  0bad052 [Iulian Dragos] Fix scaladoc faux-error.

  (cherry picked from commit 3c4c1f96474b3e66fa1d44ac0177f548cf5a3a10)
  Signed-off-by: Patrick Wendell <patrick@databricks.com>
* [SPARK-7678] [ML] Fix default random seed in HasSeed (Joseph K. Bradley, 2015-05-19; 6 files, -12/+14)

  Changed shared param HasSeed to have a default based on the hashCode of the class name, instead of a random number. Also removed fixed random seeds from Word2Vec and ALS. CC: mengxr

  Author: Joseph K. Bradley <joseph@databricks.com>

  Closes #6251 from jkbradley/scala-fixed-seed and squashes the following commits:

  0e37184 [Joseph K. Bradley] Fixed Word2VecSuite, ALSSuite in spark.ml to use original fixed random seeds
  678ec3a [Joseph K. Bradley] Removed fixed random seeds from Word2Vec and ALS. Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.

  (cherry picked from commit 7b16e9f2118fbfbb1c0ba957161fe500c9aff82a)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
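The gist of the new default, sketched (the trait below is illustrative, not the actual spark.ml shared-params code):

```scala
// Sketch: a deterministic per-class default seed derived from the class name,
// reproducible across runs without hard-coding a seed into each algorithm.
trait HasSeed {
  def defaultSeed: Long = this.getClass.getName.hashCode.toLong
}
```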
* [SPARK-7047] [ML] ml.Model optional parent support (Joseph K. Bradley, 2015-05-19; 3 files, -1/+7)

  Made Model.parent transient. Added Model.hasParent to test for null parent. CC: mengxr

  Author: Joseph K. Bradley <joseph@databricks.com>

  Closes #5914 from jkbradley/parent-optional and squashes the following commits:

  d501774 [Joseph K. Bradley] Made Model.parent transient. Added Model.hasParent to test for null parent

  (cherry picked from commit fb90273212dc7241c9a0c3446e25e0e0b9377750)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
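Shape of the change, sketched with simplified types (the real Model is generic over its concrete subtype; this is not the spark.ml source):

```scala
// Sketch only: parent is transient, so it is not serialized with the model,
// and hasParent tests whether one was ever set.
abstract class Model[P <: AnyRef] extends Serializable {
  @transient var parent: P = _
  def hasParent: Boolean = parent != null
}
```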
* [SPARK-7704] Updating Programming Guides per SPARK-4397 (Dice, 2015-05-19; 1 file, -6/+5)

  The change in SPARK-4397 lets the compiler find the implicit objects in SparkContext automatically, so we no longer need to import o.a.s.SparkContext._ explicitly and can remove the statements about "implicit conversions" from the latest Programming Guides (1.3.0 and higher).

  Author: Dice <poleon.kd@gmail.com>

  Closes #6234 from daisukebe/patch-1 and squashes the following commits:

  b77ecd9 [Dice] fix a typo
  45dfcd3 [Dice] rewording per Sean's advice
  a094bcf [Dice] Adding a note for users on any previous releases
  a29be5f [Dice] Updating Programming Guides per SPARK-4397

  (cherry picked from commit 32fa611b19c6b95d4563be631c5a8ff0cdf3438f)
  Signed-off-by: Sean Owen <sowen@cloudera.com>
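For example, since 1.3 the following compiles without the old import (a snippet for illustration, assuming a SparkContext named `sc`):

```scala
// No `import org.apache.spark.SparkContext._` needed: the pair-RDD implicit
// conversions are now found automatically.
val counts = sc.textFile("README.md")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
```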
* [SPARK-7681] [MLLIB] remove mima excludes for 1.3 (Xiangrui Meng, 2015-05-19; 1 file, -8/+1)

  These excludes are unnecessary for 1.3 because the changes were made in 1.4.x.

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #6254 from mengxr/SPARK-7681-mima and squashes the following commits:

  7f0cea0 [Xiangrui Meng] remove mima excludes for 1.3

  (cherry picked from commit 6845cb2ff475fd794b30b01af5ebc80714b880f0)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* Preparing development version 1.4.1-SNAPSHOT (Patrick Wendell, 2015-05-19; 30 files, -30/+30)

* Preparing Spark release v1.4.0-rc1 (Patrick Wendell, 2015-05-19; 30 files, -30/+30)

* CHANGES.txt updates (Patrick Wendell, 2015-05-19; 1 file, -0/+35)
* [SPARK-7723] Fix string interpolation in pipeline examples (Saleem Ansari, 2015-05-19; 1 file, -2/+2)

  https://issues.apache.org/jira/browse/SPARK-7723

  Author: Saleem Ansari <tuxdna@gmail.com>

  Closes #6258 from tuxdna/master and squashes the following commits:

  2bb5a42 [Saleem Ansari] Merge branch 'master' into mllib-pipeline
  e39db9c [Saleem Ansari] Fix string interpolation in pipeline examples

  (cherry picked from commit df34793ad4e76214fc4c0a22af1eb89b171a32e4)
  Signed-off-by: Sean Owen <sowen@cloudera.com>
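The class of bug being fixed, illustrated with made-up values:

```scala
// Without the leading `s`, Scala prints the $-placeholders literally.
val id = 4L
val text = "spark i j k"
println("($id, $text)")   // prints: ($id, $text)
println(s"($id, $text)")  // prints: (4, spark i j k)
```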
* [HOTFIX] Revert "[SPARK-7092] Update spark scala version to 2.11.6" (Patrick Wendell, 2015-05-19; 2 files, -3/+3)

  This reverts commit a11c8683c76c67f45749a1b50a0912a731fd2487. For more information see: https://issues.apache.org/jira/browse/SPARK-7726
* Revert "Preparing Spark release v1.4.0-rc1"Patrick Wendell2015-05-1930-30/+30
| | | | This reverts commit 79fb01a3be07b5086134a6fe103248e9a33a9500.
* Revert "Preparing development version 1.4.1-SNAPSHOT"Patrick Wendell2015-05-1930-30/+30
| | | | This reverts commit a1d896b85bd3fb88284f8b6758d7e5f0a1bb9eb3.
* Fixing a few basic typos in the Programming Guide. (Mike Dusenberry, 2015-05-19; 1 file, -3/+3)

  Just a few minor fixes in the guide, so a new JIRA issue was not created per the guidelines.

  Author: Mike Dusenberry <dusenberrymw@gmail.com>

  Closes #6240 from dusenberrymw/Fix_Programming_Guide_Typos and squashes the following commits:

  ffa76eb [Mike Dusenberry] Fixing a few basic typos in the Programming Guide.

  (cherry picked from commit 61f164d3fdd1c8dcdba8c9d66df05ff4069aa6e6)
  Signed-off-by: Sean Owen <sowen@cloudera.com>
* Preparing development version 1.4.1-SNAPSHOT (Patrick Wendell, 2015-05-19; 30 files, -30/+30)

* Preparing Spark release v1.4.0-rc1 (Patrick Wendell, 2015-05-19; 30 files, -30/+30)

* Updating CHANGES.txt for Spark 1.4 (Patrick Wendell, 2015-05-19; 1 file, -0/+70)

* Revert "Preparing Spark release v1.4.0-rc1" (Patrick Wendell, 2015-05-19; 30 files, -30/+30)

  This reverts commit 38ccef36c1551dc36d9444f47df11ae34c1e139e.

* Revert "Preparing development version 1.4.1-SNAPSHOT" (Patrick Wendell, 2015-05-19; 30 files, -30/+30)

  This reverts commit 40190ce22622cadd41f740a763fba061281c2966.
* [SPARK-7581] [ML] [DOC] User guide for spark.ml PolynomialExpansion (Xusen Yin, 2015-05-19; 2 files, -0/+174)

  JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581). CC jkbradley

  Author: Xusen Yin <yinxusen@gmail.com>

  Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits:

  1a7d80d [Xusen Yin] merge with master
  892a8e9 [Xusen Yin] fix python 3 compatibility
  ec935bf [Xusen Yin] small fix
  3e9fa1d [Xusen Yin] delete note
  69fcf85 [Xusen Yin] simplify and add python example
  81d21dc [Xusen Yin] add programming guide for Polynomial Expansion
  40babfb [Xusen Yin] add java test suite for PolynomialExpansion

  (cherry picked from commit 6008ec14ed6491d0a854bb50548c46f2f9709269)
  Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
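A short Scala example in the spirit of the new guide (the input vectors are assumptions):

```scala
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.mllib.linalg.Vectors

// Expand 2-dimensional feature vectors into a degree-3 polynomial space.
val data = Seq(Vectors.dense(-2.0, 2.3), Vectors.dense(0.6, -1.1)).map(Tuple1.apply)
val df = sqlContext.createDataFrame(data).toDF("features")

val polyExpansion = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("polyFeatures")
  .setDegree(3)
polyExpansion.transform(df).select("polyFeatures").show()
```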
* [HOTFIX] Fixing style failures in Kinesis source (Patrick Wendell, 2015-05-19; 2 files, -4/+6)
* Preparing development version 1.4.1-SNAPSHOT (Patrick Wendell, 2015-05-19; 30 files, -30/+30)

* Preparing Spark release v1.4.0-rc1 (Patrick Wendell, 2015-05-19; 30 files, -30/+30)

* Revert "Preparing Spark release v1.4.0-rc1" (Patrick Wendell, 2015-05-18; 30 files, -30/+30)

  This reverts commit e8e97e3a630dea3c68702e26bc56f61044b2db71.

* Revert "Preparing development version 1.4.1-SNAPSHOT" (Patrick Wendell, 2015-05-18; 30 files, -30/+30)

  This reverts commit 758ca74bab7c342f94442f69476c6b9543ac1228.
* [HOTFIX]: Java 6 Build Breaks (Patrick Wendell, 2015-05-19; 2 files, -15/+2)

  These were blocking RC1 so I fixed them manually.
* Preparing development version 1.4.1-SNAPSHOT (Patrick Wendell, 2015-05-19; 30 files, -30/+30)

* Preparing Spark release v1.4.0-rc1 (Patrick Wendell, 2015-05-19; 30 files, -30/+30)
* [SPARK-7687] [SQL] DataFrame.describe() should cast all aggregates to String (Josh Rosen, 2015-05-18; 3 files, -14/+19)

  In `DataFrame.describe()`, the `count` aggregate produces an integer, the `avg` and `stdev` aggregates produce doubles, and `min` and `max` aggregates can produce varying types depending on what type of column they're applied to. As a result, we should cast all aggregate results to String so that `describe()`'s output types match its declared output schema.

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #6218 from JoshRosen/SPARK-7687 and squashes the following commits:

  146b615 [Josh Rosen] Fix R test.
  2974bd5 [Josh Rosen] Cast to string type instead
  f206580 [Josh Rosen] Cast to double to fix SPARK-7687
  307ecbf [Josh Rosen] Add failing regression test for SPARK-7687

  (cherry picked from commit c9fa870a6de3f7d0903fa7a75ea5ffb6a2fcd174)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
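A hedged example of the resulting behavior (the DataFrame and column name are assumptions):

```scala
// After the fix, every cell in describe()'s output is string-typed,
// matching the declared schema regardless of the source column's type.
val stats = df.describe("age")
stats.printSchema()  // summary: string, age: string
stats.show()
```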
* CHANGES.txt and changelist updates for Spark 1.4. (Patrick Wendell, 2015-05-18; 2 files, -2/+14608)
* [SPARK-7150] SparkContext.range() and SQLContext.range() (Daoyuan Wang, 2015-05-18; 7 files, -0/+189)

  This PR is based on #6081, thanks adrian-wang. Closes #6081

  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Author: Davies Liu <davies@databricks.com>

  Closes #6230 from davies/range and squashes the following commits:

  d3ce5fe [Davies Liu] add tests
  789eda5 [Davies Liu] add range() in Python
  4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
  cbf5200 [Daoyuan Wang] let's add python support in a separate PR
  f45e3b2 [Daoyuan Wang] remove redundant toLong
  617da76 [Daoyuan Wang] fix safe marge for corner cases
  867c417 [Daoyuan Wang] fix
  13dbe84 [Daoyuan Wang] update
  bd998ba [Daoyuan Wang] update comments
  d3a0c1b [Daoyuan Wang] add range api()

  (cherry picked from commit c2437de1899e09894df4ec27adfaa7fac158fd3a)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
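A usage sketch of the two new entry points:

```scala
// SQLContext.range yields a single-column DataFrame of longs named "id";
// SparkContext.range yields an RDD[Long] over [start, end) with the given step.
val df = sqlContext.range(0, 1000)
val rdd = sc.range(0L, 1000L, step = 1, numSlices = 4)
```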
* Version updates for Spark 1.4.0 (Patrick Wendell, 2015-05-18; 3 files, -3/+4)
* [SPARK-7681] [MLLIB] Add SparseVector support for gemv (Liang-Chi Hsieh, 2015-05-18; 4 files, -33/+240)

  JIRA: https://issues.apache.org/jira/browse/SPARK-7681

  Author: Liang-Chi Hsieh <viirya@gmail.com>

  Closes #6209 from viirya/sparsevector_gemv and squashes the following commits:

  ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y.
  b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector.
  57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4.
  458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too.
  054f05d [Liang-Chi Hsieh] Fix scala style.
  410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized.
  4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix.
  5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv
  c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix.

  (cherry picked from commit d03638cc2d414cee9ac7481084672e454495dfc1)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
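A hedged sketch of the generalized multiply (values are illustrative):

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, Vectors}

// Matrix.multiply now accepts any Vector, so a SparseVector can feed gemv
// directly instead of being densified first.
val m = new DenseMatrix(2, 2, Array(1.0, 0.0, 0.0, 2.0))
val sv = Vectors.sparse(2, Array(0), Array(3.0))
val y = m.multiply(sv)  // DenseVector(3.0, 0.0)
```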
* [SPARK-7692] Updated Kinesis examples (Tathagata Das, 2015-05-18; 2 files, -237/+268)

  - Updated Kinesis examples to use stable API
  - Cleaned up comments, etc.
  - Renamed KinesisWordCountProducerASL to KinesisWordProducerASL

  Author: Tathagata Das <tathagata.das1565@gmail.com>

  Closes #6249 from tdas/kinesis-examples and squashes the following commits:

  7cc307b [Tathagata Das] More tweaks
  f080872 [Tathagata Das] More cleanup
  841987f [Tathagata Das] Small update
  011cbe2 [Tathagata Das] More fixes
  b0d74f9 [Tathagata Das] Updated examples.

  (cherry picked from commit 3a6003866ade45974b43a9e785ec35fb76a32b99)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners (jerluc, 2015-05-18; 2 files, -2/+2)

  PR per [SPARK-7621](https://issues.apache.org/jira/browse/SPARK-7621), which makes both `KafkaReceiver` and `ReliableKafkaReceiver` report their errors to the `ReceiverTracker`, which in turn will add the events to the bus to fire off any registered `StreamingListener`s.

  Author: jerluc <jeremyalucas@gmail.com>

  Closes #6204 from jerluc/master and squashes the following commits:

  82439a5 [jerluc] [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners

  (cherry picked from commit 0a7a94eab5fba3d2f2ef14a70c2c1bf4ee21b626)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
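For context, a sketch of a listener that now observes these receiver errors (the logging body is an assumption):

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerReceiverError}

// With this fix, Kafka receiver failures surface here instead of being dropped.
class ReceiverErrorLogger extends StreamingListener {
  override def onReceiverError(error: StreamingListenerReceiverError): Unit = {
    println(s"Receiver failed: ${error.receiverInfo.lastErrorMessage}")
  }
}
// ssc.addStreamingListener(new ReceiverErrorLogger)  // given a StreamingContext ssc
```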
* [SPARK-7624] Revert #4147 (Davies Liu, 2015-05-18; 1 file, -21/+2)

  Author: Davies Liu <davies@databricks.com>

  Closes #6172 from davies/revert_4147 and squashes the following commits:

  3bfbbde [Davies Liu] Revert #4147

  (cherry picked from commit 4fb52f9545ae338fae2d3aeea4bfc35d5df44853)
  Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [SQL] Fix serializability of ORC table scan (Michael Armbrust, 2015-05-18; 1 file, -1/+1)

  A follow-up to #6244.

  Author: Michael Armbrust <michael@databricks.com>

  Closes #6247 from marmbrus/fixOrcTests and squashes the following commits:

  e39ee1b [Michael Armbrust] [SQL] Fix serializability of ORC table scan

  (cherry picked from commit eb4632f282d070e1dfd5ffed968fa212896137da)
  Signed-off-by: Yin Huai <yhuai@databricks.com>
* [SPARK-7501] [STREAMING] DAG visualization: show DStream operations (Andrew Or, 2015-05-18; 14 files, -145/+484)

  This is similar to #5999, but for streaming. Roughly 200 lines are tests. One thing to note here is that we already do some kind of scoping thing for call sites, so this patch adds the new RDD operation scoping logic in the same place. Also, this patch adds a `try finally` block to set the relevant variables in a safer way. tdas zsxwing

  **Before**
  <img src="https://cloud.githubusercontent.com/assets/2133137/7625996/d88211b8-f9b4-11e4-90b9-e11baa52d6d7.png" width="450px"/>

  **After**
  <img src="https://cloud.githubusercontent.com/assets/2133137/7625997/e0878f8c-f9b4-11e4-8df3-7dd611b13c87.png" width="650px"/>

  Author: Andrew Or <andrew@databricks.com>

  Closes #6034 from andrewor14/dag-viz-streaming and squashes the following commits:

  932a64a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
  e685df9 [Andrew Or] Rename createRDDWith
  84d0656 [Andrew Or] Review feedback
  697c086 [Andrew Or] Fix tests
  53b9936 [Andrew Or] Set scopes for foreachRDD properly
  1881802 [Andrew Or] Refactor DStream scope names again
  af4ba8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
  fd07d22 [Andrew Or] Make MQTT lower case
  f6de871 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
  0ca1801 [Andrew Or] Remove a few unnecessary withScopes on aliases
  fa4e5fb [Andrew Or] Pass in input stream name rather than defining it from within
  1af0b0e [Andrew Or] Fix style
  074c00b [Andrew Or] Review comments
  d25a324 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
  e4a93ac [Andrew Or] Fix tests?
  25416dc [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
  9113183 [Andrew Or] Add tests for DStream scopes
  b3806ab [Andrew Or] Fix test
  bb80bbb [Andrew Or] Fix MIMA?
  5c30360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
  5703939 [Andrew Or] Rename operations that create InputDStreams
  7c4513d [Andrew Or] Group RDDs by DStream operations and batches
  bf0ab6e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
  05c2676 [Andrew Or] Wrap many more methods in withScope
  c121047 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
  65ef3e9 [Andrew Or] Fix NPE
  a0d3263 [Andrew Or] Scope streaming operations instead of RDD operations

  (cherry picked from commit b93c97d79b42a06b48d2a8d98beccc636442541e)
  Signed-off-by: Andrew Or <andrew@databricks.com>
* [HOTFIX] Fix ORC build break (Michael Armbrust, 2015-05-18; 1 file, -5/+6)

  Fix break caused by merging #6225 and #6194.

  Author: Michael Armbrust <michael@databricks.com>

  Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:

  b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break

  (cherry picked from commit fcf90b75ccf222bd2f1939addc3f8f052d2bd3ff)
  Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-7658] [STREAMING] [WEBUI] Update the mouse behaviors for the timeline graphs (zsxwing, 2015-05-18; 3 files, -2/+47)

  1. If the user clicks one point of a batch, scroll down to the corresponding batch row and highlight it, and recover the batch row after 3 seconds if necessary.
  2. Add "#batches" in the histogram graphs.

  ![screen shot 2015-05-14 at 7 36 19 pm](https://cloud.githubusercontent.com/assets/1000778/7646108/84f4a014-fa73-11e4-8c13-1903d267e60f.png)
  ![screen shot 2015-05-14 at 7 36 53 pm](https://cloud.githubusercontent.com/assets/1000778/7646109/8b11154a-fa73-11e4-820b-8ece9fa6ee3e.png)
  ![screen shot 2015-05-14 at 7 36 34 pm](https://cloud.githubusercontent.com/assets/1000778/7646111/93828272-fa73-11e4-89f8-580670144d3c.png)

  Author: zsxwing <zsxwing@gmail.com>

  Closes #6168 from zsxwing/SPARK-7658 and squashes the following commits:

  c242b00 [zsxwing] Change 5 seconds to 3 seconds
  31fd0aa [zsxwing] Remove the mouseover highlight feature
  06c6f6f [zsxwing] Merge branch 'master' into SPARK-7658
  2eaff06 [zsxwing] Merge branch 'master' into SPARK-7658
  108d56c [zsxwing] Update the mouse behaviors for the timeline graphs

  (cherry picked from commit 0b6f503d5337a8387c37cc2c8e544f67c68f7dad)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-6216] [PYSPARK] check python version of worker with driver (Davies Liu, 2015-05-18; 10 files, -14/+26)

  This PR reverts #5404 and instead passes the driver's Python version into the JVM so it can be checked in the worker before deserializing the closure; this way PySpark works across different major versions of Python.

  Author: Davies Liu <davies@databricks.com>

  Closes #6203 from davies/py_version and squashes the following commits:

  b8fb76e [Davies Liu] fix test
  6ce5096 [Davies Liu] use string for version
  47c6278 [Davies Liu] check python version of worker with driver

  (cherry picked from commit 32fbd297dd651ba3ce4ce52aeb0488233149cdf9)
  Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [SPARK-7673] [SQL] WIP: HadoopFsRelation and ParquetRelation2 performance optimizations (Cheng Lian, 2015-05-18; 4 files, -91/+117)

  This PR introduces several performance optimizations to `HadoopFsRelation` and `ParquetRelation2`:

  1. Moving `FileStatus` listing from `DataSourceStrategy` into a cache within `HadoopFsRelation`. This new cache generalizes and replaces the one used in `ParquetRelation2`. This also introduces an interface change: to reuse cached `FileStatus` objects, `HadoopFsRelation.buildScan` methods now receive `Array[FileStatus]` instead of `Array[String]`.
  2. When Parquet task side metadata reading is enabled, skip reading row group information when reading Parquet footers. This is basically what PR #5334 does. Also, now we use `ParquetFileReader.readAllFootersInParallel` to read footers in parallel.

  Another optimization in question: instead of asking `HadoopFsRelation.buildScan` to return an `RDD[Row]` for a single selected partition and then unioning them all, we ask it to return an `RDD[Row]` for all selected partitions. This optimization is based on the fact that Hadoop configuration broadcasting used in `NewHadoopRDD` takes 34% of the time in the following microbenchmark. However, this complicates data source user code because user code must merge partition values manually. To check the cost of broadcasting in `NewHadoopRDD`, I also ran the microbenchmark after removing the `broadcast` call in `NewHadoopRDD`. All results are shown below.

  ### Microbenchmark

  #### Preparation code

  Generating a partitioned table with 50k partitions, 1k rows per partition:

  ```scala
  import sqlContext._
  import sqlContext.implicits._

  for (n <- 0 until 500) {
    val data = for {
      p <- (n * 10) until ((n + 1) * 10)
      i <- 0 until 1000
    } yield (i, f"val_$i%04d", f"$p%04d")

    data.
      toDF("a", "b", "p").
      write.
      partitionBy("p").
      mode("append").
      parquet(path)
  }
  ```

  #### Benchmarking code

  ```scala
  import sqlContext._
  import sqlContext.implicits._
  import org.apache.spark.sql.types._
  import com.google.common.base.Stopwatch

  val path = "hdfs://localhost:9000/user/lian/5k"

  def benchmark(n: Int)(f: => Unit) {
    val stopwatch = new Stopwatch()

    def run() = {
      stopwatch.reset()
      stopwatch.start()
      f
      stopwatch.stop()
      stopwatch.elapsedMillis()
    }

    val records = (0 until n).map(_ => run())
    (0 until n).foreach(i => println(s"Round $i: ${records(i)} ms"))
    println(s"Average: ${records.sum / n.toDouble} ms")
  }

  benchmark(3) { read.parquet(path).explain(extended = true) }
  ```

  #### Results

  Before:

  ```
  Round 0: 72528 ms
  Round 1: 68938 ms
  Round 2: 65372 ms
  Average: 68946.0 ms
  ```

  After:

  ```
  Round 0: 59499 ms
  Round 1: 53645 ms
  Round 2: 53844 ms
  Round 3: 49093 ms
  Round 4: 50555 ms
  Average: 53327.2 ms
  ```

  Also removing Hadoop configuration broadcasting (note that I was testing on a local laptop, thus network cost is pretty low):

  ```
  Round 0: 15806 ms
  Round 1: 14394 ms
  Round 2: 14699 ms
  Round 3: 15334 ms
  Round 4: 14123 ms
  Average: 14871.2 ms
  ```

  Author: Cheng Lian <lian@databricks.com>

  Closes #6225 from liancheng/spark-7673 and squashes the following commits:

  2d58a2b [Cheng Lian] Skips reading row group information when using task side metadata reading
  7aa3748 [Cheng Lian] Optimizes FileStatusCache by introducing a map from parent directories to child files
  ba41250 [Cheng Lian] Reuses HadoopFsRelation FileStatusCache in ParquetRelation2
  3d278f7 [Cheng Lian] Fixes a bug when reading a single Parquet data file
  b84612a [Cheng Lian] Fixes Scala style issue
  6a08b02 [Cheng Lian] WIP: Moves file status cache into HadoopFSRelation

  (cherry picked from commit 9dadf019b93038e1e18336ccd06c5eecb4bae32f)
  Signed-off-by: Yin Huai <yhuai@databricks.com>
* [SPARK-7567] [SQL] [follow-up] Use a new flag to set output committer based on mapreduce apis (Yin Huai, 2015-05-18; 4 files, -9/+29)

  cc liancheng marmbrus

  Author: Yin Huai <yhuai@databricks.com>

  Closes #6130 from yhuai/directOutput and squashes the following commits:

  312b07d [Yin Huai] A data source can use spark.sql.sources.outputCommitterClass to override the output committer.

  (cherry picked from commit 530397ba2f5c0fcabb86ba73048c95177ed0b9fc)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
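Setting the new flag, sketched (the committer named here is just Hadoop's default, to show the shape; it is not from the commit):

```scala
// Data sources writing through the mapreduce API pick up this committer class.
sqlContext.setConf(
  "spark.sql.sources.outputCommitterClass",
  "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")
```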
* [SPARK-7269] [SQL] Incorrect analysis for aggregation (use semanticEquals) (Wenchen Fan, 2015-05-18; 6 files, -26/+48)

  A modified version of https://github.com/apache/spark/pull/6110, using `semanticEquals` to make it more efficient.

  Author: Wenchen Fan <cloud0fan@outlook.com>

  Closes #6173 from cloud-fan/7269 and squashes the following commits:

  e4a3cc7 [Wenchen Fan] address comments
  cc02045 [Wenchen Fan] consider elements length equal
  d7ff8f4 [Wenchen Fan] fix 7269

  (cherry picked from commit 103c863c2ef3d9e6186cfc7d95251a9515e9f180)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-7631] [SQL] treenode argString should not print children (scwf, 2015-05-18; 1 file, -0/+1)

  ```
  spark-sql> explain extended
           > select * from (
           > select key from src union all
           > select key from src) t;
  ```

  Currently the spark plan prints children in argString:

  ```
  == Physical Plan ==
  Union[ HiveTableScan key#1, (MetastoreRelation default, src, None), None, HiveTableScan key#3, (MetastoreRelation default, src, None), None]
   HiveTableScan key#1, (MetastoreRelation default, src, None), None
   HiveTableScan key#3, (MetastoreRelation default, src, None), None
  ```

  After this patch:

  ```
  == Physical Plan ==
  Union
   HiveTableScan [key#1], (MetastoreRelation default, src, None), None
   HiveTableScan [key#3], (MetastoreRelation default, src, None), None
  ```

  I have tested this locally.

  Author: scwf <wangfei1@huawei.com>

  Closes #6144 from scwf/fix-argString and squashes the following commits:

  1a642e0 [scwf] fix treenode argString

  (cherry picked from commit fc2480ed13742a99470b5012ca3a75ab91e5a5e5)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2883] [SQL] ORC data source for Spark SQL (Zhan Zhang, 2015-05-18; 14 files, -76/+1477)

  This PR updates PR #6135 authored by zhzhan from Hortonworks.

  This PR implements a Spark SQL data source for accessing ORC files.

  > NOTE: Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under the `org.apache.spark.sql.hive` package, and must be used with `HiveContext`. However, it doesn't require an existing Hive installation to access ORC files.

  1. Saving/loading ORC files without contacting the Hive metastore
  2. Support for complex data types (i.e. array, map, and struct)
  3. Aware of common optimizations provided by Spark SQL:
     - Column pruning
     - Partitioning pruning
     - Filter push-down
  4. Schema evolution support
  5. Hive metastore table conversion

  This PR also includes initial work done by scwf from Huawei (PR #3753).

  Author: Zhan Zhang <zhazhan@gmail.com>
  Author: Cheng Lian <lian@databricks.com>

  Closes #6194 from liancheng/polishing-orc and squashes the following commits:

  55ecd96 [Cheng Lian] Reorganizes ORC test suites
  d4afeed [Cheng Lian] Addresses comments
  21ada22 [Cheng Lian] Adds @since and @Experimental annotations
  128bd3b [Cheng Lian] ORC filter bug fix
  d734496 [Cheng Lian] Polishes the ORC data source
  2650a42 [Zhan Zhang] resolve review comments
  3c9038e [Zhan Zhang] resolve review comments
  7b3c7c5 [Zhan Zhang] save mode fix
  f95abfd [Zhan Zhang] reuse test suite
  7cc2c64 [Zhan Zhang] predicate fix
  4e61c16 [Zhan Zhang] minor change
  305418c [Zhan Zhang] orc data source support

  (cherry picked from commit aa31e431fc09f0477f1c2351c6275769a31aca90)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
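A hedged usage sketch of the new data source (paths are placeholders; per the note above it must be used with HiveContext):

```scala
import org.apache.spark.sql.hive.HiveContext

// Read and write ORC files without contacting a Hive metastore.
val hiveContext = new HiveContext(sc)
val df = hiveContext.read.format("orc").load("/data/input.orc")
df.write.format("orc").save("/data/output.orc")
```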