Commit log (most recent first). Format of each entry: commit message (author, date; files changed, lines -/+), followed by the full commit description.
* [MINOR] Add 1.3, 1.3.1 to master branch EC2 scripts (Shivaram Venkataraman, 2015-05-17; 1 file changed, -1/+5)
  cc pwendell. P.S.: I can't believe this was outdated all along? Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #6215 from shivaram/update-ec2-map and squashes the following commits: ae3937a [Shivaram Venkataraman] Add 1.3, 1.3.1 to master branch EC2 scripts
* [MINOR] [SQL] Removes an unreachable case clause (Cheng Lian, 2015-05-16; 1 file changed, -1/+0)
  This case clause is already covered by the one above, and generates a compilation warning. Author: Cheng Lian <lian@databricks.com> Closes #6214 from liancheng/remove-unreachable-code and squashes the following commits: c38ca7c [Cheng Lian] Removes an unreachable case clause
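For readers who want to see the warning in isolation: a minimal, self-contained sketch (not the Spark code in question) of a case clause shadowed by the one above it, which scalac reports as unreachable.

```scala
object UnreachableCaseSketch {
  def describe(x: Any): String = x match {
    case s: String => s"string of length ${s.length}"
    case "hello"   => "greeting" // unreachable: every String is already caught above
    case _         => "something else"
  }

  def main(args: Array[String]): Unit =
    println(describe("hello")) // prints "string of length 5", never "greeting"
}
```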
* [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface (Reynold Xin, 2015-05-16; 20 files changed, -743/+729)
  Also moved all the deprecated functions into one place for SQLContext and DataFrame, and updated tests to use the new API. Author: Reynold Xin <rxin@databricks.com> Closes #6210 from rxin/df-writer-reader-jdbc and squashes the following commits: 7465c2c [Reynold Xin] Fixed unit test. 118e609 [Reynold Xin] Updated tests. 3441b57 [Reynold Xin] Updated javadoc. 13cdd1c [Reynold Xin] [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface.
* [SPARK-7655][Core] Deserializing value should not hold the TaskSchedulerImpl lock (zsxwing, 2015-05-16; 3 files changed, -2/+31)
  We should not call `DirectTaskResult.value` when holding the `TaskSchedulerImpl` lock. It may cost dozens of seconds to deserialize a large object. Author: zsxwing <zsxwing@gmail.com> Closes #6195 from zsxwing/SPARK-7655 and squashes the following commits: 21f502e [zsxwing] Add more comments e25fa88 [zsxwing] Add comments 15010b5 [zsxwing] Deserialize value should not hold the TaskSchedulerImpl lock
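A hedged sketch of the pattern the fix moves to; the lock object and handler below are simplified stand-ins for `TaskSchedulerImpl` and the task-result handling code, not the actual Spark classes.

```scala
object DeserializeOutsideLockSketch {
  private val schedulerLock = new Object

  // Stand-in for DirectTaskResult.value(): potentially seconds of work
  // for a large serialized object.
  def deserialize(bytes: Array[Byte]): String = new String(bytes, "UTF-8")

  def handleTaskResult(bytes: Array[Byte]): Unit = {
    // Do the expensive deserialization with no lock held...
    val value = deserialize(bytes)
    // ...then take the lock only for the quick bookkeeping.
    schedulerLock.synchronized {
      println(s"recorded result: $value")
    }
  }

  def main(args: Array[String]): Unit =
    handleTaskResult("hello".getBytes("UTF-8"))
}
```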
* [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API (Reynold Xin, 2015-05-16; 13 files changed, -16/+16)
  Author: Reynold Xin <rxin@databricks.com> Closes #6211 from rxin/mllib-reader and squashes the following commits: 79a2cb9 [Reynold Xin] [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API.
* [BUILD] Update jblas dependency version to 1.2.4 (Matthew Brandyberry, 2015-05-16; 2 files changed, -2/+2)
  jblas 1.2.4 includes native library support for PPC64LE. Author: Matthew Brandyberry <mbrandy@us.ibm.com> Closes #6199 from mtbrandy/jblas-1.2.4 and squashes the following commits: 9df9301 [Matthew Brandyberry] [BUILD] update jblas dependency version to 1.2.4
* [HOTFIX] [SQL] Fixes DataFrameWriter.mode(String) (Cheng Lian, 2015-05-16; 2 files changed, -1/+8)
  We forgot an assignment there. /cc rxin Author: Cheng Lian <lian@databricks.com> Closes #6212 from liancheng/fix-df-writer and squashes the following commits: 711fbb0 [Cheng Lian] Adds a test case 3b72d78 [Cheng Lian] Fixes DataFrameWriter.mode(String)
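A toy illustration of the failure mode (made-up names, not the actual DataFrameWriter code): a fluent setter that computes the new mode but never stores it.

```scala
object ForgottenAssignmentSketch {
  class Writer {
    private var mode: String = "error"

    // Buggy: the normalized mode is computed and then discarded.
    def modeBuggy(saveMode: String): Writer = {
      saveMode.toLowerCase // result never assigned to `mode`
      this
    }

    // Fixed: assign before returning the builder.
    def modeFixed(saveMode: String): Writer = {
      this.mode = saveMode.toLowerCase
      this
    }

    def currentMode: String = mode
  }

  def main(args: Array[String]): Unit = {
    println(new Writer().modeBuggy("APPEND").currentMode) // still "error"
    println(new Writer().modeFixed("APPEND").currentMode) // "append"
  }
}
```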
* [SPARK-7655][Core][SQL] Remove 'scala.concurrent.ExecutionContext.Implicits.global' in 'ask' and 'BroadcastHashJoin' (zsxwing, 2015-05-16; 4 files changed, -6/+48)
  Both `AkkaRpcEndpointRef.ask` and `BroadcastHashJoin` use `scala.concurrent.ExecutionContext.Implicits.global`. However, the tasks in `BroadcastHashJoin` are usually long-running and can occupy all threads in `global`, so `ask` never gets a chance to process its replies. The tasks behind `ask` are very simple, so we can use `MoreExecutors.sameThreadExecutor()` for them. For `BroadcastHashJoin`, it's better to use `ThreadUtils.newDaemonCachedThreadPool`. Author: zsxwing <zsxwing@gmail.com> Closes #6200 from zsxwing/SPARK-7655-2 and squashes the following commits: cfdc605 [zsxwing] Remove redundant import and minor doc fix cf83153 [zsxwing] Add "sameThread" and "newDaemonCachedThreadPool with maxThreadNumber" to ThreadUtils 08ad0ee [zsxwing] Remove 'scala.concurrent.ExecutionContext.Implicits.global' in 'ask' and 'BroadcastHashJoin'
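A sketch of the idea behind the fix, using plain `java.util.concurrent` rather than Spark's `ThreadUtils`: long-running work gets its own daemon cached thread pool so it cannot starve the global pool that short callbacks depend on.

```scala
import java.util.concurrent.{Executors, ThreadFactory}

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object DedicatedPoolSketch {
  private val factory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "broadcast-hash-join") // thread name is illustrative
      t.setDaemon(true)
      t
    }
  }

  // Long-running tasks run here instead of ExecutionContext.Implicits.global.
  private val pool = Executors.newCachedThreadPool(factory)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

  def main(args: Array[String]): Unit = {
    val longRunning = Future { Thread.sleep(100); 42 }
    println(Await.result(longRunning, Duration.Inf))
    pool.shutdown()
  }
}
```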
* [SPARK-7672] [CORE] Use int conversion in translating kryoserializer.buffer.mb to kryoserializer.buffer (Nishkam Ravi, 2015-05-16; 2 files changed, -1/+4)
  In translating spark.kryoserializer.buffer.mb to spark.kryoserializer.buffer, use of toDouble leads to a "Fractional values not supported" error even when spark.kryoserializer.buffer.mb is an integer. ilganeli, andrewor14 Author: Nishkam Ravi <nravi@cloudera.com> Author: nishkamravi2 <nishkamravi@gmail.com> Author: nravi <nravi@c1704.halxg.cloudera.com> Closes #6198 from nishkamravi2/master_nravi and squashes the following commits: 171a53c [nishkamravi2] Update SparkConfSuite.scala 5261bf6 [Nishkam Ravi] Add a test for deprecated config spark.kryoserializer.buffer.mb 5190f79 [Nishkam Ravi] In translating from deprecated spark.kryoserializer.buffer.mb to spark.kryoserializer.buffer use int conversion since fractions are not permissible 059ce82 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi eaa13b5 [nishkamravi2] Update Client.scala 981afd2 [Nishkam Ravi] Check for read permission before initiating copy 1b81383 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi 0f1abd0 [nishkamravi2] Update Utils.scala 474e3bf [nishkamravi2] Update DiskBlockManager.scala 97c383e [nishkamravi2] Update Utils.scala 8691e0c [Nishkam Ravi] Add a try/catch block around Utils.removeShutdownHook 2be1e76 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi 1c13b79 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi bad4349 [nishkamravi2] Update Main.java 36a6f87 [Nishkam Ravi] Minor changes and bug fixes b7f4ae7 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi 4a45d6a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi 458af39 [Nishkam Ravi] Locate the jar using getLocation, obviates the need to pass assembly path as an argument d9658d6 [Nishkam Ravi] Changes for SPARK-6406 ccdc334 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi 3faa7a4 [Nishkam Ravi] Launcher library changes (SPARK-6406) 345206a [Nishkam Ravi] spark-class merge Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi ac58975 [Nishkam Ravi] spark-class changes 06bfeb0 [nishkamravi2] Update spark-class 35af990 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi 32c3ab3 [nishkamravi2] Update AbstractCommandBuilder.java 4bd4489 [nishkamravi2] Update AbstractCommandBuilder.java 746f35b [Nishkam Ravi] "hadoop" string in the assembly name should not be mandatory (everywhere else in spark we mandate spark-assembly*hadoop*.jar) bfe96e0 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi ee902fa [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi d453197 [nishkamravi2] Update NewHadoopRDD.scala 6f41a1d [nishkamravi2] Update NewHadoopRDD.scala 0ce2c32 [nishkamravi2] Update HadoopRDD.scala f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi 494d8c0 [nishkamravi2] Update DiskBlockManager.scala 3c5ddba [nishkamravi2] Update DiskBlockManager.scala f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop 79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi 5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala 535295a [nishkamravi2] Update TaskSetManager.scala 3e1b616 [Nishkam Ravi] Modify test for maxResultSize 9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0) 5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi 636a9ff [nishkamravi2] Update YarnAllocator.scala 8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead 35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead 5ac2ec1 [Nishkam Ravi] Remove out dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue 42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue 362da5e [Nishkam Ravi] Additional changes for yarn memory overhead c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead 1cf2d1e [nishkamravi2] Update YarnAllocator.scala ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts) 2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark 2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark 3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark 5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456) 6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed) 5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456) 681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
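A minimal sketch of the translation step, with the helper names made up; the point is only that appending a byte-size suffix to a whole number of megabytes must not go through `toDouble`.

```scala
object DeprecatedConfigTranslationSketch {
  // Buggy: "64" becomes "64.0m", which a strict byte-string parser
  // rejects as a fractional value.
  def translateBuggy(mb: String): String = s"${mb.toDouble}m"

  // Fixed: int conversion keeps an integral value integral: "64m".
  def translateFixed(mb: String): String = s"${mb.toInt}m"

  def main(args: Array[String]): Unit = {
    println(translateBuggy("64")) // 64.0m
    println(translateFixed("64")) // 64m
  }
}
```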
* [SPARK-4556] [BUILD] Binary distribution assembly can't run in local mode (Sean Owen, 2015-05-16; 1 file changed, -0/+10)
  Add a note on building a runnable distribution with make-distribution.sh. Author: Sean Owen <sowen@cloudera.com> Closes #6186 from srowen/SPARK-4556 and squashes the following commits: 4002966 [Sean Owen] Add pointer to --help flag 9fa7883 [Sean Owen] Add note on building a runnable distribution with make-distribution.sh
* [SPARK-7671] Fix wrong URLs in MLlib Data Types documentation (FavioVazquez, 2015-05-16; 1 file changed, -4/+4)
  In the MLlib Data Types documentation, the URL for Matrices in the local matrix Scala section points to https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices, which is wrong, since Matrices is an object that implements factory methods for Matrix and does not have a companion class. The correct link is https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$. There are similar mistakes in the Local Vector section for Scala, Java and Python. In the Scala section, the URL for Vectors points to the trait Vector (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vector) rather than to the factory methods implemented in Vectors; the correct link is https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$. In the Java section, the URL for Vectors points to the interface Vector (https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vector.html) rather than to the class Vectors; the correct link is https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vectors.html. In the Python section, the URL for Vectors points to the class Vector (https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vector) rather than to the class Vectors; the correct link is https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors. Author: FavioVazquez <favio.vazquezp@gmail.com> Closes #6196 from FavioVazquez/fix-typo-matrices-mllib-datatypes and squashes the following commits: 3e9efd5 [FavioVazquez] - Fixed wrong URLs in the MLlib Data Types Documentation 9af7074 [FavioVazquez] Merge remote-tracking branch 'upstream/master' edab1ef [FavioVazquez] Merge remote-tracking branch 'upstream/master' b2e2f8c [FavioVazquez] Merge remote-tracking branch 'upstream/master'
* [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API (Reynold Xin, 2015-05-15; 24 files changed, -541/+772)
  This patch introduces DataFrameWriter and DataFrameReader. The DataFrameReader interface, accessible through SQLContext.read, contains methods that create DataFrames. These methods used to reside in SQLContext. Example usage:
  ```scala
  sqlContext.read.json("...")
  sqlContext.read.parquet("...")
  ```
  The DataFrameWriter interface, accessible through DataFrame.write, implements a builder pattern to avoid the proliferation of options in writing a DataFrame out. It currently implements:
  - mode
  - format (e.g. "parquet", "json")
  - options (generic options passed down into data sources)
  - partitionBy (partitioning columns)
  Example usage:
  ```scala
  df.write.mode("append").format("json").partitionBy("date").saveAsTable("myJsonTable")
  ```
  TODO:
  - [ ] Documentation update
  - [ ] Move JDBC into reader / writer?
  - [ ] Deprecate the old interfaces
  - [ ] Move the generic load interface into reader.
  - [ ] Update example code and documentation
  Author: Reynold Xin <rxin@databricks.com> Closes #6175 from rxin/reader-writer and squashes the following commits: b146c95 [Reynold Xin] Deprecation of old APIs. bd8abdf [Reynold Xin] Fixed merge conflict. 26abea2 [Reynold Xin] Added general load methods. 244fbec [Reynold Xin] Added equivalent to example. 4f15d92 [Reynold Xin] Added documentation for partitionBy. 7e91611 [Reynold Xin] [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API.
* [SPARK-7473] [MLLIB] Add reservoir sample in RandomForest (AiHe, 2015-05-15; 2 files changed, -4/+3)
  Reservoir feature sampling, implemented by using an existing API. Author: AiHe <ai.he@ussuning.com> Closes #5988 from AiHe/reservoir and squashes the following commits: e7a41ac [AiHe] remove non-robust testing case 28ffb9a [AiHe] set seed as rng.nextLong 37459e1 [AiHe] set fixed seed 1e98a4c [AiHe] [MLLIB][tree] Add reservoir sample in RandomForest
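For context, textbook reservoir sampling (Algorithm R): a uniform sample of k items from a stream of unknown length in one pass. This is a sketch of the technique the commit reuses, not the MLlib `SamplingUtils` implementation itself.

```scala
import scala.reflect.ClassTag
import scala.util.Random

object ReservoirSampleSketch {
  def reservoirSample[T: ClassTag](input: Iterator[T], k: Int, rng: Random): Array[T] = {
    val reservoir = new Array[T](k)
    var i = 0
    while (input.hasNext) {
      val item = input.next()
      if (i < k) {
        reservoir(i) = item // fill the reservoir with the first k items
      } else {
        val j = rng.nextInt(i + 1)
        if (j < k) reservoir(j) = item // keep the item with probability k/(i+1)
      }
      i += 1
    }
    if (i < k) reservoir.take(i) else reservoir
  }

  def main(args: Array[String]): Unit =
    println(reservoirSample((1 to 100).iterator, 5, new Random(42)).mkString(", "))
}
```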
* [SPARK-7543] [SQL] [PySpark] Split dataframe.py into multiple files (Davies Liu, 2015-05-15; 6 files changed, -449/+552)
  dataframe.py is split into column.py, group.py and dataframe.py (line counts below):
  ```
   360 column.py
  1223 dataframe.py
   183 group.py
  ```
  Author: Davies Liu <davies@databricks.com> Closes #6201 from davies/split_df and squashes the following commits: fc8f5ab [Davies Liu] split dataframe.py into multiple files
* [SPARK-7073] [SQL] [PySpark] Clean up SQL data type hierarchy in Python (Davies Liu, 2015-05-15; 1 file changed, -30/+46)
  Author: Davies Liu <davies@databricks.com> Closes #6206 from davies/sql_type and squashes the following commits: 33d6860 [Davies Liu] [SPARK-7073] [SQL] [PySpark] Clean up SQL data type hierarchy in Python
* [SPARK-7575] [ML] [DOC] Example code for OneVsRest (Ram Sriharsha, 2015-05-15; 2 files changed, -0/+421)
  Java and Scala examples for OneVsRest. Fixes the base classifier to be Logistic Regression and accepts the configuration parameters of the base classifier. Author: Ram Sriharsha <rsriharsha@hw11853.local> Closes #6115 from harsha2010/SPARK-7575 and squashes the following commits: 87ad3c7 [Ram Sriharsha] extra line f5d9891 [Ram Sriharsha] Merge branch 'master' into SPARK-7575 7076084 [Ram Sriharsha] cleanup dfd660c [Ram Sriharsha] cleanup 8703e4f [Ram Sriharsha] update doc cb23995 [Ram Sriharsha] fix commandline options for JavaOneVsRestExample 69e91f8 [Ram Sriharsha] cleanup 7f4e127 [Ram Sriharsha] cleanup d4c40d0 [Ram Sriharsha] Code Review fixes 461eb38 [Ram Sriharsha] cleanup e0106d9 [Ram Sriharsha] Fix typo 935cf56 [Ram Sriharsha] Try to match Java and Scala Example Commandline options 5323ff9 [Ram Sriharsha] cleanup 196a59a [Ram Sriharsha] cleanup 6adfa0c [Ram Sriharsha] Style Fix 8cfc5d5 [Ram Sriharsha] [SPARK-7575] Example code for OneVsRest
* [SPARK-7563] OutputCommitCoordinator.stop() should only run on the driver (Josh Rosen, 2015-05-15; 3 files changed, -6/+8)
  This fixes a bug where an executor that exits can cause the driver's OutputCommitCoordinator to stop. To fix this, we use an `isDriver` flag and check it in `stop()`. See https://issues.apache.org/jira/browse/SPARK-7563 for more details. Author: Josh Rosen <joshrosen@databricks.com> Closes #6197 from JoshRosen/SPARK-7563 and squashes the following commits: 04b2cc5 [Josh Rosen] [SPARK-7563] OutputCommitCoordinator.stop() should only be executed on the driver
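A minimal sketch of the guard described above (the class shape is simplified; this is not the actual coordinator):

```scala
object DriverOnlyStopSketch {
  class OutputCommitCoordinatorLike(isDriver: Boolean) {
    def stop(): Unit = {
      // Executor-side copies must not tear down shared coordinator state.
      if (isDriver) {
        println("clearing authorized-committer state")
      }
    }
  }

  def main(args: Array[String]): Unit = {
    new OutputCommitCoordinatorLike(isDriver = false).stop() // executor exit: no-op
    new OutputCommitCoordinatorLike(isDriver = true).stop()  // driver shutdown
  }
}
```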
* [SPARK-7676] Bug fix and cleanup of stage timeline view (Kay Ousterhout, 2015-05-15; 2 files changed, -47/+20)
  cc pwendell sarutak. This commit cleans up some unnecessary code, eliminates the feature where mousing over a box in the timeline highlights the corresponding task in the table (that feature is only useful in the rare case of a very small number of tasks, where the mapping is easy to figure out anyway), and fixes a bug where nothing shows up if you try to visualize a stage with only 1 task. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #6202 from kayousterhout/SPARK-7676 and squashes the following commits: dfd29d4 [Kay Ousterhout] [SPARK-7676] Bug fix and cleanup of stage timeline view
* [SPARK-7556] [ML] [DOC] Add user guide for spark.ml Binarizer, including Scala, Java and Python examples (Liang-Chi Hsieh, 2015-05-15; 1 file changed, -0/+84)
  JIRA: https://issues.apache.org/jira/browse/SPARK-7556 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6116 from viirya/binarizer_doc and squashes the following commits: 40cb677 [Liang-Chi Hsieh] Better print out. 5b7ef1d [Liang-Chi Hsieh] Make examples more clear. 1bf9c09 [Liang-Chi Hsieh] For comments. 6cf8cba [Liang-Chi Hsieh] Add user guide for Binarizer.
* [SPARK-7677] [STREAMING] Add Kafka modules to the 2.11 build (Iulian Dragos, 2015-05-15; 1 file changed, -4/+2)
  This is somewhat related to [SPARK-6154](https://issues.apache.org/jira/browse/SPARK-6154), though it only touches Kafka, not the jline dependency for the thriftserver. I tested this locally on 2.11 (./run-tests) and everything looked good (I had to disable MiMa, because `MimaBuild` hardcodes 2.10 for the previous version -- that's another PR). Author: Iulian Dragos <jaguarul@gmail.com> Closes #6149 from dragos/issue/spark-2.11-kafka and squashes the following commits: aa15d99 [Iulian Dragos] Add Kafka modules to the 2.11 build.
* [SPARK-7226] [SPARKR] Support math functions in R DataFrame (qhuang, 2015-05-15; 4 files changed, -3/+100)
  Author: qhuang <qian.huang@intel.com> Closes #6170 from hqzizania/master and squashes the following commits: f20c39f [qhuang] add tests units and fixes 2a7d121 [qhuang] use a function name more familiar to R users 07aa72e [qhuang] Support math functions in R DataFrame
* [SPARK-7296] Add timeline visualization for stages in the UI (Kousuke Saruta, 2015-05-15; 4 files changed, -10/+348)
  This PR builds on #2342 by adding a timeline view for the Stage page, showing how tasks spend their time. With this timeline, we can understand the following things about a Stage:
  * When/where each task ran
  * Total duration of each task
  * Proportion of the time each task spends
  The timeline view is also scrollable and zoomable. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #5843 from sarutak/stage-page-timeline and squashes the following commits: 4ba9604 [Kousuke Saruta] Fixed the order of legends 16bb552 [Kousuke Saruta] Removed border of legend area 2e5d605 [Kousuke Saruta] Modified warning message 16cb2e6 [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into stage-page-timeline 7ae328f [Kousuke Saruta] Modified code style d5f794a [Kousuke Saruta] Fixed performance issues more 64e6642 [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into stage-page-timeline e4a3354 [Kousuke Saruta] minor code style change 878e3b8 [Kousuke Saruta] Fixed a bug that tooltip remains b9d8f1b [Kousuke Saruta] Fixed performance issue ac8842b [Kousuke Saruta] Fixed layout 2319739 [Kousuke Saruta] Modified appearances more 81903ab [Kousuke Saruta] Modified appearances a79dcc3 [Kousuke Saruta] Modified appearance 55a390c [Kousuke Saruta] Ignored scalastyle for a line-comment 29eae3e [Kousuke Saruta] limited to longest 1000 tasks 2a9e376 [Kousuke Saruta] Minor cleanup 385b6d2 [Kousuke Saruta] Added link feature ba1ac3e [Kousuke Saruta] Fixed style 2ae8520 [Kousuke Saruta] Updated bootstrap-tooltip.js from 2.2.2 to 2.3.2 af430f1 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into stage-page-timeline e694b8e [Kousuke Saruta] Added timeline view to StagePage 8f6610c [Kousuke Saruta] Fixed conflict b587cf2 [Kousuke Saruta] initial commit 11fe67d [Kousuke Saruta] Fixed conflict 79ac03d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature a91abd3 [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into timeline-viewer-feature ef34a5b [Kousuke Saruta] Implement tooltip using bootstrap b09d0c5 [Kousuke Saruta] Move `stroke` and `fill` attribute of rect elements to css d3c63c8 [Kousuke Saruta] Fixed a little bit bugs a36291b [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into timeline-viewer-feature 28714b6 [Kousuke Saruta] Fixed highlight issue 0dc4278 [Kousuke Saruta] Addressed most of Patrics's feedbacks 8110acf [Kousuke Saruta] Added scroll limit to Job timeline 974a64a [Kousuke Saruta] Removed unused function ee7a7f0 [Kousuke Saruta] Refactored 6a91872 [Kousuke Saruta] Temporary commit 6693f34 [Kousuke Saruta] Added link to job/stage box in the timeline in order to move to corresponding row when we click 8f88222 [Kousuke Saruta] Added job/stage description aeed4b1 [Kousuke Saruta] Removed stage timeline fc1696c [Kousuke Saruta] Merge branch 'timeline-viewer-feature' of github.com:sarutak/spark into timeline-viewer-feature 999ccd4 [Kousuke Saruta] Improved scalability 0fc6a31 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature 19815ae [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature 68b7540 [Kousuke Saruta] Merge branch 'timeline-viewer-feature' of github.com:sarutak/spark into timeline-viewer-feature 52b5f0b
[Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature dec85db [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature fcdab7d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature dab7cc1 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature 09cce97 [Kousuke Saruta] Cleanuped 16f82cf [Kousuke Saruta] Cleanuped 9fb522e [Kousuke Saruta] Cleanuped d05f2c2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature e85e9aa [Kousuke Saruta] Cleanup: Added TimelineViewUtils.scala a76e569 [Kousuke Saruta] Removed unused setting in timeline-view.css 5ce1b21 [Kousuke Saruta] Added vis.min.js, vis.min.css and vis.map to .rat-exclude 082f709 [Kousuke Saruta] Added Timeline-View feature for Applications, Jobs and Stages
* [SPARK-7504] [YARN] NullPointerException when initializing SparkContext in YARN-cluster mode (ehnalis, 2015-05-15; 1 file changed, -0/+8)
  Added a simple check to SparkContext, and two sanity null checks in the AM object. Author: ehnalis <zoltan.zvara@gmail.com> Closes #6083 from ehnalis/cluster and squashes the following commits: 926bd96 [ehnalis] Moved check to SparkContext. 7c89b6e [ehnalis] Remove false line. ea2a5fe [ehnalis] [SPARK-7504] [YARN] NullPointerException when initializing SparkContext in YARN-cluster mode 4924e01 [ehnalis] [SPARK-7504] [YARN] NullPointerException when initializing SparkContext in YARN-cluster mode 39e4fa3 [ehnalis] SPARK-7504 [YARN] NullPointerException when initializing SparkContext in YARN-cluster mode 9f287c5 [ehnalis] [SPARK-7504] [YARN] NullPointerException when initializing SparkContext in YARN-cluster mode
* [SPARK-7664] [WEBUI] DAG visualization: fix incorrect link paths of DAG (Kousuke Saruta, 2015-05-15; 1 file changed, -2/+3)
  In JobPage, we can jump to a StagePage by clicking the corresponding box in the DAG visualization, but the link path is incorrect. When we click a box like the following... ![screenshot_from_2015-05-15 19 24 25](https://cloud.githubusercontent.com/assets/4736016/7651528/5f7ef824-fb3c-11e4-9518-8c9ade2dff7a.png) ...we jump to the index page instead. ![screenshot_from_2015-05-15 19 24 45](https://cloud.githubusercontent.com/assets/4736016/7651534/6d666274-fb3c-11e4-971c-c3f2dc2b1da2.png) Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #6184 from sarutak/fix-link-path-of-dag-viz and squashes the following commits: faba3ba [Kousuke Saruta] Fix an incorrect link
* [SPARK-5412] [DEPLOY] Cannot bind Master to a specific hostname as per the documentation (Sean Owen, 2015-05-15; 1 file changed, -1/+5)
  Pass args to start-master.sh through to start-daemon.sh, as other scripts do, so that options like --host take effect on start-master.sh as per the docs. Author: Sean Owen <sowen@cloudera.com> Closes #6185 from srowen/SPARK-5412 and squashes the following commits: b3ce9da [Sean Owen] Pass args to start-master.sh through to start-daemon.sh, as other scripts do, so that things like --host have effect on start-master.sh as per docs
* [CORE] Protect additional test vars from early GC (Tim Ellison, 2015-05-15; 1 file changed, -2/+8)
  Fix more places in which some test variables could be collected early by aggressive JVM optimization. Added a couple of comments to note where existing references are sufficient in the same test pattern. Author: Tim Ellison <t.p.ellison@gmail.com> Closes #6187 from tellison/DefeatEarlyGC and squashes the following commits: 27329d9 [Tim Ellison] [CORE] Protect additional test vars from early GC
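A sketch of the test pattern being protected (assumed shape; the real suites assert on weak references to Spark objects): without a later use of the strong reference, the JIT may treat it as dead and let the referent be collected before the assertion runs.

```scala
import java.lang.ref.WeakReference

object EarlyGcSketch {
  def main(args: Array[String]): Unit = {
    val obj = new Array[Byte](1 << 20)
    val weak = new WeakReference(obj)

    System.gc() // without a later use of `obj`, its referent could be collected here

    assert(weak.get != null)
    // Referencing the variable after the point under test keeps a strong
    // reference alive, defeating the early-collection optimization.
    assert(obj.length == (1 << 20))
  }
}
```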
* [SPARK-7233] [CORE] Detect REPL mode once (Oleksii Kostyliev, 2015-05-15; 2 files changed, -13/+17)
  Description: Detect REPL mode once per JVM lifespan. Previous behavior was to check for the presence of interpreter mode every time a job was submitted. In the case of execution of multiple short-living jobs this was causing massive mutual blocking between submission threads. For more details please refer to https://issues.apache.org/jira/browse/SPARK-7233. Notes:
  * I inverted the return value in case of catching an exception from `true` to `false`. It seems more logical to assume that if the REPL class is not found, we aren't in interpreter mode.
  * I'd personally call `classForName` with just a Spark classloader (`org.apache.spark.util.Utils#getSparkClassLoader`) but `org.apache.spark.util.Utils#getContextOrSparkClassLoader` is said to be preferable.
  * I struggled to come up with a concise, readable and clear unit test. Suggestions are welcome if you feel it necessary.
  Author: Oleksii Kostyliev <etander@gmail.com> Author: Oleksii Kostyliev <okostyliev@thunderhead.com> Closes #5835 from preeze/SPARK-7233 and squashes the following commits: 69bb9e4 [Oleksii Kostyliev] SPARK-7527: fixed explanatory comment to meet style-checker requirements 26dcc24 [Oleksii Kostyliev] SPARK-7527: fixed explanatory comment to meet style-checker requirements c6f9685 [Oleksii Kostyliev] Merge remote-tracking branch 'remotes/upstream/master' into SPARK-7233 b78a983 [Oleksii Kostyliev] SPARK-7527: revert the fix and let it be addressed separately at a later stage b64d441 [Oleksii Kostyliev] SPARK-7233: inline inInterpreter parameter into instantiateClass 86e2606 [Oleksii Kostyliev] SPARK-7233, SPARK-7527: Handle interpreter mode properly. c7ee69c [Oleksii Kostyliev] Merge remote-tracking branch 'upstream/master' into SPARK-7233 d6c07fc [Oleksii Kostyliev] SPARK-7233: properly handle the inverted meaning of isInInterpreter c319039 [Oleksii Kostyliev] SPARK-7233: move inInterpreter to Utils and make it lazy
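The shape of the change, as a hedged sketch: compute the answer once per JVM with a lazy val instead of doing a reflective lookup per job submission. The marker class name below is illustrative, not necessarily the one Spark checks.

```scala
object ReplModeDetectionSketch {
  lazy val inInterpreter: Boolean =
    try {
      // Hypothetical REPL marker class; absence means "not in the REPL".
      Class.forName("org.apache.spark.repl.Main")
      true
    } catch {
      case _: ClassNotFoundException => false
    }

  def main(args: Array[String]): Unit = {
    // Evaluated on first access, then cached for the JVM's lifetime.
    println(inInterpreter)
    println(inInterpreter)
  }
}
```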
* [SPARK-7651] [MLLIB] [PYSPARK] GMM predict, predictSoft should raise error on bad input (FlytxtRnD, 2015-05-15; 1 file changed, -0/+6)
  In the Python API for Gaussian Mixture Model, the predict() and predictSoft() methods should raise an error when the input argument is not an RDD. Author: FlytxtRnD <meethu.mathew@flytxt.com> Closes #6180 from FlytxtRnD/GmmPredictException and squashes the following commits: 4b6aa11 [FlytxtRnD] Raise error if the input to predict()/predictSoft() is not an RDD
* [SPARK-7668] [MLLIB] Preserve isTransposed property for Matrix after calling map function (Liang-Chi Hsieh, 2015-05-15; 1 file changed, -2/+3)
  JIRA: https://issues.apache.org/jira/browse/SPARK-7668 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6188 from viirya/fix_matrix_map and squashes the following commits: 2a7cc97 [Liang-Chi Hsieh] Preserve isTransposed property for Matrix after calling map function.
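A minimal stand-in showing why the flag matters: mapping over the values must carry `isTransposed` through, or the copy silently changes how its backing array is interpreted (row-major vs. column-major). This is a sketch, not the MLlib `Matrix` code.

```scala
object MatrixMapSketch {
  case class DenseMatrixLike(
      numRows: Int,
      numCols: Int,
      values: Array[Double],
      isTransposed: Boolean) {
    // copy() carries isTransposed over to the new instance.
    def map(f: Double => Double): DenseMatrixLike = copy(values = values.map(f))
  }

  def main(args: Array[String]): Unit = {
    val m = DenseMatrixLike(2, 2, Array(1.0, 2.0, 3.0, 4.0), isTransposed = true)
    println(m.map(_ * 2).isTransposed) // true: the property is preserved
  }
}
```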
* [SPARK-7503] [YARN] Resources in .sparkStaging directory can't be cleaned up on error (Kousuke Saruta, 2015-05-15; 1 file changed, -25/+47)
  When we run applications on YARN with cluster mode, uploaded resources in the .sparkStaging directory can't be cleaned up if uploading local resources fails. You can reproduce this issue by running the following command:
  ```
  bin/spark-submit --master yarn --deploy-mode cluster --class <someClassName> <non-existing-jar>
  ```
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #6026 from sarutak/delete-uploaded-resources-on-error and squashes the following commits: caef9f4 [Kousuke Saruta] Fixed style 882f921 [Kousuke Saruta] Wrapped Client#submitApplication with try/catch blocks in order to delete resources on error 1786ca4 [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into delete-uploaded-resources-on-error f61071b [Kousuke Saruta] Fixed cleanup problem
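The shape of the fix, sketched with `java.nio.file` in place of the Hadoop filesystem API: if submission fails after the staging location was created, delete it before rethrowing.

```scala
import java.nio.file.{Files, Path}

object StagingCleanupSketch {
  def submitWithCleanup(stagingDir: Path)(submit: => Unit): Unit =
    try {
      submit
    } catch {
      case e: Throwable =>
        Files.deleteIfExists(stagingDir) // clean up partial uploads on failure
        throw e
    }

  def main(args: Array[String]): Unit = {
    val staging = Files.createTempFile("sparkStaging", "")
    try {
      submitWithCleanup(staging)(throw new RuntimeException("upload failed"))
    } catch {
      case _: RuntimeException => println(s"staging removed: ${!Files.exists(staging)}")
    }
  }
}
```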
* [SPARK-7591] [SQL] Partitioning support API tweaks (Cheng Lian, 2015-05-15; 17 files changed, -194/+195)
  Please see [SPARK-7591] [1] for the details. /cc rxin marmbrus yhuai [1]: https://issues.apache.org/jira/browse/SPARK-7591 Author: Cheng Lian <lian@databricks.com> Closes #6150 from liancheng/spark-7591 and squashes the following commits: af422e7 [Cheng Lian] Addresses @rxin's comments 37d1738 [Cheng Lian] Fixes HadoopFsRelation partition columns initialization 2fc680a [Cheng Lian] Fixes Scala style issue 189ad23 [Cheng Lian] Removes HadoopFsRelation constructor arguments 522c24e [Cheng Lian] Adds OutputWriterFactory 047d40d [Cheng Lian] Renames FSBased* to HadoopFs*, also renamed FSBasedParquetRelation back to ParquetRelation2
* [SPARK-6258] [MLLIB] GaussianMixture Python API parity check (Yanbo Liang, 2015-05-15; 3 files changed, -25/+75)
  Implement the Python API for the major disparities of the GaussianMixture clustering algorithm between Scala and Python:
  ```scala
  GaussianMixture
      setInitialModel
  GaussianMixtureModel
      k
  ```
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #6087 from yanboliang/spark-6258 and squashes the following commits: b3af21c [Yanbo Liang] fix typo 2b645c1 [Yanbo Liang] fix doc 638b4b7 [Yanbo Liang] address comments b5bcade [Yanbo Liang] GaussianMixture Python API parity check
* [SPARK-7650] [STREAMING] [WEBUI] Move streaming css and js files to the streaming project (zsxwing, 2015-05-14; 5 files changed, -4/+14)
  cc tdas Author: zsxwing <zsxwing@gmail.com> Closes #6160 from zsxwing/SPARK-7650 and squashes the following commits: fe6ae15 [zsxwing] Fix the import order a4ffd99 [zsxwing] Merge branch 'master' into SPARK-7650 dc402b6 [zsxwing] Move streaming css and js files to the streaming project
* [CORE] Remove unreachable Heartbeat message from Worker (Kan Zhang, 2015-05-14; 1 file changed, -3/+0)
  It doesn't look like Heartbeat is sent to Worker by anyone. Author: Kan Zhang <kzhang@apache.org> Closes #6163 from kanzhang/deadwood and squashes the following commits: 56be118 [Kan Zhang] [core] Remove unreachable Heartbeat message from Worker
* [HOTFIX] Add workaround for SPARK-7660 to fix JavaAPISuite failures (Josh Rosen, 2015-05-14; 1 file changed, -0/+8)
* [SQL] When creating partitioned table scan, explicitly create UnionRDD (Yin Huai, 2015-05-15; 1 file changed, -4/+7)
  Otherwise, it will cause a stack overflow when there are many partitions. Author: Yin Huai <yhuai@databricks.com> Closes #6162 from yhuai/partitionUnionedRDD and squashes the following commits: fa016d8 [Yin Huai] Explicitly create UnionRDD.
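Why a single UnionRDD helps, as a runnable sketch (local mode): chaining `union` builds one lineage level per RDD, which recurses at evaluation time and can overflow the stack; `UnionRDD` (or `sc.union`) keeps the lineage flat.

```scala
import org.apache.spark.rdd.{RDD, UnionRDD}
import org.apache.spark.{SparkConf, SparkContext}

object UnionManyRddsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("union-sketch"))
    val parts: Seq[RDD[Int]] = (1 to 1000).map(i => sc.parallelize(Seq(i)))

    // Deeply nested lineage, one level per RDD; risks StackOverflowError:
    // val unioned = parts.reduce(_ union _)

    // Flat lineage with a single node over all inputs:
    val unioned = new UnionRDD(sc, parts) // sc.union(parts) is equivalent
    println(unioned.count()) // 1000
    sc.stop()
  }
}
```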
* [SPARK-7098][SQL] Make the WHERE clause with timestamp show consistent result (Liang-Chi Hsieh, 2015-05-14; 3 files changed, -4/+8)
  JIRA: https://issues.apache.org/jira/browse/SPARK-7098 The WHERE clause with a timestamp shows inconsistent results; this PR fixes it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #5682 from viirya/consistent_timestamp and squashes the following commits: 171445a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into consistent_timestamp 4e98520 [Liang-Chi Hsieh] Make the WHERE clause with timestamp show consistent result.
* [SPARK-7548] [SQL] Add explode function for DataFrames (Michael Armbrust, 2015-05-14; 10 files changed, -51/+223)
  Add an `explode` function for dataframes and modify the analyzer so that single table generating functions can be present in a select clause along with other expressions. There are currently the following restrictions (a usage sketch follows this entry):
  - only top level TGFs are allowed (i.e. no `select(explode('list) + 1)`)
  - only one may be present in a single select to avoid potentially confusing implicit Cartesian products.
  TODO:
  - [ ] Python
  Author: Michael Armbrust <michael@databricks.com> Closes #6107 from marmbrus/explodeFunction and squashes the following commits: 7ee2c87 [Michael Armbrust] whitespace 6f80ba3 [Michael Armbrust] Update dataframe.py c176c89 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction 81b5da3 [Michael Armbrust] style d3faa05 [Michael Armbrust] fix self join case f9e1e3e [Michael Armbrust] fix python, add since 4f0d0a9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction e710fe4 [Michael Armbrust] add java and python 52ca0dc [Michael Armbrust] [SPARK-7548][SQL] Add explode function for dataframes.
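A hedged usage sketch of the new function under the restrictions above (Spark 1.4-era DataFrame API assumed):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.explode
import org.apache.spark.{SparkConf, SparkContext}

object ExplodeUsageSketch {
  case class Record(name: String, tags: Seq[String])

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("explode-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Record("a", Seq("x", "y")))).toDF()

    // Allowed: one top-level generating function next to other expressions.
    df.select($"name", explode($"tags").as("tag")).show()

    // Disallowed by the restrictions above (would fail analysis):
    //   df.select(explode($"tags") + 1)
    //   df.select(explode($"tags"), explode($"tags"))
    sc.stop()
  }
}
```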
* [SPARK-7619] [PYTHON] Fix docstring signature (Xiangrui Meng, 2015-05-14; 5 files changed, -55/+52)
  Just realized that we need `\` at the end of the docstring. brkyvz Author: Xiangrui Meng <meng@databricks.com> Closes #6161 from mengxr/SPARK-7619 and squashes the following commits: e44495f [Xiangrui Meng] fix docstring signature
* [SPARK-7648] [MLLIB] Add weights and intercept to GLM wrappers in spark.ml (Xiangrui Meng, 2015-05-14; 3 files changed, -1/+43)
  Otherwise, users can only use `transform` on the models. brkyvz Author: Xiangrui Meng <meng@databricks.com> Closes #6156 from mengxr/SPARK-7647 and squashes the following commits: 1ae3d2d [Xiangrui Meng] add weights and intercept to LogisticRegression in Python f49eb46 [Xiangrui Meng] add weights and intercept to LinearRegressionModel
* [SPARK-7645] [STREAMING] [WEBUI] Show milliseconds in the UI if the batch interval < 1 second (zsxwing, 2015-05-14; 6 files changed, -12/+94)
  I also updated the summary of the Streaming page. ![screen shot 2015-05-14 at 11 52 59 am](https://cloud.githubusercontent.com/assets/1000778/7640103/13cdf68e-fa36-11e4-84ec-e2a3954f4319.png) ![screen shot 2015-05-14 at 12 39 33 pm](https://cloud.githubusercontent.com/assets/1000778/7640151/4cc066ac-fa36-11e4-8494-2821d6a6f17c.png) Author: zsxwing <zsxwing@gmail.com> Closes #6154 from zsxwing/SPARK-7645 and squashes the following commits: 5db6ca1 [zsxwing] Add UIUtils.formatBatchTime e4802df [zsxwing] Show milliseconds in the UI if the batch interval < 1 second
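One plausible shape for the formatting helper (the real `UIUtils.formatBatchTime` signature is assumed, not copied): include milliseconds only when the batch interval is below one second.

```scala
import java.text.SimpleDateFormat
import java.util.Date

object BatchTimeFormatSketch {
  def formatBatchTime(batchTimeMs: Long, batchIntervalMs: Long): String = {
    // Sub-second batches need millisecond resolution to be distinguishable.
    val pattern = if (batchIntervalMs < 1000) "HH:mm:ss.SSS" else "HH:mm:ss"
    new SimpleDateFormat(pattern).format(new Date(batchTimeMs))
  }

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    println(formatBatchTime(now, 500))  // e.g. 12:39:33.123
    println(formatBatchTime(now, 2000)) // e.g. 12:39:33
  }
}
```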
* [SPARK-7649] [STREAMING] [WEBUI] Use window.localStorage to store the status rather than the url (zsxwing, 2015-05-14; 1 file changed, -16/+4)
  Use window.localStorage to store the status rather than the url, so that the url won't be changed. cc tdas Author: zsxwing <zsxwing@gmail.com> Closes #6158 from zsxwing/SPARK-7649 and squashes the following commits: 3c56fef [zsxwing] Use window.localStorage to store the status rather than the url
* [SPARK-7643] [UI] Use the correct size in RDDPage for storage info and partitions (Xiangrui Meng, 2015-05-14; 1 file changed, -2/+5)
  `dataDistribution` and `partitions` are `Option[Seq[_]]`. andrewor14 squito Author: Xiangrui Meng <meng@databricks.com> Closes #6157 from mengxr/SPARK-7643 and squashes the following commits: 99fe8a4 [Xiangrui Meng] use the correct size in RDDPage for storage info and partitions
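The `Option[Seq[_]]` detail is the whole bug; a self-contained illustration of the gotcha:

```scala
object OptionSizeGotcha {
  def main(args: Array[String]): Unit = {
    val partitions: Option[Seq[String]] = Some(Seq("p0", "p1", "p2"))

    // Option implicitly converts to an Iterable of 0 or 1 elements,
    // so this compiles but reports 1, not 3.
    println(partitions.size)

    // The intended value: the size of the underlying sequence.
    println(partitions.map(_.size).getOrElse(0))
  }
}
```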
* [SPARK-7598] [DEPLOY] Add aliveWorkers metrics in Master (Rex Xiong, 2015-05-14; 1 file changed, -0/+5)
  In a Spark Standalone setup, when some workers are DEAD, they stay in the master's worker list for a while. The master.workers metric only shows the total number of workers; we need to monitor how many ALIVE workers there really are to ensure the cluster is healthy. Author: Rex Xiong <pengx@microsoft.com> Closes #6117 from twilightgod/add-aliveWorker-metrics and squashes the following commits: 6be69a5 [Rex Xiong] Fix comment for aliveWorkers metrics a882f39 [Rex Xiong] Fix style for aliveWorkers metrics 38ce955 [Rex Xiong] Add aliveWorkers metrics in Master
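Spark's metrics are built on Dropwizard Metrics, so the addition plausibly looks like registering one more gauge; the names and worker representation below are illustrative.

```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

object AliveWorkersMetricSketch {
  case class WorkerInfo(id: String, state: String)

  def main(args: Array[String]): Unit = {
    val workers = Seq(WorkerInfo("w1", "ALIVE"), WorkerInfo("w2", "DEAD"))
    val registry = new MetricRegistry

    // Count only workers currently in the ALIVE state, not the whole list.
    registry.register(MetricRegistry.name("master", "aliveWorkers"), new Gauge[Int] {
      override def getValue: Int = workers.count(_.state == "ALIVE")
    })

    println(registry.getGauges.get("master.aliveWorkers").getValue) // 1
  }
}
```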
* Make SPARK prefix a variable (tedyu, 2015-05-14; 1 file changed, -1/+2)
  Author: tedyu <yuzhihong@gmail.com> Closes #6153 from ted-yu/master and squashes the following commits: 4e0bac5 [tedyu] Use JIRA_PROJECT_NAME as variable name ab982aa [tedyu] Make SPARK prefix a variable
* [SPARK-7278] [PySpark] DateType should find datetime.datetime acceptable (ksonj, 2015-05-14; 1 file changed, -1/+1)
  DateType should not be restricted to `datetime.date` but accept `datetime.datetime` objects as well. Could someone with a little more insight verify this? Author: ksonj <kson@siberie.de> Closes #6057 from ksonj/dates and squashes the following commits: 68a158e [ksonj] DateType should find datetime.datetime acceptable too
* [SQL][minor] Rename apply for QueryPlanner (Wenchen Fan, 2015-05-14; 2 files changed, -3/+3)
  A follow-up of https://github.com/apache/spark/pull/5624 Author: Wenchen Fan <cloud0fan@outlook.com> Closes #6142 from cloud-fan/tmp and squashes the following commits: 971a92b [Wenchen Fan] use plan instead of execute 24c5ffe [Wenchen Fan] rename apply
* [SPARK-7249] Updated Hadoop dependencies due to inconsistency in the versions (FavioVazquez, 2015-05-14; 8 files changed, -90/+79)
  Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons. These are the changes proposed by vanzin resulting from the previous pull request https://github.com/apache/spark/pull/5783, which did not fix the problem correctly. Please let me know if this is the correct way of doing this; the comments from vanzin are in the pull request mentioned. Author: FavioVazquez <favio.vazquezp@gmail.com> Closes #5786 from FavioVazquez/update-hadoop-dependencies and squashes the following commits: 11670e5 [FavioVazquez] - Added missing instance of -Phadoop-2.2 in create-release.sh 379f50d [FavioVazquez] - Added instances of -Phadoop-2.2 in create-release.sh, run-tests, scalastyle and building-spark.md - Reconstructed docs to not ask users to rely on default behavior 3f9249d [FavioVazquez] Merge branch 'master' of https://github.com/apache/spark into update-hadoop-dependencies 31bdafa [FavioVazquez] - Added missing instances in -Phadoop-1 in create-release.sh, run-tests and in the building-spark documentation cbb93e8 [FavioVazquez] - Added comment related to SPARK-3710 about hadoop-yarn-server-tests in Hadoop 2.2 that fails to pull some needed dependencies 83dc332 [FavioVazquez] - Cleaned up the main POM concerning the yarn profile - Erased hadoop-2.2 profile from yarn/pom.xml and its content was integrated into yarn/pom.xml 93f7624 [FavioVazquez] - Deleted unnecessary comments and <activation> tag on the YARN profile in the main POM 668d126 [FavioVazquez] - Moved <dependencies> <activation> and <properties> sections of the hadoop-2.2 profile in the YARN POM to the YARN profile in the root POM - Erased unnecessary hadoop-2.2 profile from the YARN POM fda6a51 [FavioVazquez] - Updated hadoop1 releases in create-release.sh due to changes in the default hadoop version set - Erased unnecessary instance of -Dyarn.version=2.2.0 in create-release.sh - Prettify comment in yarn/pom.xml 0470587 [FavioVazquez] - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in create-release.sh - Updated how the releases are made in the create-release.sh now that the default hadoop version is the 2.2.0 - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in scalastyle - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in run-tests - Better example given in the hadoop-third-party-distributions.md now that the default hadoop version is 2.2.0 a650779 [FavioVazquez] - Default value of avro.mapred.classifier has been set to hadoop2 in pom.xml - Cleaned up hadoop-2.3 and 2.4 profiles due to change in the default set in avro.mapred.classifier in pom.xml 199f40b [FavioVazquez] - Erased unnecessary CDH5-specific note in docs/building-spark.md - Remove example of instance -Phadoop-2.2 -Dhadoop.version=2.2.0 in docs/building-spark.md - Enabled hadoop-2.2 profile when the Hadoop version is 2.2.0, which is now the default. Added comment in the yarn/pom.xml to specify that.
88a8b88 [FavioVazquez] - Simplified Hadoop profiles due to new setting of global properties in the pom.xml file - Added comment to specify that the hadoop-2.2 profile is now the default hadoop profile in the pom.xml file - Erased hadoop-2.2 from related hadoop profiles now that is a no-op in the make-distribution.sh file 70b8344 [FavioVazquez] - Fixed typo in the make-distribution.sh file and added hadoop-1 in the Related profiles 287fa2f [FavioVazquez] - Updated documentation about specifying the hadoop version in building-spark. Now is clear that Spark will build against Hadoop 2.2.0 by default. - Added Cloudera CDH 5.3.3 without MapReduce example in the building-spark doc. 1354292 [FavioVazquez] - Fixed hadoop-1 version to match jenkins build profile in hadoop1.0 tests and documentation 6b4bfaf [FavioVazquez] - Cleanup in hadoop-2.x profiles since they contained mostly redundant stuff. 7e9955d [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons 660decc [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons ec91ce3 [FavioVazquez] - Updated protobuf-java version of com.google.protobuf dependancy to fix blocking error when connecting to HDFS via the Hadoop Cloudera HDFS CDH5 (fix for 2.5.0-cdh5.3.3 version)
* [SPARK-7568] [ML] ml.LogisticRegression doesn't output the right prediction (DB Tsai, 2015-05-14; 3 files changed, -6/+6)
  The difference is because we didn't fit the intercept in Spark 1.3. Here, we change the input `String` so that the probability of instance 6 can be classified as `1.0` without any ambiguity. With lambda = 0.001 in the current LOR implementation, the prediction is
  ```
  (4, spark i j k) --> prob=[0.1596407738787411,0.8403592261212589], prediction=1.0
  (5, l m n) --> prob=[0.8378325685476612,0.16216743145233883], prediction=0.0
  (6, spark hadoop spark) --> prob=[0.0692663313297627,0.9307336686702373], prediction=1.0
  (7, apache hadoop) --> prob=[0.9821575333444208,0.01784246665557917], prediction=0.0
  ```
  and the training accuracy is
  ```
  (0, a b c d e spark) --> prob=[0.0021342419881406746,0.9978657580118594], prediction=1.0
  (1, b d) --> prob=[0.9959176174854043,0.004082382514595685], prediction=0.0
  (2, spark f g h) --> prob=[0.0014541569986711233,0.9985458430013289], prediction=1.0
  (3, hadoop mapreduce) --> prob=[0.9982978367343561,0.0017021632656438518], prediction=0.0
  ```
  Author: DB Tsai <dbt@netflix.com> Closes #6109 from dbtsai/lor-example and squashes the following commits: ac63ce4 [DB Tsai] first commit
* [SPARK-7407] [MLLIB] Use uid + name to identify parameters (Xiangrui Meng, 2015-05-14; 47 files changed, -213/+452)
  A param instance is strongly attached to a parent in the current implementation, so if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name, so it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the amount of diff and moved `parent` to a mutable field. This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #6019 from mengxr/SPARK-7407 and squashes the following commits: c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407 520f0a2 [Xiangrui Meng] address comments 2569168 [Xiangrui Meng] fix tests 873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn 409ea08 [Xiangrui Meng] minor updates 83a163c [Xiangrui Meng] update JavaDeveloperApiExample 5db5325 [Xiangrui Meng] update OneVsRest 7bde7ae [Xiangrui Meng] merge master 697fdf9 [Xiangrui Meng] update Bucketizer 7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407 629d402 [Xiangrui Meng] fix LRSuite 154516f [Xiangrui Meng] merge master aa4a611 [Xiangrui Meng] fix examples/compile a4794dd [Xiangrui Meng] change Param to use to reduce the size of diff fdbc415 [Xiangrui Meng] all tests passed c255f17 [Xiangrui Meng] fix tests in ParamsSuite 818e1db [Xiangrui Meng] merge master e1160cf [Xiangrui Meng] fix tests fbc39f0 [Xiangrui Meng] pass test:compile 108937e [Xiangrui Meng] pass compile 8726d39 [Xiangrui Meng] use parent uid in Param eaeed35 [Xiangrui Meng] update Identifiable
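A sketch of the identity scheme the description outlines (names are illustrative, not the spark.ml API): a param is a value identified by (parent UID, name), so copies that preserve the UID keep referring to the same logical param.

```scala
object ParamIdentitySketch {
  final case class ParamRef(parentUid: String, name: String)

  class EstimatorLike(val uid: String) {
    val maxIter: ParamRef = ParamRef(uid, "maxIter")
    def copyLike(): EstimatorLike = new EstimatorLike(uid) // UID preserved on copy
  }

  def main(args: Array[String]): Unit = {
    val original = new EstimatorLike("logreg_001")
    val copied = original.copyLike()
    // Identified by (uid, name), the params match across copies.
    println(original.maxIter == copied.maxIter) // true
  }
}
```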