Commit message (author, date; files changed, -deleted/+added lines)
* [SPARK-3710] Fix Yarn integration tests on Hadoop 2.2. (Marcelo Vanzin, 2014-10-07; 1 file, -0/+51)
| | | | | | | | | | | | | | | | It seems some dependencies are not declared when pulling the 2.2 test dependencies, so we need to add them manually for the Yarn cluster to come up. These don't seem to be necessary for 2.3 and beyond, so restrict them to the hadoop-2.2 profile. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2682 from vanzin/SPARK-3710 and squashes the following commits: 701d4fb [Marcelo Vanzin] Add comment. 0540bdf [Marcelo Vanzin] [SPARK-3710] Fix Yarn integration tests on Hadoop 2.2.
* [SPARK-3836] [REPL] Spark REPL optionally propagate internal exceptions (Ahir Reddy, 2014-10-07; 1 file, -3/+9)
| | | | | | | | | | | | Optionally have the repl throw exceptions generated by interpreted code, instead of swallowing the exception and returning it as text output. This is useful when embedding the repl, otherwise it's not possible to know when user code threw an exception. Author: Ahir Reddy <ahirreddy@gmail.com> Closes #2695 from ahirreddy/repl-throw-exceptions and squashes the following commits: bad25ee [Ahir Reddy] Style Fixes f0e5b44 [Ahir Reddy] Fixed style 0d4413d [Ahir Reddy] propogate excetions from repl
* [SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs (Davies Liu, 2014-10-07; 15 files, -206/+167)
| | | | | | | | | | | | | | | | | | | | | | Retire Epydoc, use Sphinx to generate API docs. Refine Sphinx docs, also convert some docstrings into Sphinx style. It looks like: ![api doc](https://cloud.githubusercontent.com/assets/40902/4538272/9e2d4f10-4dec-11e4-8d96-6e45a8fe51f9.png) Author: Davies Liu <davies.liu@gmail.com> Closes #2689 from davies/docs and squashes the following commits: bf4a0a5 [Davies Liu] fix links 3fb1572 [Davies Liu] fix _static in jekyll 65a287e [Davies Liu] fix scripts and logo 8524042 [Davies Liu] Merge branch 'master' of github.com:apache/spark into docs d5b874a [Davies Liu] Merge branch 'master' of github.com:apache/spark into docs 4bc1c3c [Davies Liu] refactor 746d0b6 [Davies Liu] @param -> :param 240b393 [Davies Liu] replace epydoc with sphinx doc
* [SPARK-3829] Make Spark logo image on the header of HistoryPage as a link to HistoryPage's page #1 (Kousuke Saruta, 2014-10-07; 1 file, -2/+4)
| | | | | | | | | | | | | | | HistoryPage's page #1 There is a Spark logo on the header of HistoryPage. We can have too many HistoryPages if we run 20+ applications. So I think, it's useful if the logo is as a link to the HistoryPage's page number 1. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2690 from sarutak/SPARK-3829 and squashes the following commits: 908c109 [Kousuke Saruta] Removed extra space. 00bfbd7 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3829 dd87480 [Kousuke Saruta] Made header Spark log image as a link to History Server's top page.
* [SPARK-3398] [EC2] Have spark-ec2 intelligently wait for specific cluster states (Nicholas Chammas, 2014-10-07; 1 file, -25/+86)
| | | | | | | | | | | | | | | | | | | | | Instead of waiting arbitrary amounts of time for the cluster to reach a specific state, this patch lets `spark-ec2` explicitly wait for a cluster to reach a desired state. This is useful in a couple of situations: * The cluster is launching and you want to wait until SSH is available before installing stuff. * The cluster is being terminated and you want to wait until all the instances are terminated before trying to delete security groups. This patch removes the need for the `--wait` option and removes some of the time-based retry logic that was being used. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #2339 from nchammas/spark-ec2-wait-properly and squashes the following commits: 43a69f0 [Nicholas Chammas] short-circuit SSH check; linear backoff 9a9e035 [Nicholas Chammas] remove extraneous comment 26c5ed0 [Nicholas Chammas] replace print with write() bb67c06 [Nicholas Chammas] deprecate wait option; remove dead code 7969265 [Nicholas Chammas] fix long line (PEP 8) 126e4cf [Nicholas Chammas] wait for specific cluster states
* [SPARK-3832][MLlib] Upgrade Breeze dependency to 0.10 (DB Tsai, 2014-10-07; 1 file, -1/+1)
| | | | | | | | | | | In Breeze 0.10, the L1regParam can be configured through anonymous function in OWLQN, and each component can be penalized differently. This is required for GLMNET in MLlib with L1/L2 regularization. https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c Author: DB Tsai <dbtsai@dbtsai.com> Closes #2693 from dbtsai/breeze0.10 and squashes the following commits: 7a0c45c [DB Tsai] In Breeze 0.10, the L1regParam can be configured through anonymous function in OWLQN, and each component can be penalized differently. This is required for GLMNET in MLlib with L1/L2 regularization. https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c
* [SPARK-3486][MLlib][PySpark] PySpark support for Word2Vec (Liquan Pei, 2014-10-07; 5 files, -7/+264)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mengxr Added PySpark support for Word2Vec Change list (1) PySpark support for Word2Vec (2) SerDe support of string sequence both on python side and JVM side (3) Test for SerDe of string sequence on JVM side Author: Liquan Pei <liquanpei@gmail.com> Closes #2356 from Ishiihara/Word2Vec-python and squashes the following commits: 476ea34 [Liquan Pei] style fixes b13a0b9 [Liquan Pei] resolve merge conflicts and minor fixes 8671eba [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python daf88a6 [Liquan Pei] modification according to feedback a73fa19 [Liquan Pei] clean up 3d8007b [Liquan Pei] fix findSynonyms for vector 1bdcd2e [Liquan Pei] minor fixes cdef9f4 [Liquan Pei] add missing comments b7447eb [Liquan Pei] modify according to feedback b9a7383 [Liquan Pei] cache words RDD in fit 89490bf [Liquan Pei] add tests and Word2VecModelWrapper 78bbb53 [Liquan Pei] use pickle for seq string SerDe a264b08 [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python ca1e5ff [Liquan Pei] fix test 68e7276 [Liquan Pei] minor style fixes 48d5e72 [Liquan Pei] Functionality improvement 0ad3ac1 [Liquan Pei] minor fix c867fdf [Liquan Pei] add Word2Vec to pyspark
* [SPARK-3790][MLlib] CosineSimilarity Example (Reza Zadeh, 2014-10-07; 1 file, -0/+107)
| | | | | | | | | | | | | | | | | Provide example for `RowMatrix.columnSimilarity()` Author: Reza Zadeh <rizlar@gmail.com> Closes #2622 from rezazadeh/dimsumexample and squashes the following commits: 8f20b82 [Reza Zadeh] update comment 379066d [Reza Zadeh] cache rows 792b81c [Reza Zadeh] Address review comments e573c7a [Reza Zadeh] Average absolute error b15685f [Reza Zadeh] Use scopt. Distribute evaluation. eca3dfd [Reza Zadeh] Documentation ac96fb2 [Reza Zadeh] Compute approximation error, add command line. 4533579 [Reza Zadeh] CosineSimilarity Example
* [SPARK-3777] Display "Executor ID" for Tasks in Stage page (zsxwing, 2014-10-07; 1 file, -2/+2)
| | | | | | | | | | | | | | | Now the Stage page only displays "Executor"(host) for tasks. However, there may be more than one Executors running in the same host. Currently, when some task is hung, I only know the host of the faulty executor. Therefore I have to check all executors in the host. Adding "Executor ID" in the Tasks table. would be helpful to locate the faulty executor. Here is the new page: ![add_executor_id_for_tasks](https://cloud.githubusercontent.com/assets/1000778/4505774/acb9648c-4afa-11e4-8826-8768a0a60cc9.png) Author: zsxwing <zsxwing@gmail.com> Closes #2642 from zsxwing/SPARK-3777 and squashes the following commits: 37945af [zsxwing] Put Executor ID and Host into one cell 4bbe2c7 [zsxwing] [SPARK-3777] Display "Executor ID" for Tasks in Stage page
* [SPARK-3825] Log more detail when unrolling a block fails (Andrew Or, 2014-10-07; 2 files, -8/+39)
| | | | | | | | | | | | | | | | | | | | | | | Before: ``` 14/10/06 16:45:42 WARN CacheManager: Not enough space to cache partition rdd_0_2 in memory! Free memory is 481861527 bytes. ``` After: ``` 14/10/07 11:08:24 WARN MemoryStore: Not enough space to cache rdd_2_0 in memory! (computed 68.8 MB so far) 14/10/07 11:08:24 INFO MemoryStore: Memory use = 1088.0 B (blocks) + 445.1 MB (scratch space shared across 8 thread(s)) = 445.1 MB. Storage limit = 459.5 MB. ``` Author: Andrew Or <andrewor14@gmail.com> Closes #2688 from andrewor14/cache-log-message and squashes the following commits: 28e33d6 [Andrew Or] Shy away from "unrolling" 5638c49 [Andrew Or] Grammar 39a0c28 [Andrew Or] Log more detail when unrolling a block fails
* [SPARK-3731] [PySpark] fix memory leak in PythonRDD (Davies Liu, 2014-10-07; 1 file, -0/+5)
| | | | | | | | | | The parent.getOrCompute() of PythonRDD is executed in a separated thread, it should release the memory reserved for shuffle and unrolling finally. Author: Davies Liu <davies.liu@gmail.com> Closes #2668 from davies/leak and squashes the following commits: ae98be2 [Davies Liu] fix memory leak in PythonRDD
* [SPARK-3762] clear reference of SparkEnv after stop (Davies Liu, 2014-10-07; 9 files, -21/+8)
| | | | | | | | | | | | | | | | | SparkEnv is cached in ThreadLocal object, so after stop and create a new SparkContext, old SparkEnv is still used by some threads, it will trigger many problems, for example, pyspark will have problem after restart SparkContext, because py4j use thread pool for RPC. This patch will clear all the references after stop a SparkEnv. cc mateiz tdas pwendell Author: Davies Liu <davies.liu@gmail.com> Closes #2624 from davies/env and squashes the following commits: a69f30c [Davies Liu] deprecate getThreadLocal ba77ca4 [Davies Liu] remove getThreadLocal(), update docs ee62bb7 [Davies Liu] cleanup ThreadLocal of SparnENV 4d0ea8b [Davies Liu] clear reference of SparkEnv after stop
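A minimal standalone sketch of the thread-local clearing pattern this change applies; the names below are illustrative, not the actual SparkEnv members:
```scala
object Env {
  private val threadLocal = new ThreadLocal[Env]
  @volatile private var lastSet: Env = _

  def set(e: Env): Unit = { lastSet = e; threadLocal.set(e) }
  def get: Env = Option(threadLocal.get).getOrElse(lastSet)

  // Called from stop(): drop both references so a freshly created context
  // cannot pick up a stale environment from a pooled (e.g. py4j RPC) thread.
  def clear(): Unit = { lastSet = null; threadLocal.remove() }
}

class Env {
  def stop(): Unit = Env.clear()
}
```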
* [SPARK-3808] PySpark fails to start in Windows (Masayoshi TSUZUKI, 2014-10-07; 1 file, -1/+1)
| | | | | | | | | | Modified syntax error of *.cmd script. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #2669 from tsudukim/feature/SPARK-3808 and squashes the following commits: 7f804e6 [Masayoshi TSUZUKI] [SPARK-3808] PySpark fails to start in Windows
* [SPARK-3827] Very long RDD names are not rendered properly in web UI (Hossein, 2014-10-07; 1 file, -0/+5)
| | | | | | | | | | | | | | With Spark SQL we generate very long RDD names. These names are not properly rendered in the web UI. This PR fixes the rendering issue. [SPARK-3827] #comment Linking PR with JIRA Author: Hossein <hossein@databricks.com> Closes #2687 from falaki/sparkTableUI and squashes the following commits: fd06409 [Hossein] Limit width of cell when RDD name is too long
* [SPARK-3627] - [yarn] - fix exit code and final status reporting to RM (Thomas Graves, 2014-10-07; 4 files, -126/+212)
| | | | | | | | | | | | | | | | | | | See the description and whats handled in the jira comment: https://issues.apache.org/jira/browse/SPARK-3627?focusedCommentId=14150013&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14150013 This does not handle yarn client mode reporting of the driver to the AM. I think that should be handled when we make it an unmanaged AM. Author: Thomas Graves <tgraves@apache.org> Closes #2577 from tgravescs/SPARK-3627 and squashes the following commits: 9c2efbf [Thomas Graves] review comments e8cc261 [Thomas Graves] fix accidental typo during fixing comment 24c98e3 [Thomas Graves] rework 85f1901 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into SPARK-3627 fab166d [Thomas Graves] update based on review comments 32f4dfa [Thomas Graves] switch back f0b6519 [Thomas Graves] change order of cleanup staging dir d3cc800 [Thomas Graves] SPARK-3627 - yarn - fix exit code and final status reporting to RM
* [SPARK-3479] [Build] Report failed test category (Nicholas Chammas, 2014-10-06; 3 files, -35/+126)
| | | | | | | | | | | | | | | | | | | | | | This PR allows SparkQA (i.e. Jenkins) to report in its posts to GitHub what category of test failed, if one can be determined. The failure categories are: * general failure * RAT checks failed * Scala style checks failed * Python style checks failed * Build failed * Spark unit tests failed * PySpark unit tests failed * MiMa checks failed This PR also fixes the diffing logic used to determine if a patch introduces new classes. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #2606 from nchammas/report-failed-test-category and squashes the following commits: d67df03 [Nicholas Chammas] report what test category failed
* [SPARK-3773][PySpark][Doc] Sphinx build warning (cocoatomo, 2014-10-06; 6 files, -23/+28)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When building Sphinx documents for PySpark, we have 12 warnings. Their causes are almost docstrings in broken ReST format. To reproduce this issue, we should run following commands on the commit: 6e27cb630de69fa5acb510b4e2f6b980742b1957. ```bash $ cd ./python/docs $ make clean html ... /Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.SparkContext.sequenceFile:4: ERROR: Unexpected indentation. /Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.RDD.saveAsSequenceFile:4: ERROR: Unexpected indentation. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:14: ERROR: Unexpected indentation. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:14: ERROR: Unexpected indentation. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/<user>/MyRepos/Scala/spark/python/docs/pyspark.mllib.rst:50: WARNING: missing attribute mentioned in :members: or __all__: module pyspark.mllib.regression, attribute RidgeRegressionModelLinearRegressionWithSGD /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.DecisionTreeModel.predict:3: ERROR: Unexpected indentation. ... checking consistency... /Users/<user>/MyRepos/Scala/spark/python/docs/modules.rst:: WARNING: document isn't included in any toctree ... copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist ... build succeeded, 12 warnings. ``` Author: cocoatomo <cocoatomo77@gmail.com> Closes #2653 from cocoatomo/issues/3773-sphinx-build-warnings and squashes the following commits: 6f65661 [cocoatomo] [SPARK-3773][PySpark][Doc] Sphinx build warning
* [SPARK-3786] [PySpark] speedup tests (Davies Liu, 2014-10-06; 4 files, -91/+82)
| | | | | | | | | | | | | | | | | | | | | | | | | This patch try to speed up tests of PySpark, re-use the SparkContext in tests.py and mllib/tests.py to reduce the overhead of create SparkContext, remove some test cases, which did not make sense. It also improve the performance of some cases, such as MergerTests and SortTests. before this patch: real 21m27.320s user 4m42.967s sys 0m17.343s after this patch: real 9m47.541s user 2m12.947s sys 0m14.543s It almost cut the time by half. Author: Davies Liu <davies.liu@gmail.com> Closes #2646 from davies/tests and squashes the following commits: c54de60 [Davies Liu] revert change about memory limit 6a2a4b0 [Davies Liu] refactor of tests, speedup 100%
* [SPARK-2461] [PySpark] Add a toString method to GeneralizedLinearModel (Sandy Ryza, 2014-10-06; 4 files, -4/+9)
| | | | | | | | | | | | | | | | | | | | | | Add a toString method to GeneralizedLinearModel, also change `__str__` to `__repr__` for some classes, to provide better message in repr. This PR is based on #1388, thanks to sryza! closes #1388 Author: Sandy Ryza <sandy@cloudera.com> Author: Davies Liu <davies.liu@gmail.com> Closes #2625 from davies/string and squashes the following commits: 3544aad [Davies Liu] fix LinearModel 0bcd642 [Davies Liu] Merge branch 'sandy-spark-2461' of github.com:sryza/spark 1ce5c2d [Sandy Ryza] __repr__ back to __str__ in a couple places aa9e962 [Sandy Ryza] Switch __str__ to __repr__ a0c5041 [Sandy Ryza] Add labels back in 1aa17f5 [Sandy Ryza] Match existing conventions fac1bc4 [Sandy Ryza] Fix PEP8 error f7b58ed [Sandy Ryza] SPARK-2461. Add a toString method to GeneralizedLinearModel
* [SPARK-3765][Doc] Add test information to sbt build docs (scwf, 2014-10-05; 1 file, -0/+15)
| | | | | | | | | | Add testing with sbt to doc ```building-spark.md``` Author: scwf <wangfei1@huawei.com> Closes #2629 from scwf/sbt-doc and squashes the following commits: fd9cf29 [scwf] add testing with sbt to docs
* Rectify generic parameter names between SparkContext and AccumulableParam (Nathan Kronenfeld, 2014-10-05; 1 file, -6/+6)
| | | | | | | | | | | | | | AccumulableParam gave its generic parameters as 'R, T', whereas SparkContext labeled them 'T, R'. Trivial, but really confusing. I resolved this in favor of AccumulableParam, because it seemed to have some logic for its names. I also extended this minimal, but at least present, justification into the SparkContext comments. Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com> Closes #2637 from nkronenfeld/accumulators and squashes the following commits: 98d6b74 [Nathan Kronenfeld] Rectify gereneric parameter names between SparkContext and AccumulableParam
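For reference, the convention the commit settles on, paraphrased (not copied verbatim) from the 1.x API: `R` is the full accumulated result and `T` is a single element being merged into it.
```scala
trait AccumulableParam[R, T] extends Serializable {
  def addAccumulator(r: R, t: T): R  // fold one element T into the running result R
  def addInPlace(r1: R, r2: R): R    // merge two partial results
  def zero(initialValue: R): R       // identity value for the result type
}
```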
* SPARK-3794 [CORE] Building spark core fails due to inadvertent dependency on Commons IO (Sean Owen, 2014-10-05; 2 files, -11/+10)
| | | | | | | | | | | | | | Commons IO Remove references to Commons IO FileUtils and replace with pure Java version, which doesn't need to traverse the whole directory tree first. I think this method could be refined further if it would be alright to rename it and its args and break it down into two methods. I'm starting with a simple recursive rendition. Author: Sean Owen <sowen@cloudera.com> Closes #2662 from srowen/SPARK-3794 and squashes the following commits: 4cd172f [Sean Owen] Remove references to Commons IO FileUtils and replace with pure Java version, which doesn't need to traverse the whole directory tree first
* [SPARK-3776][SQL] Wrong conversion to Catalyst for Option[Product] (Renat Yusupov, 2014-10-05; 2 files, -4/+19)
| | | | | | | | Author: Renat Yusupov <re.yusupov@2gis.ru> Closes #2641 from r3natko/feature/catalyst_option and squashes the following commits: 55d0c06 [Renat Yusupov] [SQL] SPARK-3776: Wrong conversion to Catalyst for Option[Product]
* [SPARK-3645][SQL] Makes table caching eager by default and adds syntax for lazy caching (Cheng Lian, 2014-10-05; 11 files, -158/+265)
| | | | | | | | | | | | | | | | lazy caching Although lazy caching for in-memory table seems consistent with the `RDD.cache()` API, it's relatively confusing for users who mainly work with SQL and not familiar with Spark internals. The `CACHE TABLE t; SELECT COUNT(*) FROM t;` pattern is also commonly seen just to ensure predictable performance. This PR makes both the `CACHE TABLE t [AS SELECT ...]` statement and the `SQLContext.cacheTable()` API eager by default, and adds a new `CACHE LAZY TABLE t [AS SELECT ...]` syntax to provide lazy in-memory table caching. Also, took the chance to make some refactoring: `CacheCommand` and `CacheTableAsSelectCommand` are now merged and renamed to `CacheTableCommand` since the former is strictly a special case of the latter. A new `UncacheTableCommand` is added for the `UNCACHE TABLE t` statement. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2513 from liancheng/eager-caching and squashes the following commits: fe92287 [Cheng Lian] Makes table caching eager by default and adds syntax for lazy caching
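A usage sketch of the semantics described above (the table name `t` is just a placeholder):
```scala
import org.apache.spark.sql.SQLContext

def cachingDemo(sqlContext: SQLContext): Unit = {
  sqlContext.sql("CACHE TABLE t")       // eager: builds the in-memory columnar data right away
  sqlContext.sql("CACHE LAZY TABLE t")  // lazy: materialized on first use, like RDD.cache()
  sqlContext.cacheTable("t")            // programmatic API, also eager after this change
  sqlContext.sql("UNCACHE TABLE t")     // drops the cached copy again
}
```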
* [SPARK-3792][SQL] Enable JavaHiveQLSuite (scwf, 2014-10-05; 1 file, -18/+9)
| | | | | | | | | | Do not use TestSQLContext in JavaHiveQLSuite, that may lead to two SparkContexts in one jvm and enable JavaHiveQLSuite Author: scwf <wangfei1@huawei.com> Closes #2652 from scwf/fix-JavaHiveQLSuite and squashes the following commits: be35c91 [scwf] enable JavaHiveQLSuite
* [Minor] Trivial fix to make codes more readable (Liang-Chi Hsieh, 2014-10-05; 1 file, -1/+1)
| | | | | | | | | | It should just use `maxResults` there. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #2654 from viirya/trivial_fix and squashes the following commits: 1362289 [Liang-Chi Hsieh] Trivial fix to make codes more readable.
* HOTFIX: Fix unicode error in merge script. (Patrick Wendell, 2014-10-05; 1 file, -2/+1)
| | | | | | | | | | | | | | | | | | The merge script builds up a big command array and sometimes this contains both unicode and ascii strings. This doesn't work if you try to join them into a single string. Longer term a solution is to go and make sure the source of all strings is unicode. This patch provides a simpler solution... just print the array rather than joining. I actually prefer printing an array here anyways since joining on spaces is lossy in the case of arguments that themselves contain spaces. Author: Patrick Wendell <pwendell@gmail.com> Closes #2645 from pwendell/merge-script and squashes the following commits: 167b792 [Patrick Wendell] HOTFIX: Fix unicode error in merge script.
* [SPARK-3007][SQL] Fixes dynamic partitioning support for lower Hadoop versions (Cheng Lian, 2014-10-05; 1 file, -4/+22)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a follow up of #2226 and #2616 to fix Jenkins master SBT build failures for lower Hadoop versions (1.0.x and 2.0.x). The root cause is the semantics difference of `FileSystem.globStatus()` between different versions of Hadoop, as illustrated by the following test code: ```scala object GlobExperiments extends App { val conf = new Configuration() val fs = FileSystem.getLocal(conf) fs.globStatus(new Path("/tmp/wh/*/*/*")).foreach { status => println(status.getPath) } } ``` Target directory structure: ``` /tmp/wh ├── dir0 │   ├── dir1 │   │   └── level2 │   └── level1 └── level0 ``` Hadoop 2.4.1 result: ``` file:/tmp/wh/dir0/dir1/level2 ``` Hadoop 1.0.4 resuet: ``` file:/tmp/wh/dir0/dir1/level2 file:/tmp/wh/dir0/level1 file:/tmp/wh/level0 ``` In #2226 and #2616, we call `FileOutputCommitter.commitJob()` at the end of the job, and the `_SUCCESS` mark file is written. When working with lower Hadoop versions, due to the `globStatus()` semantics issue, `_SUCCESS` is included as a separate partition data file by `Hive.loadDynamicPartitions()`, and fails partition spec checking. The fix introduced in this PR is kind of a hack: when inserting data with dynamic partitioning, we intentionally avoid writing the `_SUCCESS` marker to workaround this issue. Hive doesn't suffer this issue because `FileSinkOperator` doesn't call `FileOutputCommitter.commitJob()`, instead, it calls `Utilities.mvFileToFinalPath()` to cleanup the output directory and then loads it into Hive warehouse by with `loadDynamicPartitions()`/`loadPartition()`/`loadTable()`. This approach is better because it handles failed job and speculative tasks properly. We should add this step to `InsertIntoHiveTable` in another PR. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2663 from liancheng/dp-hadoop-1-fix and squashes the following commits: 0177dae [Cheng Lian] Fixes dynamic partitioning support for lower Hadoop versions
* SPARK-1656: Fix potential resource leaks (zsxwing, 2014-10-05; 3 files, -15/+40)
| | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-1656 Author: zsxwing <zsxwing@gmail.com> Closes #577 from zsxwing/SPARK-1656 and squashes the following commits: c431095 [zsxwing] Add a comment and fix the code style 2de96e5 [zsxwing] Make sure file will be deleted if exception happens 28b90dc [zsxwing] Update to follow the code style 4521d6e [zsxwing] Merge branch 'master' into SPARK-1656 afc3383 [zsxwing] Update to follow the code style 071fdd1 [zsxwing] SPARK-1656: Fix potential resource leaks
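The "make sure the file is deleted if an exception happens" pattern the squashed commits refer to, as a generic illustrative sketch (not the exact files touched by the PR):
```scala
import java.io.{File, FileOutputStream}

def writeOrCleanUp(file: File)(write: FileOutputStream => Unit): Unit = {
  val out = new FileOutputStream(file)
  try {
    write(out)
    out.close()
  } catch {
    case e: Throwable =>
      out.close()
      file.delete() // do not leak a half-written file when the write fails
      throw e
  }
}
```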
* [SPARK-3597][Mesos] Implement `killTask`. (Brenden Matthews, 2014-10-05; 1 file, -0/+7)
| | | | | | | | | | | The MesosSchedulerBackend did not previously implement `killTask`, resulting in an exception. Author: Brenden Matthews <brenden@diddyinc.com> Closes #2453 from brndnmtthws/implement-killtask and squashes the following commits: 23ddcdc [Brenden Matthews] [SPARK-3597][Mesos] Implement `killTask`.
* [SPARK-1860] More conservative app directory cleanup. (mcheah, 2014-10-03; 4 files, -27/+62)
| | | | | | | | | | | | | | | | | | | First contribution to the project, so apologize for any significant errors. This PR addresses [SPARK-1860]. The application directories are now cleaned up in a more conservative manner. Previously, app-* directories were cleaned up if the directory's timestamp was older than a given time. However, the timestamp on a directory does not reflect the modification times of the files in that directory. Therefore, app-* directories were wiped out even if the files inside them were created recently and possibly being used by Executor tasks. The solution is to change the cleanup logic to inspect all files within the app-* directory and only eliminate the app-* directory if all files in the directory are stale. Author: mcheah <mcheah@palantir.com> Closes #2609 from mccheah/worker-better-app-dir-cleanup and squashes the following commits: 87b5d03 [mcheah] [SPARK-1860] Using more string interpolation. Better error logging. 802473e [mcheah] [SPARK-1860] Cleaning up the logs generated when cleaning directories. e0a1f2e [mcheah] [SPARK-1860] Fixing broken unit test. 77a9de0 [mcheah] [SPARK-1860] More conservative app directory cleanup.
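A sketch of the cleanup policy described above; the helper names are illustrative, not the actual Worker code. A directory is removed only when every file inside it is stale, not just when the directory's own timestamp is old.
```scala
import java.io.File

def allFilesStale(dir: File, cutoffMs: Long): Boolean =
  Option(dir.listFiles()).getOrElse(Array.empty[File]).forall { f =>
    if (f.isDirectory) allFilesStale(f, cutoffMs) else f.lastModified() < cutoffMs
  }

def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  f.delete()
}

// Only wipe an app-* directory when nothing inside it was modified after the cutoff.
def maybeCleanUp(appDir: File, ttlMs: Long): Unit = {
  val cutoff = System.currentTimeMillis() - ttlMs
  if (allFilesStale(appDir, cutoff)) deleteRecursively(appDir)
}
```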
* [SPARK-3377] [SPARK-3610] Metrics can be accidentally aggregated / History server log name should not be based on user input (Kousuke Saruta, 2014-10-03; 29 files, -85/+331)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | server log name should not be based on user input This PR is another solution for #2250 I'm using codahale base MetricsSystem of Spark with JMX or Graphite, and I saw following 2 problems. (1) When applications which have same spark.app.name run on cluster at the same time, some metrics names are mixed. For instance, if 2+ application is running on the cluster at the same time, each application emits the same named metric like "SparkPi.DAGScheduler.stage.failedStages" and Graphite cannot distinguish the metrics is for which application. (2) When 2+ executors run on the same machine, JVM metrics of each executors are mixed. For instance, 2+ executors running on the same node can emit the same named metric "jvm.memory" and Graphite cannot distinguish the metrics is from which application. And there is an similar issue. The directory for event logs is named using application name. Application name is defined by user and the name can includes illegal character for path names. Further more, the directory name consists of application name and System.currentTimeMillis even though each application has unique Application ID so if we run jobs which have same name, it's difficult to identify which directory is for which application. Closes #2250 Closes #1067 Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2432 from sarutak/metrics-structure-improvement2 and squashes the following commits: 3288b2b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 39169e4 [Kousuke Saruta] Fixed style 6570494 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 817e4f0 [Kousuke Saruta] Simplified MetricsSystem#buildRegistryName 67fa5eb [Kousuke Saruta] Unified MetricsSystem#registerSources and registerSinks in start 10be654 [Kousuke Saruta] Fixed style. 990c078 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 f0c7fba [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 59cc2cd [Kousuke Saruta] Modified SparkContextSchedulerCreationSuite f9b6fb3 [Kousuke Saruta] Modified style. 
2cf8a0f [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 389090d [Kousuke Saruta] Replaced taskScheduler.applicationId() with getApplicationId in SparkContext#postApplicationStart ff45c89 [Kousuke Saruta] Added some test cases to MetricsSystemSuite 69c46a6 [Kousuke Saruta] Added warning logging logic to MetricsSystem#buildRegistryName 5cca0d2 [Kousuke Saruta] Added Javadoc comment to SparkContext#getApplicationId 16a9f01 [Kousuke Saruta] Added data types to be returned to some methods 6434b06 [Kousuke Saruta] Reverted changes related to ApplicationId 0413b90 [Kousuke Saruta] Deleted ApplicationId.java and ApplicationIdSuite.java a42300c [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 0fc1b09 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 42bea55 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 248935d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 f6af132 [Kousuke Saruta] Modified SchedulerBackend and TaskScheduler to return System.currentTimeMillis as an unique Application Id 1b8b53e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 97cb85c [Kousuke Saruta] Modified confliction of MimExcludes 2cdd009 [Kousuke Saruta] Modified defailt implementation of applicationId 9aadb0b [Kousuke Saruta] Modified NetworkReceiverSuite to ensure "executor.start()" is finished in test "network receiver life cycle" 3011efc [Kousuke Saruta] Added ApplicationIdSuite.scala d009c55 [Kousuke Saruta] Modified ApplicationId#equals to compare appIds dfc83fd [Kousuke Saruta] Modified ApplicationId to implement Serializable 9ff4851 [Kousuke Saruta] Modified MimaExcludes.scala to ignore createTaskScheduler method in SparkContext 4567ffc [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 6a91b14 [Kousuke Saruta] Modified SparkContextSchedulerCreationSuite, ExecutorRunnerTest and EventLoggingListenerSuite 0325caf [Kousuke Saruta] Added ApplicationId.scala 0a2fc14 [Kousuke Saruta] Modified style eabda80 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 0f890e6 [Kousuke Saruta] Modified SparkDeploySchedulerBackend and Master to pass baseLogDir instead f eventLogDir bcf25bf [Kousuke Saruta] Modified directory name for EventLogs 28d4d93 [Kousuke Saruta] Modified SparkContext and EventLoggingListener so that the directory for EventLogs is named same for Application ID 203634e [Kousuke Saruta] Modified comment in SchedulerBackend#applicationId and TaskScheduler#applicationId 424fea4 [Kousuke Saruta] Modified the subclasses of TaskScheduler and SchedulerBackend so that they can return non-optional Unique Application ID b311806 [Kousuke Saruta] Swapped last 2 arguments passed to CoarseGrainedExecutorBackend 8a2b6ec [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 086ee25 [Kousuke Saruta] Merge branch 'metrics-structure-improvement2' of github.com:sarutak/spark into metrics-structure-improvement2 e705386 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 36d2f7a [Kousuke Saruta] Added warning message for the situation we cannot get application id for the prefix for the name of 
metrics eea6e19 [Kousuke Saruta] Modified CoarseGrainedMesosSchedulerBackend and MesosSchedulerBackend so that we can get Application ID c229fbe [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 e719c39 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 4a93c7f [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2 4776f9e [Kousuke Saruta] Modified MetricsSystemSuite.scala efcb6e1 [Kousuke Saruta] Modified to add application id to metrics name 2ec848a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement 3ea7896 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement ead8966 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement 08e627e [Kousuke Saruta] Revert "tmp" 7b67f5a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement 45bd33d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement 93e263a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement 848819c [Kousuke Saruta] Merge branch 'metrics-structure-improvement' of github.com:sarutak/spark into metrics-structure-improvement 912a637 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement e4a4593 [Kousuke Saruta] tmp 3e098d8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement 4603a39 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement fa7175b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement 15f88a3 [Kousuke Saruta] Modified MetricsSystem#buildRegistryName because conf.get does not return null when correspondin entry is absent 6f7dcd4 [Kousuke Saruta] Modified constructor of DAGSchedulerSource and BlockManagerSource because the instance of SparkContext is no longer used 6fc5560 [Kousuke Saruta] Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource 4e057c9 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement 85ffc02 [Kousuke Saruta] Revert "Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource" 868e326 [Kousuke Saruta] Modified MetricsSystem to set registry name with unique application-id and driver/executor-id 71609f5 [Kousuke Saruta] Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource 55debab [Kousuke Saruta] Modified SparkContext and Executor to set spark.executor.id to identifiers 4180993 [Kousuke Saruta] Modified SparkContext to retain spark.unique.app.name property in SparkConf
* [SPARK-3763] The example of building with sbt should be "sbt assembly" instead of "sbt compile" (Kousuke Saruta, 2014-10-03; 1 file, -1/+1)
| | | | | | | | | | | | instead of "sbt compile" In building-spark.md, there are some examples for making assembled package with maven but the example for building with sbt is only about for compiling. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2627 from sarutak/SPARK-3763 and squashes the following commits: fadb990 [Kousuke Saruta] Modified the example to build with sbt in building-spark.md
* [SPARK-3606] [yarn] Correctly configure AmIpFilter for Yarn HA. (Marcelo Vanzin, 2014-10-03; 7 files, -23/+53)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing code only considered one of the RMs when running in Yarn HA mode, so it was possible to get errors if the active RM was not registered in the filter. The change makes use of a new API added to Yarn that returns all proxy addresses, and falls back to the old behavior if the API is not present. While there, I also made a change to look for the scheme (http or https) being used by Yarn when building the proxy URIs. Since, in the case of multiple RMs, Yarn uses commas as a separator, it was not possible anymore to use spark.filter.params to propagate this information (which used commas to delimit different config params). Instead, I added a new param (spark.filter.jsonParams) which expects a JSON string containing a map with the config data. I chose not to add it to the documentation at this point since I don't believe users will use it directly. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2469 from vanzin/SPARK-3606 and squashes the following commits: aeb458a [Marcelo Vanzin] Undelete needed import. 65e400d [Marcelo Vanzin] Remove unused import. d121883 [Marcelo Vanzin] Use separate config for each param instead of json. 04bc156 [Marcelo Vanzin] Review feedback. 4d4d6b9 [Marcelo Vanzin] [SPARK-3606] [yarn] Correctly configure AmIpFilter for Yarn HA.
* [SPARK-3774] typo comment in bin/utils.sh (Masayoshi TSUZUKI, 2014-10-03; 1 file, -1/+1)
| | | | | | | | | | Modified the comment of bin/utils.sh. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #2639 from tsudukim/feature/SPARK-3774 and squashes the following commits: 707b779 [Masayoshi TSUZUKI] [SPARK-3774] typo comment in bin/utils.sh
* [SPARK-3775] Not suitable error message in spark-shell.cmd (Masayoshi TSUZUKI, 2014-10-03; 4 files, -4/+4)
| | | | | | | | | | Modified some sentence of error message in bin\*.cmd. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #2640 from tsudukim/feature/SPARK-3775 and squashes the following commits: 3458afb [Masayoshi TSUZUKI] [SPARK-3775] Not suitable error message in spark-shell.cmd
* [SPARK-3535][Mesos] Fix resource handling. (Brenden Matthews, 2014-10-03; 4 files, -8/+79)
| | | | | | | | Author: Brenden Matthews <brenden@diddyinc.com> Closes #2401 from brndnmtthws/master and squashes the following commits: 4abaa5d [Brenden Matthews] [SPARK-3535][Mesos] Fix resource handling.
* [SPARK-3212][SQL] Use logical plan matching instead of temporary tables for table caching (Michael Armbrust, 2014-10-03; 23 files, -241/+567)
| | | | | | | | | | | | | | | | | | | | | | | | | | | table caching _Also addresses: SPARK-1671, SPARK-1379 and SPARK-3641_ This PR introduces a new trait, `CacheManger`, which replaces the previous temporary table based caching system. Instead of creating a temporary table that shadows an existing table with and equivalent cached representation, the cached manager maintains a separate list of logical plans and their cached data. After optimization, this list is searched for any matching plan fragments. When a matching plan fragment is found it is replaced with the cached data. There are several advantages to this approach: - Calling .cache() on a SchemaRDD now works as you would expect, and uses the more efficient columnar representation. - Its now possible to provide a list of temporary tables, without having to decide if a given table is actually just a cached persistent table. (To be done in a follow-up PR) - In some cases it is possible that cached data will be used, even if a cached table was not explicitly requested. This is because we now look at the logical structure instead of the table name. - We now correctly invalidate when data is inserted into a hive table. Author: Michael Armbrust <michael@databricks.com> Closes #2501 from marmbrus/caching and squashes the following commits: 63fbc2c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into caching. 0ea889e [Michael Armbrust] Address comments. 1e23287 [Michael Armbrust] Add support for cache invalidation for hive inserts. 65ed04a [Michael Armbrust] fix tests. bdf9a3f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into caching b4b77f2 [Michael Armbrust] Address comments 6923c9d [Michael Armbrust] More comments / tests 80f26ac [Michael Armbrust] First draft of improved semantics for Spark SQL caching.
* [SPARK-3007][SQL] Adds dynamic partitioning support (Cheng Lian, 2014-10-03; 15 files, -306/+450)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PR #2226 was reverted because it broke Jenkins builds for unknown reason. This debugging PR aims to fix the Jenkins build. This PR also fixes two bugs: 1. Compression configurations in `InsertIntoHiveTable` are disabled by mistake The `FileSinkDesc` object passed to the writer container doesn't have compression related configurations. These configurations are not taken care of until `saveAsHiveFile` is called. This PR moves compression code forward, right after instantiation of the `FileSinkDesc` object. 1. `PreInsertionCasts` doesn't take table partitions into account In `castChildOutput`, `table.attributes` only contains non-partition columns, thus for partitioned table `childOutputDataTypes` never equals to `tableOutputDataTypes`. This results funny analyzed plan like this: ``` == Analyzed Logical Plan == InsertIntoTable Map(partcol1 -> None, partcol2 -> None), false MetastoreRelation default, dynamic_part_table, None Project [c_0#1164,c_1#1165,c_2#1166] Project [c_0#1164,c_1#1165,c_2#1166] Project [c_0#1164,c_1#1165,c_2#1166] ... (repeats 99 times) ... Project [c_0#1164,c_1#1165,c_2#1166] Project [c_0#1164,c_1#1165,c_2#1166] Project [1 AS c_0#1164,1 AS c_1#1165,1 AS c_2#1166] Filter (key#1170 = 150) MetastoreRelation default, src, None ``` Awful though this logical plan looks, it's harmless because all projects will be eliminated by optimizer. Guess that's why this issue hasn't been caught before. Author: Cheng Lian <lian.cs.zju@gmail.com> Author: baishuo(白硕) <vc_java@hotmail.com> Author: baishuo <vc_java@hotmail.com> Closes #2616 from liancheng/dp-fix and squashes the following commits: 21935b6 [Cheng Lian] Adds back deleted trailing space f471c4b [Cheng Lian] PreInsertionCasts should take table partitions into account a132c80 [Cheng Lian] Fixes output compression 9c6eb2d [Cheng Lian] Adds tests to verify dynamic partitioning folder layout 0eed349 [Cheng Lian] Addresses @yhuai's comments 26632c3 [Cheng Lian] Adds more tests 9227181 [Cheng Lian] Minor refactoring c47470e [Cheng Lian] Refactors InsertIntoHiveTable to a Command 6fb16d7 [Cheng Lian] Fixes typo in test name, regenerated golden answer files d53daa5 [Cheng Lian] Refactors dynamic partitioning support b821611 [baishuo] pass check style 997c990 [baishuo] use HiveConf.DEFAULTPARTITIONNAME to replace hive.exec.default.partition.name 761ecf2 [baishuo] modify according micheal's advice 207c6ac [baishuo] modify for some bad indentation caea6fb [baishuo] modify code to pass scala style checks b660e74 [baishuo] delete a empty else branch cd822f0 [baishuo] do a little modify 8e7268c [baishuo] update file after test 3f91665 [baishuo(白硕)] Update Cast.scala 8ad173c [baishuo(白硕)] Update InsertIntoHiveTable.scala 051ba91 [baishuo(白硕)] Update Cast.scala d452eb3 [baishuo(白硕)] Update HiveQuerySuite.scala 37c603b [baishuo(白硕)] Update InsertIntoHiveTable.scala 98cfb1f [baishuo(白硕)] Update HiveCompatibilitySuite.scala 6af73f4 [baishuo(白硕)] Update InsertIntoHiveTable.scala adf02f1 [baishuo(白硕)] Update InsertIntoHiveTable.scala 1867e23 [baishuo(白硕)] Update SparkHadoopWriter.scala 6bb5880 [baishuo(白硕)] Update HiveQl.scala
* [SPARK-2778] [yarn] Add workaround for race in MiniYARNCluster. (Marcelo Vanzin, 2014-10-03; 1 file, -4/+31)
| | | | | | | | | | | | | | | Sometimes the cluster's start() method returns before the configuration having been updated, which is done by ClientRMService in, I assume, a separate thread (otherwise there would be no race). That can cause tests to fail if the old configuration data is read, since it will contain the wrong RM address. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2605 from vanzin/SPARK-2778 and squashes the following commits: 8d02ce0 [Marcelo Vanzin] Minor cleanup. 5bebee7 [Marcelo Vanzin] [SPARK-2778] [yarn] Add workaround for race in MiniYARNCluster.
* [SPARK-2693][SQL] Supported for UDAF Hive Aggregates like PERCENTILE (ravipesala, 2014-10-03; 2 files, -4/+46)
| | | | | | | | | | | | Implemented UDAF Hive aggregates by adding wrapper to Spark Hive. Author: ravipesala <ravindra.pesala@huawei.com> Closes #2620 from ravipesala/SPARK-2693 and squashes the following commits: a8df326 [ravipesala] Removed resolver from constructor arguments caf25c6 [ravipesala] Fixed style issues 5786200 [ravipesala] Supported for UDAF Hive Aggregates like PERCENTILE
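An illustrative query only; `src` is the small key/value test table used throughout Spark's Hive suites, and `percentile` is one of the Hive UDAFs this wrapper makes callable from Spark SQL:
```scala
import org.apache.spark.sql.hive.HiveContext

def medianKey(hive: HiveContext): Unit = {
  // percentile() is a Hive UDAF; this change adds the wrapper that lets Spark SQL resolve it
  hive.sql("SELECT percentile(key, 0.5) FROM src").collect().foreach(println)
}
```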
* [SPARK-3696] Do not override the user-defined conf_dir (WangTaoTheTonic, 2014-10-03; 1 file, -1/+1)
| | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-3696 We see if SPARK_CONF_DIR is already defined before assignment. Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #2541 from WangTaoTheTonic/confdir and squashes the following commits: c3f31e0 [WangTaoTheTonic] Do not override the user-difined conf_dir
* SPARK-2058: Overriding SPARK_HOME/conf with SPARK_CONF_DIR (EugenCepoi, 2014-10-03; 5 files, -28/+71)
| | | | | | | | | | | | | Update of PR #997. With this PR, setting SPARK_CONF_DIR overrides SPARK_HOME/conf (not only spark-defaults.conf and spark-env). Author: EugenCepoi <cepoi.eugen@gmail.com> Closes #2481 from EugenCepoi/SPARK-2058 and squashes the following commits: 0bb32c2 [EugenCepoi] use orElse orNull and fixing trailing percent in compute-classpath.cmd 77f35d7 [EugenCepoi] SPARK-2058: Overriding SPARK_HOME/conf with SPARK_CONF_DIR
* [SPARK-3366][MLLIB] Compute best splits distributively in decision tree (qiping.lqp, 2014-10-03; 5 files, -267/+182)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, all best splits are computed on the driver, which makes the driver a bottleneck for both communication and computation. This PR fix this problem by computed best splits on executors. Instead of send all aggregate stats to the driver node, we can send aggregate stats for a node to a particular executor, using `reduceByKey` operation, then we can compute best split for this node there. Implementation details: Each node now has a nodeStatsAggregator, which save aggregate stats for all features and bins. First use mapPartition to compute node aggregate stats for all nodes in each partition. Then transform node aggregate stats to (nodeIndex, nodeStatsAggregator) pairs and use to `reduceByKey` operation to combine nodeStatsAggregator for the same node. After all stats have been combined, best splits can be computed for each node based on the node aggregate stats. Best split result is collected to driver to construct the decision tree. CC: mengxr manishamde jkbradley, please help me review this, thanks. Author: qiping.lqp <qiping.lqp@alibaba-inc.com> Author: chouqin <liqiping1991@gmail.com> Closes #2595 from chouqin/dt-dist-agg and squashes the following commits: db0d24a [chouqin] fix a minor bug and adjust code a0d9de3 [chouqin] adjust code based on comments 9f201a6 [chouqin] fix bug: statsSize -> allStatsSize a8a7ed0 [chouqin] Merge branch 'master' of https://github.com/apache/spark into dt-dist-agg f13b346 [chouqin] adjust randomforest comments c32636e [chouqin] adjust code based on comments ac6a505 [chouqin] adjust code based on comments 7bbb787 [chouqin] add comments bdd2a63 [qiping.lqp] fix test suite a75df27 [qiping.lqp] fix test suite b5b0bc2 [qiping.lqp] fix style e76414f [qiping.lqp] fix testsuite 748bd45 [qiping.lqp] fix type-mismatch bug 24eacd8 [qiping.lqp] fix type-mismatch bug 5f63d6c [qiping.lqp] add multiclassification using One-Vs-All strategy 4f56496 [qiping.lqp] fix bug f00fc22 [qiping.lqp] fix bug 532993a [qiping.lqp] Compute best splits distributively in decision tree
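A schematic sketch of the aggregation flow described above. `NodeStats` and the toy scoring step are stand-ins for the real MLlib aggregators; only the mapPartitions + reduceByKey shape mirrors the change.
```scala
import org.apache.spark.SparkContext._ // pair-RDD functions (reduceByKey) in the 1.x API
import org.apache.spark.rdd.RDD

case class NodeStats(counts: Array[Long]) {
  def merge(other: NodeStats): NodeStats =
    NodeStats(counts.zip(other.counts).map { case (a, b) => a + b })
}

// points: one (nodeIndex, binIndex) record per training point
def bestSplitPerNode(points: RDD[(Int, Int)], numBins: Int): Map[Int, Int] = {
  val perPartition: RDD[(Int, NodeStats)] = points.mapPartitions { iter =>
    val local = scala.collection.mutable.Map.empty[Int, NodeStats]
    iter.foreach { case (node, bin) =>
      local.getOrElseUpdate(node, NodeStats(new Array[Long](numBins))).counts(bin) += 1
    }
    local.iterator
  }
  perPartition
    .reduceByKey(_ merge _)                                     // combined per node, on executors
    .mapValues(stats => stats.counts.indexOf(stats.counts.max)) // toy "best split" choice
    .collect()                                                  // only the chosen splits reach the driver
    .toMap
}
```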
* [SPARK-3654][SQL] Implement all extended HiveQL statements/commands with a separate parser combinator (ravipesala, 2014-10-02; 3 files, -44/+154)
| | | | | | | | | | | | | | | | | separate parser combinator Created separate parser for hql. It preparses the commands like cache,uncache,add jar etc.. and then parses with HiveQl Author: ravipesala <ravindra.pesala@huawei.com> Closes #2590 from ravipesala/SPARK-3654 and squashes the following commits: bbca7dd [ravipesala] Fixed code as per admin comments. ae9290a [ravipesala] Fixed style issues as per Admin comments 898ed81 [ravipesala] Removed spaces fb24edf [ravipesala] Updated the code as per admin comments 8947d37 [ravipesala] Removed duplicate code ba26cd1 [ravipesala] Created seperate parser for hql.It pre parses the commands like cache,uncache,add jar etc.. and then parses with HiveQl
* [SQL] Initialize session state before creating CommandProcessor (Michael Armbrust, 2014-10-02; 1 file, -2/+3)
| | | | | | | | | | With the old ordering it was possible for commands in the HiveDriver to NPE due to the lack of configuration in the threadlocal session state. Author: Michael Armbrust <michael@databricks.com> Closes #2635 from marmbrus/initOrder and squashes the following commits: 9749850 [Michael Armbrust] Initilize session state before creating CommandProcessor
* [DEPLOY] SPARK-3759: Return the exit code of the driver process (Eric Eijkelenboom, 2014-10-02; 1 file, -1/+2)
| | | | | | | | | | SparkSubmitDriverBootstrapper.scala now returns the exit code of the driver process, instead of always returning 0. Author: Eric Eijkelenboom <ee@userreport.com> Closes #2628 from ericeijkelenboom/master and squashes the following commits: cc4a571 [Eric Eijkelenboom] Return the exit code of the driver process
* [SPARK-3755][Core] avoid trying privileged port when requesting a non-privileged port (scwf, 2014-10-02; 1 file, -1/+6)
| | | | | | | | | | | | | | | | non-privileged port pwendell, ```tryPort``` is not compatible with old code in last PR, this is to fix it. And after discuss with srowen renamed the title to "avoid trying privileged port when request a non-privileged port". Plz refer to the discuss for detail. Author: scwf <wangfei1@huawei.com> Closes #2623 from scwf/1-1024 and squashes the following commits: 10a4437 [scwf] add comment de3fd17 [scwf] do not try privileged port when request a non-privileged port 42cb0fa [scwf] make tryPort compatible with old code cb8cc76 [scwf] do not use port 1 - 1024
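A sketch of the retry rule described above (not the exact `Utils` code): when a non-privileged port is requested, successive retry candidates must stay inside the non-privileged range instead of wrapping down into ports 1-1024.
```scala
def nextTryPort(requested: Int, attempt: Int): Int = {
  if (requested == 0) {
    0 // 0 means "let the OS pick a free port"; retrying with 0 is always safe
  } else {
    // keep every candidate in [1024, 65535] so retries never hit a privileged port
    ((requested + attempt - 1024) % (65536 - 1024)) + 1024
  }
}
```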
* [SPARK-3632] ConnectionManager can run out of receive threads with authentication on (Thomas Graves, 2014-10-02; 3 files, -79/+63)
| | | | | | | | | | | | | | | | | | authentication on If you turn authentication on and you are using a lot of executors. There is a chance that all the of the threads in the handleMessageExecutor could be waiting to send a message because they are blocked waiting on authentication to happen. This can cause a temporary deadlock until the connection times out. To fix it, I got rid of the wait/notify and use a single outbox but only send security messages from it until authentication has completed. Author: Thomas Graves <tgraves@apache.org> Closes #2484 from tgravescs/cm_threads_auth and squashes the following commits: a0a961d [Thomas Graves] give it a type b6bc80b [Thomas Graves] Rework comments d6d4175 [Thomas Graves] update from comments 081b765 [Thomas Graves] cleanup 4d7f8f5 [Thomas Graves] Change to not use wait/notify while waiting for authentication
* [SPARK-3495] Block replication fails continuously when the replication target node is dead AND [SPARK-3496] Block replication by mistake chooses driver as target (Tathagata Das, 2014-10-02; 8 files, -49/+544)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | target node is dead AND [SPARK-3496] Block replication by mistake chooses driver as target If a block manager (say, A) wants to replicate a block and the node chosen for replication (say, B) is dead, then the attempt to send the block to B fails. However, this continues to fail indefinitely. Even if the driver learns about the demise of the B, A continues to try replicating to B and failing miserably. The reason behind this bug is that A initially fetches a list of peers from the driver (when B was active), but never updates it after B is dead. This affects Spark Streaming as its receiver uses block replication. The solution in this patch adds the following. - Changed BlockManagerMaster to return all the peers of a block manager, rather than the requested number. It also filters out driver BlockManager. - Refactored BlockManager's replication code to handle peer caching correctly. + The peer for replication is randomly selected. This is different from past behavior where for a node A, a node B was deterministically chosen for the lifetime of the application. + If replication fails to one node, the peers are refetched. + The peer cached has a TTL of 1 second to enable discovery of new peers and using them for replication. - Refactored use of \<driver\> in BlockManager into a new method `BlockManagerId.isDriver` - Added replication unit tests (replication was not tested till now, duh!) This should not make a difference in performance of Spark workloads where replication is not used. @andrewor14 @JoshRosen Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #2366 from tdas/replication-fix and squashes the following commits: 9690f57 [Tathagata Das] Moved replication tests to a new BlockManagerReplicationSuite. 0661773 [Tathagata Das] Minor changes based on PR comments. a55a65c [Tathagata Das] Added a unit test to test replication behavior. 012afa3 [Tathagata Das] Bug fix 89f91a0 [Tathagata Das] Minor change. 68e2c72 [Tathagata Das] Made replication peer selection logic more efficient. 08afaa9 [Tathagata Das] Made peer selection for replication deterministic to block id 3821ab9 [Tathagata Das] Fixes based on PR comments. 08e5646 [Tathagata Das] More minor changes. d402506 [Tathagata Das] Fixed imports. 4a20531 [Tathagata Das] Filtered driver block manager from peer list, and also consolidated the use of <driver> in BlockManager. 7598f91 [Tathagata Das] Minor changes. 03de02d [Tathagata Das] Change replication logic to correctly refetch peers from master on failure and on new worker addition. d081bf6 [Tathagata Das] Fixed bug in get peers and unit tests to test get-peers and replication under executor churn. 9f0ac9f [Tathagata Das] Modified replication tests to fail on replication bug. af0c1da [Tathagata Das] Added replication unit tests to BlockManagerSuite
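A condensed sketch of the peer-selection policy described above; `PeerId` and the fetch function stand in for the real BlockManagerId and the master RPC.
```scala
import scala.util.Random

case class PeerId(host: String, port: Int, isDriver: Boolean)

class PeerCache(fetchFromMaster: () => Seq[PeerId], ttlMs: Long = 1000) {
  private var cached: Seq[PeerId] = Nil
  private var fetchedAt = 0L

  private def peers(forceRefetch: Boolean): Seq[PeerId] = synchronized {
    val stale = System.currentTimeMillis() - fetchedAt > ttlMs
    if (forceRefetch || stale || cached.isEmpty) {
      cached = fetchFromMaster().filterNot(_.isDriver) // never pick the driver as a replication target
      fetchedAt = System.currentTimeMillis()
    }
    cached
  }

  // Pick a random live peer; after a failed replication the caller retries with
  // forceRefetch = true so peers the master now knows are dead drop out of the cache.
  def choose(forceRefetch: Boolean = false): Option[PeerId] =
    Random.shuffle(peers(forceRefetch)).headOption
}
```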