Commit message | Author | Age | Files | Lines
...
* [SQL][DOCS] Clarify that the server is for JDBC and ODBC | Michael Armbrust | 2014-09-26 | 1 | -5/+5
  Author: Michael Armbrust <michael@databricks.com>
  Closes #2527 from marmbrus/patch-1 and squashes the following commits:
  a0f9f1c [Michael Armbrust] [SQL][DOCS] Clarify that the server is for JDBC and ODBC
* [Build]remove spark-staging-1030 | wangfei | 2014-09-26 | 1 | -12/+0
  Since 1.1.0 has been published, remove spark-staging-1030.
  Author: wangfei <wangfei1@huawei.com>
  Closes #2532 from scwf/patch-2 and squashes the following commits:
  bc9e00b [wangfei] remove spark-staging-1030
* Slaves file is now a template. | Sarah Gerweck | 2014-09-26 | 1 | -1/+0
  Change 0dc868e removed the `conf/slaves` file and made it a template like most of the other configuration files. This means you can no longer run `make-distribution.sh` unless you manually create a slaves file to be statically bundled in your distribution, which seems at odds with making it a template file.
  Author: Sarah Gerweck <sarah.a180@gmail.com>
  Closes #2549 from sarahgerweck/noMoreSlaves and squashes the following commits:
  d11d99a [Sarah Gerweck] Slaves file is now a template.
* Close #2194. | Reynold Xin | 2014-09-26 | 0 | -0/+0
* [SPARK-3543] Write TaskContext in Java and expose it through a static accessor. | Prashant Sharma | 2014-09-26 | 7 | -131/+284
  Author: Prashant Sharma <prashant.s@imaginea.com>
  Author: Shashank Sharma <shashank21j@gmail.com>
  Closes #2425 from ScrapCodes/SPARK-3543/withTaskContext and squashes the following commits:
  8ae414c [Shashank Sharma] CR
  ee8bd00 [Prashant Sharma] Added internal API in docs comments.
  ddb8cbe [Prashant Sharma] Moved setting the thread local to where TaskContext is instantiated.
  a7d5e23 [Prashant Sharma] Added doc comments.
  edf945e [Prashant Sharma] Code review git add -A
  f716fd1 [Prashant Sharma] introduced thread local for getting the task context.
  333c7d6 [Prashant Sharma] Translated Task context from scala to java.
* Revert "[SPARK-3478] [PySpark] Profile the Python tasks" | Josh Rosen | 2014-09-26 | 7 | -127/+7
  This reverts commit 1aa549ba9839565274a12c52fa1075b424f138a6.
* [SPARK-3393] [SQL] Align the log4j configuration for Spark & SparkSQLCLI | Cheng Hao | 2014-09-26 | 1 | -17/+0
  Users may be confused by the HQL logging and configuration, so we should provide default templates. Both files are copied from Hive.
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #2263 from chenghao-intel/hive_template and squashes the following commits:
  53bffa9 [Cheng Hao] Remove the hive-log4j.properties initialization
* [SPARK-3531][SQL]select null from table would throw a MatchError | Daoyuan Wang | 2014-09-26 | 3 | -0/+5
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #2396 from adrian-wang/selectnull and squashes the following commits:
  2458229 [Daoyuan Wang] rebase solution
* [SPARK-3476] Remove outdated memory checks in Yarn | Andrew Or | 2014-09-26 | 1 | -10/+3
  See description in [JIRA](https://issues.apache.org/jira/browse/SPARK-3476).
  Author: Andrew Or <andrewor14@gmail.com>
  Closes #2528 from andrewor14/yarn-memory-checks and squashes the following commits:
  c5400cd [Andrew Or] Simplify checks
  e30ffac [Andrew Or] Remove outdated memory checks
* [SPARK-3695]shuffle fetch fail output | Daoyuan Wang | 2014-09-26 | 1 | -1/+1
  The error message for a shuffle fetch failure should include the detailed host and port.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #2539 from adrian-wang/fetchfail and squashes the following commits:
  6c1b1e0 [Daoyuan Wang] shuffle fetch fail output
* [SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF | RJ Nowling | 2014-09-26 | 4 | -5/+103
  This PR for [SPARK-3614](https://issues.apache.org/jira/browse/SPARK-3614) adds functionality for filtering out terms which do not appear in at least a minimum number of documents. This is implemented using a minimumOccurence parameter (default 0). When a term's document frequency is less than minimumOccurence, its IDF is set to 0, just like when the DF is 0. As a result, the TF-IDFs for such terms are 0, as if the terms were not present in the documents.
  This PR makes the following changes:
  * Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
  * Create a parameter-less constructor for IDF with a default minimumOccurence value of 0 to remain backwards-compatible with the original IDF API.
  * Set the IDFs to 0 for terms whose DFs are less than minimumOccurence.
  * Add tests to the Spark IDFSuite and Java JavaTfIdfSuite test suites.
  * Update the MLLib Feature Extraction programming guide to describe the new feature.
  Author: RJ Nowling <rnowling@gmail.com>
  Closes #2494 from rnowling/spark-3614-idf-filter and squashes the following commits:
  0aa3c63 [RJ Nowling] Fix identation
  e6523a8 [RJ Nowling] Remove unnecessary toDouble's from IDFSuite
  bfa82ec [RJ Nowling] Add space after if
  30d20b3 [RJ Nowling] Add spaces around equals signs
  9013447 [RJ Nowling] Add space before division operator
  79978fc [RJ Nowling] Remove unnecessary semi-colon
  40fd70c [RJ Nowling] Change minimumOccurence to minDocFreq in code and docs
  47850ab [RJ Nowling] Changed minimumOccurence to Int from Long
  9fb4093 [RJ Nowling] Remove unnecessary lines from IDF class docs
  1fc09d8 [RJ Nowling] Add backwards-compatible constructor to DocumentFrequencyAggregator
  1801fd2 [RJ Nowling] Fix style errors in IDF.scala
  6897252 [RJ Nowling] Preface minimumOccurence members with val to make them final and immutable
  a200bab [RJ Nowling] Remove unnecessary else statement
  4b974f5 [RJ Nowling] Remove accidentally-added import from testing
  c0cc643 [RJ Nowling] Add minimumOccurence filtering to IDF
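
The filtering rule described in the SPARK-3614 message can be sketched in plain Python. This is only an illustration, not the MLlib code: the function name `idf_weights` is hypothetical, and the smoothed formula `log((m + 1) / (df + 1))` is an assumption about how the IDF itself is computed. The only behaviour taken from the commit message is that terms with a DF of 0, or a DF below the minimum, get an IDF of 0.

```python
import math


def idf_weights(doc_freqs, num_docs, min_doc_freq=0):
    """Return one IDF weight per term (hypothetical helper, not the MLlib API).

    Terms that appear in no document, or in fewer than `min_doc_freq`
    documents, get a weight of 0.0, as the commit message describes.
    """
    weights = []
    for df in doc_freqs:
        if df > 0 and df >= min_doc_freq:
            # Assumed smoothed IDF formula; the commit message does not state it.
            weights.append(math.log((num_docs + 1.0) / (df + 1.0)))
        else:
            weights.append(0.0)
    return weights


# Four terms seen in 0, 1, 3 and 7 of 7 documents.
unfiltered = idf_weights([0, 1, 3, 7], num_docs=7)
filtered = idf_weights([0, 1, 3, 7], num_docs=7, min_doc_freq=2)
print(unfiltered)
print(filtered)
```

With `min_doc_freq=2` the term seen in a single document is zeroed out, so its TF-IDF is 0 regardless of how often it occurs in that document.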
* SPARK-3639 | Removed settings master in examples | aniketbhatnagar | 2014-09-26 | 2 | -13/+9
  This patch removes setting of master as local in Kinesis examples so that users can set it using submit-job.
  Author: aniketbhatnagar <aniket.bhatnagar@gmail.com>
  Closes #2536 from aniketbhatnagar/Kinesis-Examples-Master-Unset and squashes the following commits:
  c9723ac [aniketbhatnagar] Merge remote-tracking branch 'origin/Kinesis-Examples-Master-Unset' into Kinesis-Examples-Master-Unset
  fec8ead [aniketbhatnagar] SPARK-3639 | Removed settings master in examples
  31cdc59 [aniketbhatnagar] SPARK-3639 | Removed settings master in examples
* [SPARK-3478] [PySpark] Profile the Python tasks | Davies Liu | 2014-09-26 | 7 | -7/+127
  This patch adds profiling support for PySpark. It shows the profiling results before the driver exits. Here is one example:
  ```
  ============================================================
  Profile of RDD<id=3>
  ============================================================
           5146507 function calls (5146487 primitive calls) in 71.094 seconds

     Ordered by: internal time, cumulative time

     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
         20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
         20    0.017    0.001    0.017    0.001 {cPickle.dumps}
       1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
         20    0.001    0.000    0.001    0.000 {reduce}
         21    0.001    0.000    0.001    0.000 {cPickle.loads}
         20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
         41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
         40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
         62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
         20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
         20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
      40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
         41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
         40    0.000    0.000   71.072    1.777 rdd.py:304(func)
         20    0.000    0.000   71.094    3.555 worker.py:82(process)
  ```
  Users can also show the profile results manually with `sc.show_profiles()` or dump them to disk with `sc.dump_profiles(path)`, for example:
  ```python
  >>> sc._conf.set("spark.python.profile", "true")
  >>> rdd = sc.parallelize(range(100)).map(str)
  >>> rdd.count()
  100
  >>> sc.show_profiles()
  ============================================================
  Profile of RDD<id=1>
  ============================================================
           284 function calls (276 primitive calls) in 0.001 seconds

     Ordered by: internal time, cumulative time

     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
          4    0.000    0.000    0.000    0.000 {reduce}
       12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
          4    0.000    0.000    0.000    0.000 {cPickle.loads}
          4    0.000    0.000    0.000    0.000 {cPickle.dumps}
        104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
          8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
         12    0.000    0.000    0.000    0.000 rdd.py:303(func)
  ```
  Profiling is disabled by default and can be enabled with "spark.python.profile=true". Users can also have the results dumped to disk automatically for later analysis with "spark.python.profile.dump=path_to_dump".
  Author: Davies Liu <davies.liu@gmail.com>
  Closes #2351 from davies/profiler and squashes the following commits:
  7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
  2b0daf2 [Davies Liu] fix docs
  7a56c24 [Davies Liu] bugfix
  cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
  fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
  116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
  09d02c3 [Davies Liu] Merge branch 'master' into profiler
  c23865c [Davies Liu] Merge branch 'master' into profiler
  15d6f18 [Davies Liu] add docs for two configs
  dadee1a [Davies Liu] add docs string and clear profiles after show or dump
  4f8309d [Davies Liu] address comment, add tests
  0a5b6eb [Davies Liu] fix Python UDF
  4b20494 [Davies Liu] add profile for python
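
The report layout quoted in the SPARK-3478 message (the "Ordered by: internal time, cumulative time" header and the `ncalls`/`tottime` columns) is the one printed by Python's standard `cProfile` and `pstats` modules. The sketch below produces a report in that layout without Spark. It assumes nothing about how PySpark wires profiling into its workers, and the helper names `profile_call` and `count_as_strings` are hypothetical.

```python
import cProfile
import io
import pstats


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sorting by "time" then "cumulative" yields the
    # "internal time, cumulative time" ordering seen in the report above.
    stats.sort_stats("time", "cumulative").print_stats()
    return result, buffer.getvalue()


def count_as_strings(n):
    """Stand-in workload, loosely mirroring rdd.map(str).count()."""
    return len([str(i) for i in range(n)])


result, report = profile_call(count_as_strings, 100)
print(report)
```

Capturing the report in a string rather than printing it directly makes it easy to write to a file, which is roughly what a dump-to-path option would need.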
* [SPARK-3686][STREAMING] Wait for sink to commit the channel before checking for the channel size. | Hari Shreedharan | 2014-09-25 | 4 | -10/+48
  Author: Hari Shreedharan <hshreedharan@apache.org>
  Closes #2531 from harishreedharan/sparksinksuite-fix and squashes the following commits:
  30393c1 [Hari Shreedharan] Use more deterministic method to figure out when batches come in.
  6ce9d8b [Hari Shreedharan] [SPARK-3686][STREAMING] Wait for sink to commit the channel before checking for the channel size.
* SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap | zsxwing | 2014-09-25 | 1 | -1/+7
  MapOutputTrackerWorker.mapStatuses is used concurrently, so it should be thread-safe. This bug has already been fixed in #1328. Nevertheless, since #1328 won't be merged soon, I am sending this trivial fix in the hope that the issue can be resolved sooner.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #1541 from zsxwing/SPARK-2634 and squashes the following commits:
  d450053 [zsxwing] SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap
* [SPARK-3584] sbin/slaves doesn't work when we use password authentication for SSH | Kousuke Saruta | 2014-09-25 | 5 | -10/+30
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
  Closes #2444 from sarutak/slaves-scripts-modification and squashes the following commits:
  eff7394 [Kousuke Saruta] Improve the description about Cluster Launch Script in docs/spark-standalone.md
  7858225 [Kousuke Saruta] Modified sbin/slaves to use the environment variable "SPARK_SSH_FOREGROUND" as a flag
  53d7121 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into slaves-scripts-modification
  e570431 [Kousuke Saruta] Added a description for SPARK_SSH_FOREGROUND variable
  7120a0c [Kousuke Saruta] Added a description about default host for sbin/slaves
  1bba8a9 [Kousuke Saruta] Added SPARK_SSH_FOREGROUND flag to sbin/slaves
  88e2f17 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into slaves-scripts-modification
  297e75d [Kousuke Saruta] Modified sbin/slaves not to export HOSTLIST
* [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data. | Aaron Staple | 2014-09-25 | 7 | -30/+91
  Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when called with input data that is not cached. KMeans is implemented iteratively, and I believe that GeneralizedLinearAlgorithm's current optimizers are iterative and its future optimizers are also likely to be iterative. RowMatrix's computeSVD is iterative against an RDD when run in DistARPACK mode. ALS and DecisionTree are iterative as well, but they implement RDD caching internally so do not require a warning.
  I added a warning to GeneralizedLinearAlgorithm rather than inside its optimizers, where the iteration actually occurs, because internally GeneralizedLinearAlgorithm maps its input data to an uncached RDD before passing it to an optimizer. (In other words, the warning would be printed for every GeneralizedLinearAlgorithm run, regardless of whether its input is cached, if the warning were in GradientDescent or another optimizer.) I assume that use of an uncached RDD by GeneralizedLinearAlgorithm is intentional, and that the mapping there (adding label, intercepts and scaling) is a lightweight operation. Arguably a user calling an optimizer such as GradientDescent will be knowledgeable enough to cache their data without needing a log warning, so lack of a warning in the optimizers may be ok.
  Some of the documentation examples making use of these iterative algorithms did not cache their training RDDs (while others did). I updated the examples to always cache. I also fixed some (unrelated) minor errors in the documentation examples.
  Author: Aaron Staple <aaron.staple@gmail.com>
  Closes #2347 from staple/SPARK-1484 and squashes the following commits:
  bd49701 [Aaron Staple] Address review comments.
  ab2d4a4 [Aaron Staple] Disable warnings on python code path.
  a7a0f99 [Aaron Staple] Change code comments per review comments.
  7cca1dc [Aaron Staple] Change warning message text.
  c77e939 [Aaron Staple] [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data.
  3b6c511 [Aaron Staple] Minor doc example fixes.
* [SPARK-3690] Closing shuffle writers we swallow more important exception | epahomov | 2014-09-25 | 1 | -2/+7
  Author: epahomov <pahomov.egor@gmail.com>
  Closes #2537 from epahomov/SPARK-3690 and squashes the following commits:
  a0b7de4 [epahomov] [SPARK-3690] Closing shuffle writers we swallow more important exception
* SPARK-2932 [STREAMING] Move MasterFailureTest out of "main" source directory | Sean Owen | 2014-09-25 | 2 | -36/+8
  (HT @vanzin) Whatever the reason was for having this test class in `main`, if there was one, it appears to be moot. This may have been a result of earlier streaming test reorganization. This simply puts `MasterFailureTest` back under `test/`, removes some redundant copied code, and touches up a few tiny inspection warnings along the way.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #2399 from srowen/SPARK-2932 and squashes the following commits:
  3909411 [Sean Owen] Move MasterFailureTest to src/test, and remove redundant TestOutputStream
* [SPARK-2778] [yarn] Add yarn integration tests. | Marcelo Vanzin | 2014-09-24 | 8 | -10/+229
  This patch adds a couple of, currently, very simple integration tests to make sure both client and cluster modes are working. The tests don't do much yet other than run a simple job, but the plan is to enhance them after we get the framework in.
  The cluster tests are noisy, so redirect all log output to a file like other tests do. Copying the conf around sucks but it's less work than messing with maven/sbt and having to clean up other projects.
  Note the test is only added for yarn-stable. The code compiles against yarn-alpha but there are two issues I ran into that I could not overcome:
  - an old netty dependency kept creeping into the classpath and causing akka to not work, when using sbt; the old netty was correctly suppressed under maven.
  - MiniYARNCluster kept failing to execute containers because it did not create the NM's local dir itself; this is apparently a known behavior, but I'm not sure how to work around it.
  None of those issues are present with the stable Yarn.
  Also, these tests are a little slow to run. Apparently Spark doesn't yet tag tests (so that these could be isolated in a "slow" batch), so this is something to keep in mind.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #2257 from vanzin/yarn-tests and squashes the following commits:
  6d5b84e [Marcelo Vanzin] Fix wrong system property being set.
  8b0933d [Marcelo Vanzin] Merge branch 'master' into yarn-tests
  5c2b56f [Marcelo Vanzin] Use custom log4j conf for Yarn containers.
  ec73f17 [Marcelo Vanzin] More review feedback.
  67f5b02 [Marcelo Vanzin] Review feedback.
  f01517c [Marcelo Vanzin] Review feedback.
  68fbbbf [Marcelo Vanzin] Use older constructor available in older Hadoop releases.
  d07ef9a [Marcelo Vanzin] Merge branch 'master' into yarn-tests
  add8416 [Marcelo Vanzin] [SPARK-2778] [yarn] Add yarn integration tests.
* [SPARK-546] Add full outer join to RDD and DStream. | Aaron Staple | 2014-09-24 | 11 | -7/+250
  leftOuterJoin and rightOuterJoin are already implemented. This patch adds fullOuterJoin.
  Author: Aaron Staple <aaron.staple@gmail.com>
  Closes #1395 from staple/SPARK-546 and squashes the following commits:
  1f5595c [Aaron Staple] Fix python style
  7ac0aa9 [Aaron Staple] [SPARK-546] Add full outer join to RDD and DStream.
  3b5d137 [Aaron Staple] In JavaPairDStream, make class tag specification in rightOuterJoin consistent with other functions.
  31f2956 [Aaron Staple] Fix left outer join documentation comments.
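
The SPARK-546 message does not spell out what a full outer join returns, so the following is a minimal sketch of the usual semantics on plain Python lists of `(key, value)` pairs. It is not the RDD implementation: the function `full_outer_join` is hypothetical, and keys are sorted here only to make the output deterministic.

```python
from collections import defaultdict


def full_outer_join(left, right):
    """Join two lists of (key, value) pairs, keeping keys from both sides.

    A key missing on one side is paired with None on that side; a key
    present on both sides yields every (left value, right value) pair.
    """
    left_by_key = defaultdict(list)
    right_by_key = defaultdict(list)
    for key, value in left:
        left_by_key[key].append(value)
    for key, value in right:
        right_by_key[key].append(value)

    joined = []
    for key in sorted(set(left_by_key) | set(right_by_key)):
        left_values = left_by_key.get(key, [None])
        right_values = right_by_key.get(key, [None])
        for left_value in left_values:
            for right_value in right_values:
                joined.append((key, (left_value, right_value)))
    return joined


pairs = full_outer_join([("a", 1), ("b", 4)], [("a", 2), ("c", 8)])
print(pairs)
```

A left outer join would drop the `"c"` row and a right outer join would drop the `"b"` row; the full outer join keeps both.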
* [SPARK-3615][Streaming]Fix Kafka unit test hard coded Zookeeper port issue | jerryshao | 2014-09-24 | 2 | -14/+34
  Details can be seen in [SPARK-3615](https://issues.apache.org/jira/browse/SPARK-3615).
  Author: jerryshao <saisai.shao@intel.com>
  Closes #2483 from jerryshao/SPARK_3615 and squashes the following commits:
  8555563 [jerryshao] Fix Kafka unit test hard coded Zookeeper port issue
* [SPARK-3679] [PySpark] pickle the exact globals of functions | Davies Liu | 2014-09-24 | 2 | -6/+54
  function.func_code.co_names has all the names used in the function, including the names of attributes. It will pickle some unnecessary globals if a global has the same name as an attribute (in co_names).
  This is a regression introduced by #2144; this patch reverts part of the changes in that PR.
  cc JoshRosen
  Author: Davies Liu <davies.liu@gmail.com>
  Closes #2522 from davies/globals and squashes the following commits:
  dfbccf5 [Davies Liu] fix bug while pickle globals of function
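
The problem described in SPARK-3679 can be reproduced with the standard library alone. The sketch below uses Python 3 names (`__code__` rather than `func_code`) and compares `co_names` with the names that are actually loaded as globals according to the `dis` module. It is only an illustration of the distinction, not the fix used in PySpark's pickling code; the module-level names `upper`, `GREETING` and `shout` are hypothetical.

```python
import dis

# A global that happens to share its name with the str.upper attribute.
upper = "unrelated global"
GREETING = "hello "


def shout(text):
    # GREETING is read as a global; .upper is only an attribute lookup.
    return (GREETING + text).upper()


def names_in_co_names(func):
    """Every name the bytecode refers to, attribute names included."""
    return set(func.__code__.co_names)


def globals_actually_loaded(func):
    """Only the names that the bytecode loads as globals."""
    return {
        instruction.argval
        for instruction in dis.get_instructions(func)
        if instruction.opname == "LOAD_GLOBAL"
    }


print(sorted(names_in_co_names(shout)))
print(sorted(globals_actually_loaded(shout)))
```

Selecting globals by `co_names` alone would capture the unrelated `upper` global as well; filtering on the load instructions leaves only `GREETING`. Nested functions and lambdas carry their own code objects, which this sketch does not walk.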
* [SPARK-3634] [PySpark] User's module should take precedence over system modules | Davies Liu | 2014-09-24 | 4 | -8/+48
  Python modules added through addPyFile should take precedence over system modules.
  This patch puts the path for user-added modules at the front of sys.path (just after '').
  Author: Davies Liu <davies.liu@gmail.com>
  Closes #2492 from davies/path and squashes the following commits:
  4a2af78 [Davies Liu] fix tests
  f7ff4da [Davies Liu] ad license header
  6b0002f [Davies Liu] add tests
  c16c392 [Davies Liu] put addPyFile in front of sys.path
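
The ordering rule in SPARK-3634 ("at the front of sys.path, just after ''") can be sketched on an ordinary list so that it does not touch the real `sys.path`. The helper `prepend_user_path` and the directory names are hypothetical; this is not the PySpark worker code.

```python
def prepend_user_path(search_path, user_path):
    """Return a copy of search_path with user_path ahead of system entries.

    If the list starts with '' (the current directory), the new entry goes
    right after it; otherwise it goes first.
    """
    insert_at = 1 if search_path and search_path[0] == "" else 0
    return search_path[:insert_at] + [user_path] + search_path[insert_at:]


# Hypothetical paths, for illustration only.
system_path = ["", "/usr/lib/python2.7", "/usr/lib/python2.7/site-packages"]
new_path = prepend_user_path(system_path, "/tmp/spark-userfiles")
print(new_path)
```

Because Python searches the list in order, a module found in the user directory is imported before a system module of the same name.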
* [SPARK-3659] Set EC2 version to 1.1.0 and update version map | Shivaram Venkataraman | 2014-09-24 | 1 | -2/+2
  This brings the master branch in sync with branch-1.1
  Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  Closes #2510 from shivaram/spark-ec2-version and squashes the following commits:
  bb0dd16 [Shivaram Venkataraman] Set EC2 version to 1.1.0 and update version map
* [Build] Diff from branch point | Nicholas Chammas | 2014-09-24 | 1 | -2/+2
  Sometimes Jenkins posts [spurious reports of new classes being added](https://github.com/apache/spark/pull/2339#issuecomment-56570170). I believe this stems from diffing the patch against `master`, as opposed to against `master...`, which starts from the commit the PR was branched from.
  This patch fixes that behavior.
  Author: Nicholas Chammas <nicholas.chammas@gmail.com>
  Closes #2512 from nchammas/diff-only-commits-ahead and squashes the following commits:
  c065599 [Nicholas Chammas] comment typo fix
  a453c67 [Nicholas Chammas] diff from branch point
* [SPARK-1853] Show Streaming application code context (file, line number) in Spark Stages UI | Mubarak Seyed | 2014-09-23 | 6 | -58/+153
  This is a refactored version of the original PR https://github.com/apache/spark/pull/1723 by mubarak.
  Please take a look andrewor14, mubarak
  Author: Mubarak Seyed <mubarak.seyed@gmail.com>
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #2464 from tdas/streaming-callsite and squashes the following commits:
  dc54c71 [Tathagata Das] Made changes based on PR comments.
  390b45d [Tathagata Das] Fixed minor bugs.
  904cd92 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-callsite
  7baa427 [Tathagata Das] Refactored getCallSite and setCallSite to make it simpler. Also added unit test for DStream creation site.
  b9ed945 [Mubarak Seyed] Adding streaming utils
  c461cf4 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
  ceb43da [Mubarak Seyed] Changing default regex function name
  8c5d443 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
  196121b [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
  491a1eb [Mubarak Seyed] Removing streaming visibility from getRDDCreationCallSite in DStream
  33a7295 [Mubarak Seyed] Fixing review comments: Merging both setCallSite methods
  c26d933 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
  f51fd9f [Mubarak Seyed] Fixing scalastyle, Regex for Utils.getCallSite, and changing method names in DStream
  5051c58 [Mubarak Seyed] Getting return value of compute() into variable and call setCallSite(prevCallSite) only once. Adding return for other code paths (for None)
  a207eb7 [Mubarak Seyed] Fixing code review comments
  ccde038 [Mubarak Seyed] Removing Utils import from MappedDStream
  2a09ad6 [Mubarak Seyed] Changes in Utils.scala for SPARK-1853
  1d90cc3 [Mubarak Seyed] Changes for SPARK-1853
  5f3105a [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
  70f494f [Mubarak Seyed] Changes for SPARK-1853
  1500deb [Mubarak Seyed] Changes in Spark Streaming UI
  9d38d3c [Mubarak Seyed] [SPARK-1853] Show Streaming application code context (file, line number) in Spark Stages UI
  d466d75 [Mubarak Seyed] Changes for spark streaming UI
* [SPARK-3653] Respect SPARK_*_MEMORY for cluster mode | Andrew Or | 2014-09-23 | 1 | -0/+4
  `SPARK_DRIVER_MEMORY` was only used to start the `SparkSubmit` JVM, which becomes the driver only in client mode but not cluster mode. In cluster mode, this property is simply not propagated to the worker nodes.
  `SPARK_EXECUTOR_MEMORY` is picked up from `SparkContext`, but in cluster mode the driver runs on one of the worker machines, where this environment variable may not be set.
  Author: Andrew Or <andrewor14@gmail.com>
  Closes #2500 from andrewor14/memory-env-vars and squashes the following commits:
  6217b38 [Andrew Or] Respect SPARK_*_MEMORY for cluster mode
* SPARK-3612. Executor shouldn't quit if heartbeat message fails to reach the driver | Sandy Ryza | 2014-09-23 | 1 | -5/+11
  Author: Sandy Ryza <sandy@cloudera.com>
  Closes #2487 from sryza/sandy-spark-3612 and squashes the following commits:
  2b7353d [Sandy Ryza] SPARK-3612. Executor shouldn't quit if heartbeat message fails to reach the driver
* [SPARK-3647] Add more exceptions to Guava relocation. | Marcelo Vanzin | 2014-09-23 | 3 | -1/+31
  Guava's Optional refers to some package private classes / methods, and when those are relocated the code stops working, throwing exceptions. So add the affected classes to the exception list too, and add a unit test.
  (Note that this unit test only really makes sense in maven, since we don't relocate in the sbt build. Also, JavaAPISuite doesn't seem to be run by "mvn test" - I had to manually add command line options to enable it.)
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #2496 from vanzin/SPARK-3647 and squashes the following commits:
  84f58d7 [Marcelo Vanzin] [SPARK-3647] Add more exceptions to Guava relocation.
* [SPARK-3646][SQL] Copy SQL configuration from SparkConf when a SQLContext is created. | Michael Armbrust | 2014-09-23 | 3 | -2/+20
  This will allow us to take advantage of things like the spark.defaults file.
  Author: Michael Armbrust <michael@databricks.com>
  Closes #2493 from marmbrus/copySparkConf and squashes the following commits:
  0bd1377 [Michael Armbrust] Copy SQL configuration from SparkConf when a SQLContext is created.
* [SPARK-3268][SQL] DoubleType, FloatType and DecimalType modulus support | Venkata Ramana Gollamudi | 2014-09-23 | 5 | -0/+44
  Adds support for the modulus operation, using the % operator, on the fractional data types FloatType, DoubleType and DecimalType.
  Example: SELECT 1388632775.0 % 60 from tablename LIMIT 1
  Author : Venkata Ramana Gollamudi ramana.gollamudi@huawei.com
  Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
  Closes #2457 from gvramana/double_modulus_support and squashes the following commits:
  79172a8 [Venkata Ramana Gollamudi] Add hive cache to testcase
  c09bd5b [Venkata Ramana Gollamudi] Added a HiveQuerySuite testcase
  193fa81 [Venkata Ramana Gollamudi] corrected testcase
  3624471 [Venkata Ramana Gollamudi] modified testcase
  e112c09 [Venkata Ramana Gollamudi] corrected the testcase
  513d0e0 [Venkata Ramana Gollamudi] modified to add modulus support to fractional types float,double,decimal
  296d253 [Venkata Ramana Gollamudi] modified to add modulus support to fractional types float,double,decimal
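
As a quick arithmetic check of the example query in SPARK-3268, the same remainder can be computed in plain Python for a double and for a decimal operand. This only illustrates the expected value for these positive operands; it makes no claim about how Spark SQL evaluates the expression, and the variable names are hypothetical.

```python
import math
from decimal import Decimal

# 1388632775 = 23143879 * 60 + 35, so the remainder should be 35.
double_remainder = math.fmod(1388632775.0, 60)
decimal_remainder = Decimal("1388632775.0") % Decimal(60)

print(double_remainder)   # 35.0
print(decimal_remainder)  # 35.0
```

For negative operands, remainder conventions differ between languages (truncated versus floored), so this sketch deliberately sticks to the positive values used in the commit message.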
* [SPARK-3481][SQL] removes the evil MINOR HACK | wangfei | 2014-09-23 | 1 | -2/+0
  a follow up of https://github.com/apache/spark/pull/2377 and https://github.com/apache/spark/pull/2352, see detail there.
  Author: wangfei <wangfei1@huawei.com>
  Closes #2505 from scwf/patch-6 and squashes the following commits:
  4874ec8 [wangfei] removes the evil MINOR HACK
* SPARK-2745 [STREAMING] Add Java friendly methods to Duration class | Sean Owen | 2014-09-23 | 6 | -2/+425
  tdas is this what you had in mind for this JIRA? I saw this one and thought it would be easy to take care of, and helpful as I use streaming from Java. I could do the same for `Time`? Happy to do so.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #2403 from srowen/SPARK-2745 and squashes the following commits:
  5a9e706 [Sean Owen] Change "Duration" to "Durations" to avoid changing Duration case class API
  bda301c [Sean Owen] Just delegate to Scala binary operator syntax to avoid scalastyle warning
  7dde949 [Sean Owen] Disable scalastyle for false positives. Add Java static factory methods seconds(), minutes() to Duration. Add Java-friendly methods to Time too, and unit tests. Remove unnecessary math.floor from Time.floor()
  4dee32e [Sean Owen] Add named methods to Duration in parallel to symbolic methods for Java-friendliness. Also add unit tests for Duration, in Scala and Java.
* [SPARK-3536][SQL] SELECT on empty parquet table throws exception | ravipesala | 2014-09-23 | 2 | -2/+14
  Parquet returns null metadata when an empty parquet file is queried while calculating splits. This patch adds a null check and returns empty splits in that case.
  Author : ravipesala ravindra.pesala@huawei.com
  Author: ravipesala <ravindra.pesala@huawei.com>
  Closes #2456 from ravipesala/SPARK-3536 and squashes the following commits:
  1e81a50 [ravipesala] Fixed the issue when querying on empty parquet file.
* [SPARK-3582][SQL] not limit argument type for hive simple udf | Daoyuan Wang | 2014-09-23 | 2 | -22/+4
  Since we have moved to `ConventionHelper`, it is quite easy to avoid calling `javaClassToDataType` in hive simple udf. This will solve SPARK-3582.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #2506 from adrian-wang/spark3582 and squashes the following commits:
  450c28e [Daoyuan Wang] not limit argument type for hive simple udf
* [SPARK-3598][SQL]cast to timestamp should be the same as hive | Daoyuan Wang | 2014-09-23 | 11 | -15/+50
  This patch fixes timestamps smaller than 0 and casting an int to a timestamp.
  `select cast(1000 as timestamp) from src limit 1;` should return 1970-01-01 00:00:01, but we currently treat the value as 1000 seconds.
  Also, the current implementation has a bug when the time is before 1970-01-01 00:00:00.
  rxin marmbrus chenghao-intel
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #2458 from adrian-wang/timestamp and squashes the following commits:
  4274b1d [Daoyuan Wang] set test not related to timezone
  1234f66 [Daoyuan Wang] fix timestamp smaller than 0 and cast int as timestamp
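
The expected result in SPARK-3598 (1000 mapping to 1970-01-01 00:00:01) implies that the integer is read as milliseconds since the epoch. That reading is an inference from the example, not something the commit message states outright. Under that assumption, and ignoring time zones, the conversion can be sketched as follows; the helper `int_to_timestamp` is hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def int_to_timestamp(millis):
    """Interpret an integer as milliseconds since 1970-01-01 00:00:00.

    Using timedelta arithmetic also works for negative inputs, that is,
    for instants before the epoch.
    """
    return EPOCH + timedelta(milliseconds=millis)


print(int_to_timestamp(1000))   # 1970-01-01 00:00:01
print(int_to_timestamp(-1000))  # 1969-12-31 23:59:59
```

Treating the same input as seconds would instead give 1970-01-01 00:16:40, which is the behaviour the commit message describes as wrong.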
* [SPARK-3304] [YARN] ApplicationMaster's Finish status is wrong when uncaught exception is thrown from ReporterThread | Kousuke Saruta | 2014-09-23 | 1 | -12/+54
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
  Closes #2198 from sarutak/SPARK-3304 and squashes the following commits:
  2696237 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
  5b80363 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
  4eb0a3e [Kousuke Saruta] Remoed the description about spark.yarn.scheduler.reporterThread.maxFailure
  9741597 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
  f7538d4 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
  358ef8d [Kousuke Saruta] Merge branch 'SPARK-3304' of github.com:sarutak/spark into SPARK-3304
  0d138c6 [Kousuke Saruta] Revert "tmp"
  f8da10a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
  b6e9879 [Kousuke Saruta] tmp
  8d256ed [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
  13b2652 [Kousuke Saruta] Merge branch 'SPARK-3304' of github.com:sarutak/spark into SPARK-3304
  2711e15 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
  c081f8e [Kousuke Saruta] Modified ApplicationMaster to handle exception in ReporterThread itself
  0bbd3a6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
  a6982ad [Kousuke Saruta] Added ability handling uncaught exception thrown from Reporter thread
* [SPARK-3477] Clean up code in Yarn Client / ClientBaseAndrew Or2014-09-239-662/+738
This is part of a broader effort to clean up the Yarn integration code after #2020.

The high-level changes in this PR include:
- Remove duplicate code, especially across the alpha and stable APIs
- Simplify unnecessarily complex method signatures and hierarchies
- Rename unclear variable and method names
- Organize logging output produced when the user runs Spark on Yarn
- Extensively add documentation
- Privatize classes where possible

I have tested the stable API on a Hadoop 2.4 cluster. I tested submitting a jar that references classes in other jars in both client and cluster mode. I also made changes in the alpha API, though I do not have access to an alpha cluster. I have verified that it compiles, but it would be ideal if others can help test it.

For those interested in some examples in detail, please read on.

--------------------------------------------------------------------------------------------------------

***Appendix***

- The loop to `getApplicationReport` from the RM is duplicated in 4 places: in the stable `Client`, alpha `Client`, and twice in `YarnClientSchedulerBackend`. We should not have different loops for client and cluster deploy modes.
- There are many fragmented small helper methods that are only used once and should just be inlined. For instance, `ClientBase#getLocalPath` returns `null` on certain conditions, and its only caller `ClientBase#addFileToClasspath` checks whether the value returned is `null`. We could just have the caller check on that same condition to avoid passing `null`s around.
- In `YarnSparkHadoopUtil#addToEnvironment`, we take in an argument `classpathSeparator` that always has the same value upstream (i.e. `File.pathSeparator`). This argument is now removed from the signature and all callers of this method upstream.
- `ClientBase#copyRemoteFile` is now renamed to `copyFileToRemote`. It was unclear whether we are copying a remote file to our local file system, or copying a locally visible file to a remote file system. Also, even the content of the method has inaccurately named variables. We use `val remoteFs` to signify the file system of the locally visible file and `val fs` to signify the remote, destination file system. These are now renamed `srcFs` and `destFs` respectively.
- We currently log the AM container's environment and resource mappings directly as Scala collections. This is incredibly hard to read and probably too verbose for the average Spark user. In other modes (e.g. standalone), we also don't log the launch commands by default, so the logging level of this information is now set to `DEBUG`.
- None of these classes (`Client`, `ClientBase`, `YarnSparkHadoopUtil` etc.) is intended to be used by a Spark application (the user should go through Spark submit instead). At the very least they should be `private[spark]`.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2350 from andrewor14/yarn-cleanup and squashes the following commits:

39e8c7b [Andrew Or] Address review comments
6619f9b [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
2ca6d64 [Andrew Or] Improve logging in application monitor
a3b9693 [Andrew Or] Minor changes
7dd6298 [Andrew Or] Simplify ClientBase#monitorApplication
547487c [Andrew Or] Provide default values for null application report entries
a0ad1e9 [Andrew Or] Fix class not found error
1590141 [Andrew Or] Address review comments
45ccdea [Andrew Or] Remove usages of getAMMemory
d8e33b6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
ed0b42d [Andrew Or] Fix alpha compilation error
c0587b4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
6d74888 [Andrew Or] Minor comment changes
6573c1d [Andrew Or] Clean up, simplify and document code for setting classpaths
e4779b6 [Andrew Or] Clean up log messages + variable naming in ClientBase
8766d37 [Andrew Or] Heavily add documentation to Client* classes + various clean-ups
6c94d79 [Andrew Or] Various cleanups in ClientBase and ClientArguments
ef7069a [Andrew Or] Clean up YarnClientSchedulerBackend more
6de9072 [Andrew Or] Guard against potential NPE in debug logging mode
fabe4c4 [Andrew Or] Reuse more code in YarnClientSchedulerBackend
3f941dc [Andrew Or] First cut at simplifying the Client (stable and alpha)
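
The first appendix item of SPARK-3477, a single application-report polling loop shared by client and cluster deploy modes, might look roughly like the sketch below. It is an illustration only: `MonitorSketch`, `AppState` and its cases, and the `fetchState` callback are hypothetical stand-ins rather than YARN or Spark APIs, and the sketch does not reproduce the signature of `ClientBase#monitorApplication`.

```scala
import scala.annotation.tailrec

object MonitorSketch {
  // Hypothetical application states; stand-ins for whatever the RM reports.
  sealed trait AppState
  case object Accepted extends AppState
  case object Running extends AppState
  case object Finished extends AppState
  case object Failed extends AppState
  case object Killed extends AppState

  private val terminalStates: Set[AppState] = Set(Finished, Failed, Killed)

  /**
   * Polls `fetchState` every `intervalMs` milliseconds and returns the first
   * terminal state, or the first Running state if `returnOnRunning` is set.
   */
  @tailrec
  def monitor(fetchState: () => AppState, intervalMs: Long, returnOnRunning: Boolean): AppState = {
    val state = fetchState()
    if (terminalStates.contains(state) || (returnOnRunning && state == Running)) {
      state
    } else {
      Thread.sleep(intervalMs)
      monitor(fetchState, intervalMs, returnOnRunning)
    }
  }

  def main(args: Array[String]): Unit = {
    // Canned reports replace a real resource manager in this sketch.
    val untilFinished = Iterator[AppState](Accepted, Running, Running, Finished)
    assert(monitor(() => untilFinished.next(), 0L, returnOnRunning = false) == Finished)

    val untilRunning = Iterator[AppState](Accepted, Running, Finished)
    assert(monitor(() => untilRunning.next(), 0L, returnOnRunning = true) == Running)
  }
}
```

With one loop parameterized this way, a caller that only needs to wait for the application to start and a caller that waits for completion can share the same code path instead of each keeping its own copy.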
* [YARN] SPARK-2668: Add variable of yarn log directory for reference from the ↵peng.zhang2014-09-233-0/+8
log4j configuration

Assign the value of the YARN container log directory to the Java option "spark.yarn.app.container.log.dir", so that a user-defined log4j.properties can reference this value and write logs to the YARN container's log directory. Otherwise, a user-defined file appender will only write to the container's CWD; log files in the CWD are not displayed on the YARN UI, nor can they be aggregated to the HDFS log directory after the job finishes.

Example reference in a user-defined log4j.properties:

log4j.appender.rolling_file.File = ${spark.yarn.app.container.log.dir}/spark.log

Author: peng.zhang <peng.zhang@xiaomi.com>

Closes #1573 from renozhang/yarn-log-dir and squashes the following commits:

16c5cb8 [peng.zhang] Update doc
f2b5e2a [peng.zhang] Change variable's name, and update running-on-yarn.md
503ea2d [peng.zhang] Support log4j log to yarn container dir
* [SPARK-3578] Fix upper bound in GraphGenerators.sampleLogNormalAnkur Dave2014-09-222-3/+6
GraphGenerators.sampleLogNormal is supposed to return an integer strictly less than maxVal. However, it violates this guarantee. It generates its return value as follows:

```scala
var X: Double = maxVal

while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
```

When X is sampled to be close to (but less than) maxVal, then it will pass the while loop condition, but the rounded result will be equal to maxVal, which will violate the guarantee. For example, if maxVal is 5 and X is 4.9, then X < maxVal, but `math.round(X.toFloat)` is 5.

This PR instead rounds X before checking the loop condition, guaranteeing that the condition will hold for the return value.

Author: Ankur Dave <ankurdave@gmail.com>

Closes #2439 from ankurdave/SPARK-3578 and squashes the following commits:

f6655e5 [Ankur Dave] Go back to math.floor
5900c22 [Ankur Dave] Round X in loop condition
6fd5fb1 [Ankur Dave] Run sampleLogNormal bounds check 1000 times
1638598 [Ankur Dave] Round down in sampleLogNormal to guarantee upper bound
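
The rounding problem described in SPARK-3578, and one way to keep the upper bound, can be sketched as follows. This is a minimal, self-contained illustration and not the code merged in the PR: the object name `SampleLogNormalSketch` and the explicit `rand` parameter are assumptions made for the example, and rounding down with `math.floor` follows the "Go back to math.floor" entry in the squash list.

```scala
import scala.util.Random

object SampleLogNormalSketch {
  /** Draws from a log-normal distribution and returns an Int strictly below maxVal. */
  def sampleLogNormal(mu: Double, sigma: Double, maxVal: Int, rand: Random): Int = {
    // exp(...) is always positive, so a non-positive bound could never be met.
    require(maxVal > 0, "maxVal must be positive")
    var x: Double = maxVal
    while (x >= maxVal) {
      val z = rand.nextGaussian() // z ~ N(0, 1)
      x = math.exp(mu + sigma * z)
    }
    // Here x < maxVal. Flooring never rounds up, so the result stays below maxVal.
    math.floor(x).toInt
  }

  def main(args: Array[String]): Unit = {
    val rand = new Random(42L)
    val samples = Seq.fill(1000)(sampleLogNormal(1.0, 1.3, 5, rand))
    assert(samples.forall(s => s >= 0 && s < 5))

    // The case from the description: 4.9 passes the loop check but rounds up to 5.
    assert(math.round(4.9f) == 5)
    assert(math.floor(4.9).toInt == 4)
  }
}
```

Rounding down trades a slight downward bias in the samples for a bound that holds by construction, since the floor of a value below an integer `maxVal` is also below `maxVal`.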
* Update docs to use jsonRDD instead of wrong jsonRdd.Grega Kespret2014-09-221-3/+3
Author: Grega Kespret <grega.kespret@gmail.com>

Closes #2479 from gregakespret/patch-1 and squashes the following commits:

dd6b90a [Grega Kespret] Update docs to use jsonRDD instead of wrong jsonRdd.
* [MLLib] Fix example code variable name misspelling in MLLib Feature ↵RJ Nowling2014-09-221-1/+1
Extraction guide

Author: RJ Nowling <rnowling@gmail.com>

Closes #2459 from rnowling/tfidf-fix and squashes the following commits:

b370a91 [RJ Nowling] Fix variable name misspelling in MLLib Feature Extraction guide
* [Minor]ignore .idea_moduleswangfei2014-09-211-0/+1
Ignore .idea_modules; `sbt/sbt gen-idea` generates this directory.

Author: wangfei <wangfei1@huawei.com>

Closes #2476 from scwf/patch-4 and squashes the following commits:

e6ab88a [wangfei] ignore .idea_modules
* [SPARK-3595] Respect configured OutputCommitters when calling saveAsHadoopFileIan Hummel2014-09-213-25/+91
Addresses the issue in https://issues.apache.org/jira/browse/SPARK-3595, namely saveAsHadoopFile hardcoding the OutputCommitter. This is not ideal when running Spark jobs that write to S3, especially when running them from an EMR cluster where the default OutputCommitter is a DirectOutputCommitter.

Author: Ian Hummel <ian@themodernlife.net>

Closes #2450 from themodernlife/spark-3595 and squashes the following commits:

f37a0e5 [Ian Hummel] Update based on comments from pwendell
a11d9f3 [Ian Hummel] Fix formatting
4359664 [Ian Hummel] Add an example showing usage
8b6be94 [Ian Hummel] Add ability to specify OutputCommitter, espcially useful when writing to an S3 bucket from an EMR cluster
* MAINTENANCE: Automated closing of pull requests.Patrick Wendell2014-09-200-0/+0
This commit exists to close the following pull requests on Github:

Closes #1328 (close requested by 'pwendell')
Closes #2314 (close requested by 'pwendell')
Closes #997 (close requested by 'pwendell')
Closes #550 (close requested by 'pwendell')
Closes #1506 (close requested by 'pwendell')
Closes #2423 (close requested by 'mengxr')
Closes #554 (close requested by 'joshrosen')
* [SPARK-3599]Avoid loading properties file frequentlyWangTao2014-09-202-9/+10
https://issues.apache.org/jira/browse/SPARK-3599

Author: WangTao <barneystinson@aliyun.com>
Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2454 from WangTaoTheTonic/avoidLoadingFrequently and squashes the following commits:

3681182 [WangTao] do not use clone
7dca036 [WangTao] use lazy val instead
2a79f26 [WangTaoTheTonic] Avoid loaing properties file frequently
* [SPARK-3414][SQL] Replace LowerCaseSchema with ResolverMichael Armbrust2014-09-2015-99/+125
**This PR introduces a subtle change in semantics for HiveContext when using the results in Python or Scala. Specifically, while resolution remains case insensitive, it is now case preserving.**

_This PR is a follow up to #2293 (and to a lesser extent #2262 #2334)._

In #2293 the catalog was changed to store analyzed logical plans instead of unresolved ones. While this change fixed the reported bug (which was caused by yet another instance of us forgetting to put in a `LowerCaseSchema` operator) it had the consequence of breaking assumptions made by `MultiInstanceRelation`. Specifically, we can't swap out leaf operators in a tree without rewriting changed expression ids (which happens when you self join the same RDD that has been registered as a temp table).

In this PR, I instead remove the need to insert `LowerCaseSchema` operators at all, by moving the concern of matching up identifiers completely into analysis. Doing so allows the test cases from both #2293 and #2262 to pass at the same time (and likely fixes a slew of other "unknown unknown" bugs).

While it is rolled back in this PR, storing the analyzed plan might actually be a good idea. For instance, it is kind of confusing if you register a temporary table, change the case sensitivity of resolution and now you can't query that table anymore. This can be addressed in a follow up PR.

Follow-ups:
- Configurable case sensitivity
- Consider storing analyzed plans for temp tables

Author: Michael Armbrust <michael@databricks.com>

Closes #2382 from marmbrus/lowercase and squashes the following commits:

c21171e [Michael Armbrust] Ensure the resolver is used for field lookups and ensure that case insensitive resolution is still case preserving.
d4320f1 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into lowercase
2de881e [Michael Armbrust] Address comments.
219805a [Michael Armbrust] style
5b93711 [Michael Armbrust] Replace LowerCaseSchema with Resolver.
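
The distinction SPARK-3414 draws between case-insensitive and case-preserving resolution can be illustrated with a small sketch. The names here (`ResolverSketch`, `resolve`, the two resolver values) are chosen for the example and are not claimed to match the code in the PR; the idea is only that identifiers are matched by a pluggable comparison function while the name stored in the schema is what gets returned.

```scala
object ResolverSketch {
  // A resolver decides whether a schema name and a requested name refer to the same field.
  type Resolver = (String, String) => Boolean

  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)
  val caseSensitive: Resolver = (a, b) => a == b

  /** Returns the field name as stored in the schema, not as typed in the query. */
  def resolve(schema: Seq[String], requested: String, resolver: Resolver): Option[String] =
    schema.find(field => resolver(field, requested))

  def main(args: Array[String]): Unit = {
    val schema = Seq("userId", "Name")

    // Matching ignores case, but the original spelling "userId" is preserved.
    assert(resolve(schema, "USERID", caseInsensitive) == Some("userId"))

    // A case-sensitive resolver rejects the same lookup.
    assert(resolve(schema, "USERID", caseSensitive).isEmpty)
  }
}
```

Because the schema is never rewritten to lower case in this arrangement, results handed back to Python or Scala keep the original column names, which is the semantic change the message calls out.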
* [SPARK-3609][SQL] Adds sizeInBytes statistics for Limit operator when all ↵Cheng Lian2014-09-204-5/+45
output attributes are of native data types

This helps to replace shuffled hash joins with broadcast hash joins in some cases.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2468 from liancheng/more-stats and squashes the following commits:

32687dc [Cheng Lian] Moved the test case to PlannerSuite
5595a91 [Cheng Lian] Removes debugging code
73faf69 [Cheng Lian] Test case for auto choosing broadcast hash join
f30fe1d [Cheng Lian] Adds sizeInBytes estimation for Limit when all output types are native types
* SPARK-3574. Shuffle finish time always reported as -1Sandy Ryza2014-09-204-12/+0
The included test waits 100 ms after job completion for task completion events to come in so it can verify they have reasonable finish times. Does anyone know a better way to wait on listener events that are expected to come in?

Author: Sandy Ryza <sandy@cloudera.com>

Closes #2440 from sryza/sandy-spark-3574 and squashes the following commits:

c81439b [Sandy Ryza] Fix test failure
b340956 [Sandy Ryza] SPARK-3574. Remove shuffleFinishTime metric