* [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT. (Marcelo Vanzin, 2015-03-20; 33 files, -33/+47)
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #5056 from vanzin/SPARK-6371 and squashes the following commits: 63220df [Marcelo Vanzin] Merge branch 'master' into SPARK-6371 6506f75 [Marcelo Vanzin] Use more fine-grained exclusion. 178ba71 [Marcelo Vanzin] Oops. 75b2375 [Marcelo Vanzin] Exclude VertexRDD in MiMA. a45a62c [Marcelo Vanzin] Work around MIMA warning. 1d8a670 [Marcelo Vanzin] Re-group jetty exclusion. 0e8e909 [Marcelo Vanzin] Ignore ml, don't ignore graphx. cef4603 [Marcelo Vanzin] Indentation. 296cf82 [Marcelo Vanzin] [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.
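  Several of the squashed commits above are MiMa (binary-compatibility checker) exclusions. A minimal sketch of what such an exclusion looks like in sbt/MiMa terms; the excluded member name is hypothetical, not the one this commit touched:

  ```
  import com.typesafe.tools.mima.core._

  // Silence a binary-compatibility warning for an internal member that MiMa
  // flags as missing after the version bump (target name is illustrative).
  val mimaExcludes: Seq[ProblemFilter] = Seq(
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.graphx.VertexRDD.someInternalMethod")
  )
  ```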
* [SPARK-6426][Doc] User could also point the yarn cluster config directory via YARN_CONF_DIR (WangTaoTheTonic, 2015-03-20; 1 file, -2/+2)
  https://issues.apache.org/jira/browse/SPARK-6426
  Author: WangTaoTheTonic <wangtao111@huawei.com>
  Closes #5103 from WangTaoTheTonic/SPARK-6426 and squashes the following commits: e6dd78d [WangTaoTheTonic] User could also point the yarn cluster config directory via YARN_CONF_DIR
* [SPARK-6370][core] Documentation: Improve all 3 docs for RDD.sample (mbonaci, 2015-03-20; 3 files, -0/+23)
  The docs for the `sample` method were insufficient, now less so.
  Author: mbonaci <mbonaci@gmail.com>
  Closes #5097 from mbonaci/master and squashes the following commits: a6a9d97 [mbonaci] [SPARK-6370][core] Documentation: Improve all 3 docs for RDD.sample method
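  For reference, a minimal usage sketch of the documented method (spark-shell style, where `sc` is the provided SparkContext; values are illustrative):

  ```
  val rdd = sc.parallelize(1 to 1000)

  // fraction is an *expected* share of rows, not an exact count;
  // the seed makes the sample reproducible across runs.
  val sampled = rdd.sample(withReplacement = false, fraction = 0.1, seed = 42L)
  println(sampled.count())  // roughly 100
  ```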
* [SPARK-6428][MLlib] Added explicit type for public methods and implemented hashCode when equals is defined (Reynold Xin, 2015-03-20; 23 files, -61/+101)
  I want to add a checker to turn public type checking on, since future pull requests can accidentally expose a non-public type. This is the first cleanup task.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #5102 from rxin/mllib-hashcode-publicmethodtypes and squashes the following commits: 617f19e [Reynold Xin] Fixed Scala compilation error. 52bc2d5 [Reynold Xin] [MLlib] Added explicit type for public methods and implemented hashCode when equals is defined.
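  The two conventions the cleanup enforces look roughly like this; the class and fields here are hypothetical, only the pattern matches the commit:

  ```
  class LabeledVector(val label: Double, val features: Array[Double]) {
    // Explicit return type, so the public API cannot leak an inferred type.
    def norm: Double = math.sqrt(features.map(x => x * x).sum)

    override def equals(other: Any): Boolean = other match {
      case o: LabeledVector => label == o.label && features.sameElements(o.features)
      case _ => false
    }

    // Whenever equals is overridden, hashCode must be overridden consistently.
    override def hashCode: Int = 31 * label.hashCode + java.util.Arrays.hashCode(features)
  }
  ```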
* SPARK-6338 [CORE] Use standard temp dir mechanisms in tests to avoid orphaned temp files (Sean Owen, 2015-03-20; 33 files, -153/+116)
  Use `Utils.createTempDir()` to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify
  Author: Sean Owen <sowen@cloudera.com>
  Closes #5029 from srowen/SPARK-6338 and squashes the following commits: 27b740a [Sean Owen] Fix hive-thriftserver tests that don't expect an existing dir 4a212fa [Sean Owen] Standardize a bit more temp dir management 9004081 [Sean Owen] Revert some added recursive-delete calls 57609e4 [Sean Owen] Use Utils.createTempDir() to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify
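  A sketch of the pattern inside Spark's own test code (`Utils` is private[spark], so this only applies there); the directory is registered for deletion on JVM shutdown, which is what prevents orphans:

  ```
  import java.io.File
  import org.apache.spark.util.Utils

  val tempDir: File = Utils.createTempDir()  // auto-registered for shutdown cleanup
  val out = new File(tempDir, "output.txt")
  // ... exercise the code under test using `out` ...
  Utils.deleteRecursively(tempDir)           // or delete eagerly at the end of the test
  ```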
* SPARK-5134 [BUILD] Bump default Hadoop version to 2+ (Sean Owen, 2015-03-20; 1 file, -1/+1)
  Bump default Hadoop version to 2.2.0. (This is already the dependency version reported by published Maven artifacts.) See JIRA for further discussion.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #5027 from srowen/SPARK-5134 and squashes the following commits: acbee14 [Sean Owen] Bump default Hadoop version to 2.2.0. (This is already the dependency version reported by published Maven artifacts.)
* [SPARK-6286][Mesos][minor] Handle missing Mesos case TASK_ERROR (Jongyoul Lee, 2015-03-20; 3 files, -2/+5)
  - Made TaskState.isFailed for handling TASK_LOST and TASK_ERROR and synchronizing CoarseMesosSchedulerBackend and MesosSchedulerBackend
  - This is related #5000
  Author: Jongyoul Lee <jongyoul@gmail.com>
  Closes #5088 from jongyoul/SPARK-6286-1 and squashes the following commits: 4f2362f [Jongyoul Lee] [SPARK-6286][Mesos][minor] Handle missing Mesos case TASK_ERROR - Fixed scalastyle ac4336a [Jongyoul Lee] [SPARK-6286][Mesos][minor] Handle missing Mesos case TASK_ERROR - Made TaskState.isFailed for handling TASK_LOST and TASK_ERROR and synchronizing CoarseMesosSchedulerBackend and MesosSchedulerBackend
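  A minimal sketch of what an `isFailed` predicate on Spark's internal task-state enumeration looks like, so both Mesos backends can share one check (simplified; the real enumeration lives in org.apache.spark.TaskState):

  ```
  object TaskState extends Enumeration {
    val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value

    // Mesos TASK_LOST and TASK_ERROR both map onto failed states here.
    def isFailed(state: TaskState.Value): Boolean =
      state == LOST || state == FAILED
  }
  ```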
* Tighten up field/method visibility in Executor and made some code more clear to read (Reynold Xin, 2015-03-19; 5 files, -106/+120)
  I was reading Executor just now and found that some recent changes introduced some weird code paths with too much monadic chaining and unnecessary fields. I cleaned it up a bit, and also tightened up the visibility of various fields/methods. Also added some inline documentation to help understand this code better.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #4850 from rxin/executor and squashes the following commits: 866fc60 [Reynold Xin] Code review feedback. 020efbb [Reynold Xin] Tighten up field/method visibility in Executor and made some code more clear to read.
* [SPARK-6219] [Build] Check that Python code compiles (Nicholas Chammas, 2015-03-19; 2 files, -19/+29)
  This PR expands the Python lint checks so that they check for obvious compilation errors in our Python code. For example:

  ```
  $ ./dev/lint-python
  Python lint checks failed.
  Compiling ./ec2/spark_ec2.py ...
    File "./ec2/spark_ec2.py", line 618
      return (master_nodes,, slave_nodes)
                            ^
  SyntaxError: invalid syntax

  ./ec2/spark_ec2.py:618:25: E231 missing whitespace after ','
  ./ec2/spark_ec2.py:1117:101: E501 line too long (102 > 100 characters)
  ```

  This PR also bumps up the version of `pep8`. It ignores new types of checks introduced by that version bump while fixing problems missed by the older version of `pep8` we were using.
  Author: Nicholas Chammas <nicholas.chammas@gmail.com>
  Closes #4941 from nchammas/compile-spark-ec2 and squashes the following commits: 75e31d8 [Nicholas Chammas] upgrade pep8 + check compile b33651c [Nicholas Chammas] PEP8 line length
* [Core][minor] remove unused `visitedStages` in `DAGScheduler.stageDependsOn` (Wenchen Fan, 2015-03-19; 1 file, -2/+0)
  We define and update `visitedStages` in `DAGScheduler.stageDependsOn`, but never read it. So we can safely remove it.
  Author: Wenchen Fan <cloud0fan@outlook.com>
  Closes #5086 from cloud-fan/minor and squashes the following commits: 24663ea [Wenchen Fan] remove un-used variable
* [SPARK-5313][Project Infra]: Create simple framework for highlighting changes introduced in a PR (Brennon York, 2015-03-19; 3 files, -44/+135)
  Built a simple framework with a `dev/tests` directory to house all pull request related tests. I've moved the two original tests (`pr_merge_ability` and `pr_public_classes`) into the new `dev/tests` directory and tested to the best of my ability. At this point I need to test against Jenkins actually running the new `run-tests-jenkins` script to ensure things aren't broken down the path.
  Author: Brennon York <brennon.york@capitalone.com>
  Closes #5072 from brennonyork/SPARK-5313 and squashes the following commits: 8ae990c [Brennon York] added dev/run-tests back, removed echo 5db4ed4 [Brennon York] removed the git checkout 1b50050 [Brennon York] adding echos to see what jenkins is seeing b823959 [Brennon York] removed run-tests to further test the public_classes pr test 2b9ce12 [Brennon York] added the dev/run-tests call back in ffd49c0 [Brennon York] remove -c from bash as that was removing the trailing args 735d615 [Brennon York] removed the actual dev/run-tests command to further test jenkins d579662 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-5313 aa48029 [Brennon York] removed echo lines for testing jenkins 24cd965 [Brennon York] added test output to check within jenkins to verify 3a38e73 [Brennon York] removed the temporary read 9c881ff [Brennon York] updated test suite 183b7ee [Brennon York] added documentation on how to create tests 0bc2efe [Brennon York] ensure each test starts on the current pr branch 1743378 [Brennon York] added tests in test suite abd7430 [Brennon York] updated to include test suite
* [SPARK-6291] [MLLIB] GLM toString & toDebugString (Yanbo Liang, 2015-03-19; 3 files, -1/+14)
  GLM toString prints out intercept, numFeatures. For LogisticRegression and SVM model, toString also prints out numClasses, threshold. GLM toDebugString prints out the whole weights, intercept.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #5038 from yanboliang/spark-6291 and squashes the following commits: 2f578b0 [Yanbo Liang] code format 78b33f2 [Yanbo Liang] fix typos 1e8a023 [Yanbo Liang] GLM toString & toDebugString
* [SPARK-5843] [API] Allowing map-side combine to be specified in Java. (mcheah, 2015-03-19; 2 files, -12/+87)
  Specifically, when calling JavaPairRDD.combineByKey(), there is a new six-parameter method that exposes the map-side-combine boolean as the fifth parameter and the serializer as the sixth parameter.
  Author: mcheah <mcheah@palantir.com>
  Closes #4634 from mccheah/pair-rdd-map-side-combine and squashes the following commits: 5c58319 [mcheah] Fixing compiler errors. 3ce7deb [mcheah] Addressing style and documentation comments. 7455c7a [mcheah] Allowing Java combineByKey to specify Serializer as well. 6ddd729 [mcheah] [SPARK-5843] Allowing map-side combine to be specified in Java.
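  The Scala pair-RDD API already exposes these knobs; a sketch of the equivalent call the new Java overload mirrors (the combiner functions here are illustrative):

  ```
  import org.apache.spark.HashPartitioner

  val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

  // createCombiner, mergeValue, mergeCombiners, partitioner, then the two
  // parameters the Java API now also exposes: mapSideCombine and serializer.
  val sums = pairs.combineByKey[Int](
    (v: Int) => v,
    (c: Int, v: Int) => c + v,
    (c1: Int, c2: Int) => c1 + c2,
    new HashPartitioner(4),
    mapSideCombine = true,
    serializer = null)  // null means "use the configured default serializer"
  ```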
* [SPARK-6402][DOC] - Remove some references to shark in docs and ec2 (Pierre Borckmans, 2015-03-19; 3 files, -6/+3)
  The EC2 script and job scheduling documentation still referred to Shark. I removed these references. I also removed a remaining `SHARK_VERSION` variable from `ec2-variables.sh`.
  Author: Pierre Borckmans <pierre.borckmans@realimpactanalytics.com>
  Closes #5083 from pierre-borckmans/remove_refererences_to_shark_in_docs and squashes the following commits: 4e90ffc [Pierre Borckmans] Removed deprecated SHARK_VERSION caea407 [Pierre Borckmans] Remove shark reference from ec2 script doc 196c744 [Pierre Borckmans] Removed references to Shark
* [SPARK-4012] stop SparkContext when the exception is thrown from an infinite loop (CodingCat, 2015-03-18; 9 files, -16/+51)
  https://issues.apache.org/jira/browse/SPARK-4012
  This patch is a resubmission of https://github.com/apache/spark/pull/2864
  What I am proposing in this patch is that ***when the exception is thrown from an infinite loop, we should stop the SparkContext, instead of letting the JVM throw exceptions forever***. So, in the infinite loops that we originally wrapped with `logUncaughtExceptions`, I changed to `tryOrStopSparkContext`, so that the Spark component is stopped.
  An early-stopped JVM process is helpful for HA scheme design; for example, the user may have a script checking the existence of the pid of the Spark Streaming driver for monitoring availability. With the code before this patch, the JVM process is still alive but not functional when the exceptions are thrown.
  andrewor14, srowen, mind taking further consideration about the change?
  Author: CodingCat <zhunansjtu@gmail.com>
  Closes #5004 from CodingCat/SPARK-4012-1 and squashes the following commits: 589276a [CodingCat] throw fatal error again 3c72cd8 [CodingCat] address the comments 6087864 [CodingCat] revise comments 6ad3eb0 [CodingCat] stop SparkContext instead of quit the JVM process 6322959 [CodingCat] exit JVM process when the exception is thrown from an infinite loop
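  A minimal sketch of the `tryOrStopSparkContext` wrapper described above, assuming it lives alongside Spark's other utilities (simplified; the real version also distinguishes fatal errors):

  ```
  import org.apache.spark.SparkContext
  import scala.util.control.NonFatal

  // Run a block from a long-lived loop; if it throws, stop the SparkContext
  // so the JVM can shut down instead of logging the same exception forever.
  def tryOrStopSparkContext(sc: SparkContext)(block: => Unit): Unit = {
    try {
      block
    } catch {
      case NonFatal(e) =>
        sc.stop()  // bring the whole application down cleanly
        throw e
    }
  }
  ```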
* [SPARK-6222][Streaming] Don't delete checkpoint data when doing pre-batch-start checkpoint (Tathagata Das, 2015-03-19; 3 files, -12/+153)
  This is another alternative approach to https://github.com/apache/spark/pull/4964/ I think this is a simpler fix that can be backported easily to other branches (1.2 and 1.3). All it does is introduce a flag so that the pre-batch-start checkpoint does not call clear checkpoint. There is no unit test yet; I will add one when this approach is commented upon. Not sure if this is easily testable.
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #5008 from tdas/SPARK-6222 and squashes the following commits: 7315bc2 [Tathagata Das] Removed empty line. c438de4 [Tathagata Das] Revert unnecessary change. 5e98374 [Tathagata Das] Added unit test 50cb60b [Tathagata Das] Fixed style issue 295ca5c [Tathagata Das] Fixing SPARK-6222
* [SPARK-6394][Core] cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler (Wenchen Fan, 2015-03-18; 3 files, -23/+22)
  The current implementation includes searching a HashMap many times; we can avoid this. Actually, if you look into `BlockManager.blockIdsToBlockManagers`, the core function call is [this](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1258), so we can call `blockManagerMaster.getLocations` directly and avoid building a HashMap.
  Author: Wenchen Fan <cloud0fan@outlook.com>
  Closes #5043 from cloud-fan/small and squashes the following commits: e959d12 [Wenchen Fan] fix style 203c493 [Wenchen Fan] some cleanup in BlockManager companion object d409099 [Wenchen Fan] address rxin's comment faec999 [Wenchen Fan] add regression test 2fb57aa [Wenchen Fan] imporve the getCacheLocs method
* SPARK-6085 Part. 2 Increase default value for memory overhead (Jongyoul Lee, 2015-03-18; 1 file, -3/+3)
  - fixed a description of spark.mesos.executor.memoryOverhead from 7% to 10%
  - This is a second part of SPARK-6085
  Author: Jongyoul Lee <jongyoul@gmail.com>
  Closes #5065 from jongyoul/SPARK-6085-1 and squashes the following commits: c5af84c [Jongyoul Lee] SPARK-6085 Part. 2 Increase default value for memory overhead - Changed "MiB" to "MB" dbac1c0 [Jongyoul Lee] SPARK-6085 Part. 2 Increase default value for memory overhead - fixed a description of spark.mesos.executor.memoryOverhead from 7% to 10%
* [SPARK-6374] [MLlib] add get for GeneralizedLinearAlgo (Yuhao Yang, 2015-03-18; 1 file, -0/+10)
  I find it's better to have getters for numFeatures and addIntercept within GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to get the value through the debugger.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #5058 from hhbyyh/addGetLinear and squashes the following commits: 9dc90e8 [Yuhao Yang] add get for GeneralizedLinearAlgo
* [SPARK-6325] [core,yarn] Do not change target executor count when killing executors. (Marcelo Vanzin, 2015-03-18; 4 files, -6/+37)
  The dynamic execution code has two ways to reduce the number of executors: one where it reduces the total number of executors it wants, by asking for an absolute number of executors that is lower than the previous one. The second is by explicitly killing idle executors. YarnAllocator was mixing those up and lowering the target number of executors when a kill was issued. Instead, trust the frontend knows what it's doing, and kill executors without messing with other accounting. That means that if the frontend kills an executor without lowering the target, it will get a new executor shortly.
  The one situation where both actions (lower the target and kill executor) need to happen together is when user code explicitly calls `SparkContext.killExecutors`. In that case, issue two calls to the backend to achieve the goal.
  I also did some minor cleanup in related code:
  - avoid sending a request for executors when target is unchanged, to avoid log spam in the AM
  - avoid printing misleading log messages in the AM when there are no requests to cancel
  - fix a slow memory leak plus misleading error message on the driver caused by failing to completely unregister the executor.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #5018 from vanzin/SPARK-6325 and squashes the following commits: 2e782a3 [Marcelo Vanzin] Avoid redundant logging on the AM side. a3567cd [Marcelo Vanzin] Add parentheses. a363926 [Marcelo Vanzin] Update logic. a158101 [Marcelo Vanzin] [SPARK-6325] [core,yarn] Disallow reducing executor count past running count.
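  For context, the public entry point whose semantics this preserves is `SparkContext.killExecutors`; a usage sketch (executor IDs are illustrative, and the API requires a cluster manager with dynamic allocation support, e.g. YARN):

  ```
  // sc.killExecutors both lowers the target count AND kills, so the killed
  // executors are not replaced. The kill-only path (executor gets replaced
  // shortly) is internal to the scheduler backend; separating those two
  // behaviors is exactly what this patch does.
  val ok: Boolean = sc.killExecutors(Seq("1", "2"))
  ```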
* [SPARK-6286][minor] Handle missing Mesos case TASK_ERROR. (Iulian Dragos, 2015-03-18; 3 files, -19/+4)
  Author: Iulian Dragos <jaguarul@gmail.com>
  Closes #5000 from dragos/issue/task-error-case and squashes the following commits: e063627 [Iulian Dragos] Handle TASK_ERROR in Mesos scheduler backends. ac17cf0 [Iulian Dragos] Handle missing Mesos case TASK_ERROR.
* SPARK-6389 YARN app diagnostics report doesn't report NPEs (Steve Loughran, 2015-03-18; 1 file, -3/+3)
  Trivial patch to implicitly call `Exception.toString()` over `Exception.getMessage()`; this defaults to including the exception class & any non-null message, and some subclasses include more. No test.
  Author: Steve Loughran <stevel@hortonworks.com>
  Closes #5070 from steveloughran/stevel/patches/SPARK-6389-NPE-reporting and squashes the following commits: 8239d85 [Steve Loughran] SPARK-6389 cull use of getMessage over toString in the container launcher 6fbaf6a [Steve Loughran] SPARK-6389 YARN app diagnostics report doesn't report NPEs
* [SPARK-6372] [core] Propagate --conf to child processes. (Marcelo Vanzin, 2015-03-18; 2 files, -9/+5)
  And add unit test.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #5057 from vanzin/SPARK-6372 and squashes the following commits: b33728b [Marcelo Vanzin] [SPARK-6372] [core] Propagate --conf to child processes.
* [SPARK-6247][SQL] Fix resolution of ambiguous joins caused by new aliases (Michael Armbrust, 2015-03-17; 9 files, -12/+96)
  We need to handle ambiguous `exprId`s that are produced by new aliases as well as those caused by leaf nodes (`MultiInstanceRelation`). Attempting to fix this revealed a bug in `equals` for `Alias`, as these objects were comparing equal even when the expression ids did not match. Additionally, `LocalRelation` did not correctly provide statistics, and some tests in `catalyst` and `hive` were not using the helper functions for comparing plans.
  Based on #4991 by chenghao-intel
  Author: Michael Armbrust <michael@databricks.com>
  Closes #5062 from marmbrus/selfJoins and squashes the following commits: 8e9b84b [Michael Armbrust] check qualifier too 8038a36 [Michael Armbrust] handle aggs too 0b9c687 [Michael Armbrust] fix more tests c3c574b [Michael Armbrust] revert change. 725f1ab [Michael Armbrust] add statistics a925d08 [Michael Armbrust] check for conflicting attributes in join resolution b022ef7 [Michael Armbrust] Handle project aliases. d8caa40 [Michael Armbrust] test case: SPARK-6247 f9c67c2 [Michael Armbrust] Check for duplicate attributes in join resolution. 898af73 [Michael Armbrust] Fix Alias equality.
* [SPARK-5651][SQL] Add input64 in blacklist and add test suite for create table within backticks (watermen, 2015-03-17; 6 files, -1/+513)
  Spark currently only supports ```create table table_in_database_creation.test1 as select * from src limit 1;``` in HiveContext. This patch adds support for ```create table `table_in_database_creation.test2` as select * from src limit 1;``` in HiveContext.
  Author: watermen <qiyadong2010@gmail.com>
  Author: q00251598 <qiyadong@huawei.com>
  Closes #4427 from watermen/SPARK-5651 and squashes the following commits: c5c8ed1 [watermen] add the generated golden files 1f0e42e [q00251598] add input64 in blacklist and add test suit
* [SPARK-5404] [SQL] Update the default statistic number (Cheng Hao, 2015-03-17; 1 file, -1/+11)
  By default, the statistics for a logical plan with multiple children are quite aggressive, and those statistics are critical for join optimization, hence we need to estimate them as accurately as possible. For `Union`, which has 2 children, we overwrite the default implementation by `adding` its children's `sizeInBytes` instead of `multiplying` them. `Expand` has only a single child, but it grows the size, so we need to multiply by its inflation factor.
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #4914 from chenghao-intel/statistic and squashes the following commits: d466bbc [Cheng Hao] Update the default statistic
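  A toy model of the two estimation rules described above (deliberately not Catalyst's actual classes; a self-contained sketch of the sum-vs-product reasoning):

  ```
  trait Plan { def sizeInBytes: BigInt }

  // A union's output is the concatenation of its children,
  // so its size estimate should be a sum, not the default product.
  case class Union(children: Seq[Plan]) extends Plan {
    def sizeInBytes: BigInt = children.map(_.sizeInBytes).sum
  }

  // Expand replicates each input row, so the single child's size multiplies.
  case class Expand(child: Plan, inflationFactor: Int) extends Plan {
    def sizeInBytes: BigInt = child.sizeInBytes * inflationFactor
  }
  ```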
* [SPARK-5908][SQL] Resolve UdtfsAlias when only single Alias is used (Liang-Chi Hsieh, 2015-03-17; 2 files, -0/+9)
  `ResolveUdtfsAlias` in `hiveUdfs` only considers a `HiveGenericUdtf` with multiple aliases. When only a single alias is used with `HiveGenericUdtf`, the alias does not work.
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #4692 from viirya/udft_alias and squashes the following commits: 8a3bae4 [Liang-Chi Hsieh] No need to test selected column from DataFrame since DataFrame API is updated. 160a379 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into udft_alias e6531cc [Liang-Chi Hsieh] Selected column from DataFrame should not re-analyze logical plan. a45cc2a [Liang-Chi Hsieh] Resolve UdtfsAlias when only single Alias is used.
* [SPARK-6383][SQL] Fixed compiler and errors in Dataframe examples (Tijo Thomas, 2015-03-17; 1 file, -6/+6)
  Author: Tijo Thomas <tijoparacka@gmail.com>
  Closes #5068 from tijoparacka/fix_sql_dataframe_example and squashes the following commits: 6953ac1 [Tijo Thomas] Handled Java and Python example sections 0751a74 [Tijo Thomas] Fixed compiler and errors in Dataframe examples
* [SPARK-6366][SQL] In Python API, the default save mode for save and saveAsTable should be "error" instead of "append". (Yin Huai, 2015-03-18; 1 file, -2/+2)
  https://issues.apache.org/jira/browse/SPARK-6366
  Author: Yin Huai <yhuai@databricks.com>
  Closes #5053 from yhuai/SPARK-6366 and squashes the following commits: fc81897 [Yin Huai] Use error as the default save mode for save/saveAsTable.
* [SPARK-6330] [SQL] Add a test case for SPARK-6330 (Pei-Lun Lee, 2015-03-18; 1 file, -0/+13)
  When getting file statuses, create file system from each path instead of a single one from hadoop configuration.
  Author: Pei-Lun Lee <pllee@appier.com>
  Closes #5039 from ypcat/spark-6351 and squashes the following commits: a19a3fe [Pei-Lun Lee] [SPARK-6330] [SQL] fix test 506f5a0 [Pei-Lun Lee] [SPARK-6351] [SQL] fix test fa2290e [Pei-Lun Lee] [SPARK-6330] [SQL] Rename test case and add comment 606c967 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6351 896e80a [Pei-Lun Lee] [SPARK-6351] [SQL] Add test case 2ae0916 [Pei-Lun Lee] [SPARK-6351] [SQL] ParquetRelation2 supporting multiple file systems
* [SPARK-6226][MLLIB] add save/load in PySpark's KMeansModel (Xiangrui Meng, 2015-03-17; 3 files, -5/+32)
  Use `_py2java` and `_java2py` to convert Python model to/from Java model. yinxusen
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5049 from mengxr/SPARK-6226-mengxr and squashes the following commits: 570ba81 [Xiangrui Meng] fix python style b10b911 [Xiangrui Meng] add save/load in PySpark's KMeansModel
* [SPARK-6336] LBFGS should document what convergenceTol means (lewuathe, 2015-03-17; 2 files, -1/+9)
  LBFGS uses a convergence tolerance. This value should be documented as an argument.
  Author: lewuathe <lewuathe@me.com>
  Closes #5033 from Lewuathe/SPARK-6336 and squashes the following commits: e738b33 [lewuathe] Modify text to be more natural ac03c3a [lewuathe] Modify documentations 6ccb304 [lewuathe] [SPARK-6336] LBFGS should document what convergenceTol means
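  For context, `convergenceTol` is (roughly) the relative-improvement threshold below which the optimizer stops iterating; a usage sketch against the MLlib optimizer API, with illustrative values:

  ```
  import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

  // Smaller convergenceTol => stricter stopping criterion, more iterations.
  val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
    .setConvergenceTol(1e-4)   // tolerance on the improvement between iterations
    .setNumIterations(100)
  ```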
* [SPARK-6313] Add config option to disable file locks/fetchFile cache to support NFS mounts. (nemccarthy, 2015-03-17; 2 files, -1/+14)
  This is a workaround for now, with the goal of finding a more permanent solution. https://issues.apache.org/jira/browse/SPARK-6313
  Author: nemccarthy <nathan@nemccarthy.me>
  Closes #5036 from nemccarthy/master and squashes the following commits: 2eaaf42 [nemccarthy] [SPARK-6313] Update config wording doc for spark.files.useFetchCache 5de7eb4 [nemccarthy] [SPARK-6313] Add config option to disable file locks/fetchFile cache to support NFS mounts
* [SPARK-3266] Use intermediate abstract classes to fix type erasure issues in Java APIs (Josh Rosen, 2015-03-17; 8 files, -5/+152)
  This PR addresses a Scala compiler bug ([SI-8905](https://issues.scala-lang.org/browse/SI-8905)) that was breaking some of the Spark Java APIs. In a nutshell, it seems that methods whose implementations are inherited from generic traits sometimes have their type parameters erased to Object. This was causing methods like `DoubleRDD.min()` to throw confusing NoSuchMethodErrors at runtime.
  The fix implemented here is to introduce an intermediate layer of abstract classes and inherit from those instead of directly extending the `Java*Like` traits. This should not break binary compatibility.
  I also improved the test coverage of the Java API, adding several new tests for methods that failed at runtime due to this bug.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #5050 from JoshRosen/javardd-si-8905-fix and squashes the following commits: 2feb068 [Josh Rosen] Use intermediate abstract classes to work around SPARK-3266 d5f3e5d [Josh Rosen] Add failing regression tests for SPARK-3266
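  A hedged sketch of the workaround's shape (the names are illustrative, not Spark's actual classes): instead of a concrete class mixing in the generic trait directly, an abstract class is interposed so scalac emits correctly typed bridge methods:

  ```
  // The generic trait whose inherited methods got erased to Object (SI-8905):
  trait RDDLike[T, This] {
    def first(): T
  }

  // Workaround: an intermediate abstract class. Concrete Java-facing classes
  // extend this instead of mixing in the trait directly, which restores the
  // specialized method signatures in bytecode without breaking binary compat.
  abstract class AbstractJavaDoubleRDD extends RDDLike[Double, AbstractJavaDoubleRDD]

  class JavaDoubleRDD(values: Seq[Double]) extends AbstractJavaDoubleRDD {
    override def first(): Double = values.head
  }
  ```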
* [SPARK-6365] jetty-security also needed for SPARK_PREPEND_CLASSES to work (Imran Rashid, 2015-03-17; 1 file, -1/+1)
  https://issues.apache.org/jira/browse/SPARK-6365
  thanks vanzin for helping me figure this out
  Author: Imran Rashid <irashid@cloudera.com>
  Closes #5052 from squito/fix_prepend_classes and squashes the following commits: 09d334c [Imran Rashid] jetty-security also needed for SPARK_PREPEND_CLASSES to work
* [SPARK-6331] Load new master URL if present when recovering streaming context from checkpoint (Tathagata Das, 2015-03-17; 4 files, -7/+25)
  In streaming driver recovery, when the SparkConf is reconstructed based on the checkpointed configuration, it recovers the old master URL. This is okay if the cluster on which the streaming application is relaunched is the same cluster as it was running on before. But if that cluster changes, there is no way to inject the new master URL of the new cluster. As a result, the restarted app tries to connect to the non-existent old cluster and fails.
  The solution is to check whether a master URL is set in the system properties (by Spark submit) before recreating the SparkConf. If a new master URL is set in the properties, then use it, as that is obviously the most relevant one. Otherwise load the old one (to maintain existing behavior).
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #5024 from tdas/SPARK-6331 and squashes the following commits: 392fd44 [Tathagata Das] Fixed naming issue. c7c0b99 [Tathagata Das] Addressed comments. 6a0857c [Tathagata Das] Updated testsuites. 222485d [Tathagata Das] Load new master URL if present when recovering streaming context from checkpoint
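  A minimal sketch of the recovery logic described, simplified from what the checkpoint-side conf reconstruction might do (function name and shape are hedged):

  ```
  import org.apache.spark.SparkConf

  // When rebuilding SparkConf from a checkpoint, prefer a master URL
  // injected by spark-submit over the one frozen into the checkpoint.
  def createSparkConf(checkpointedConf: SparkConf): SparkConf = {
    val conf = checkpointedConf.clone()
    sys.props.get("spark.master").foreach(conf.setMaster)  // new cluster wins
    conf
  }
  ```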
* [docs] [SPARK-4820] Spark build encounters "File name too long" on some encrypted filesystems (Theodore Vasiloudis, 2015-03-17; 1 file, -0/+12)
  Added a note instructing users how to build Spark in an encrypted file system.
  Author: Theodore Vasiloudis <tvas@sics.se>
  Closes #5041 from thvasilo/patch-2 and squashes the following commits: 09d890b [Theodore Vasiloudis] Workaroung for buiding in an encrypted filesystem
* [SPARK-6269] [CORE] Use ScalaRunTime's array methods instead of java.lang.reflect.Array in size estimation (mcheah, 2015-03-17; 1 file, -13/+15)
  This patch switches the usage of java.lang.reflect.Array in size estimation to using Scala's ScalaRunTime array-getter methods. The notes on https://bugs.openjdk.java.net/browse/JDK-8051447 tipped me off to the fact that using java.lang.reflect.Array was not ideal. At first, I used the code from that ticket, but it turns out that ScalaRunTime's array-related methods avoid the bottleneck of invoking native code anyways, so that was sufficient to boost performance in size estimation. The idea is to use pure Java code in implementing the methods there, as opposed to relying on native C code which ends up being ill-performing. This improves the performance of estimating the size of arrays when we are checking for spilling in Spark.
  Here's the benchmark discussion from the ticket: I did two tests. The first, less convincing, take-with-a-block-of-salt test I did was do a simple groupByKey operation to collect objects in a 4.0 GB text file RDD into 30,000 buckets. I ran 1 Master and 4 Spark Worker JVMs on my mac, fetching the RDD from a text file simply stored on disk, and saving it out to another file located on local disk. The wall clock times I got back before and after the change were:
  Before: 352.195s, 343.871s, 359.080s
  After (using code directly from the JDK ticket, not the scala code in this PR): 342.929583s, 329.456623s, 326.151481s
  So, there is a bit of an improvement after the change. I also did some YourKit profiling of the executors to get an idea of how much time was spent in size estimation before and after the change. I roughly saw that size estimation took up less of the time after my change, but YourKit's profiling can be inconsistent and who knows if I was profiling the executors that had the same data between runs?
  The more convincing test I did was to run the size-estimation logic itself in an isolated unit test. I ran the following code:
  ```
  val bigArray = Array.fill(1000)(Array.fill(1000)(java.util.UUID.randomUUID().toString()))
  test("String arrays only perf testing") {
    val startTime = System.currentTimeMillis()
    for (i <- 1 to 50000) {
      SizeEstimator.estimate(bigArray)
    }
    println("Runtime: " + (System.currentTimeMillis() - startTime) / 1000.0000)
  }
  ```
  I wanted to use a 2D array specifically because I wanted to measure the performance of repeatedly calling Array.getLength. I used UUID-Strings to ensure that the strings were randomized (so String object re-use doesn't happen), but that they would all be the same size. The results were as follows:
  Before PR: 222.681 s, 218.34 s, 211.739s
  After latest change: 170.715 s, 176.775 s, 180.298 s
  Author: mcheah <mcheah@palantir.com>
  Author: Justin Uang <justin.uang@gmail.com>
  Closes #4972 from mccheah/feature/spark-6269-reflect-array and squashes the following commits: 8527852 [mcheah] Respect CamelCase for numElementsDrawn 18d4b50 [mcheah] Addressing style comments - while loops instead of for loops 16ce534 [mcheah] Organizing imports properly db890ea [mcheah] Removing CastedArray and just using ScalaRunTime. cb67ce2 [mcheah] Fixing a scalastyle error - line too long 5d53c4c [mcheah] Removing unused parameter in visitArray. 6467759 [mcheah] Including primitive size information inside CastedArray. 93f4b05 [mcheah] Using Scala instead of Java for the array-reflection implementation. a557ab8 [mcheah] Using a wrapper around arrays to do casting only once ca063fc [mcheah] Fixing a compiler error made while refactoring style 1fe09de [Justin Uang] [SPARK-6269] Use a different implementation of java.lang.reflect.Array
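  The replacement calls live in `scala.runtime.ScalaRunTime`; a small sketch of using them to walk an arbitrary array without `java.lang.reflect.Array` (the actual sampling logic of SizeEstimator is omitted):

  ```
  import scala.runtime.ScalaRunTime

  // Works for any array type (Int[], String[], Object[], ...) without
  // dispatching into the JVM's native java.lang.reflect.Array code.
  def describeArray(arr: AnyRef): Unit = {
    val length = ScalaRunTime.array_length(arr)
    var i = 0
    while (i < length) {  // while loop, matching the commit's style notes
      val elem = ScalaRunTime.array_apply(arr, i)
      // ... estimate the size of `elem` here ...
      i += 1
    }
    println(s"array of length $length")
  }
  ```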
* [SPARK-4011] tighten the visibility of the members in Master/Worker class (CodingCat, 2015-03-17; 49 files, -265/+277)
  https://issues.apache.org/jira/browse/SPARK-4011
  Currently, most of the members in Master/Worker are with public accessibility. We might wish to tighten the accessibility of them a bit more discussion is here: https://github.com/apache/spark/pull/2828
  Author: CodingCat <zhunansjtu@gmail.com>
  Closes #4844 from CodingCat/SPARK-4011 and squashes the following commits: 1a64175 [CodingCat] fix compilation issue e7fd375 [CodingCat] Sean is right.... f5034a4 [CodingCat] fix rebase mistake 8d5b0c0 [CodingCat] loose more fields 0072f96 [CodingCat] lose some restrictions based on the possible design intention de77286 [CodingCat] tighten accessibility of deploy package 12b4fd3 [CodingCat] tighten accessibility of deploy.worker 1243bc7 [CodingCat] tighten accessibility of deploy.rest c5f622c [CodingCat] tighten the accessibility of deploy.history d441e20 [CodingCat] tighten accessibility of deploy.client 4e0ce4a [CodingCat] tighten the accessibility of the members of classes in master 23cddbb [CodingCat] stylistic fix 9a3a340 [CodingCat] tighten the access of worker class 67a0559 [CodingCat] tighten the access permission in Master
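  The mechanism used for this is Scala's package-qualified access modifiers; a tiny illustration (the members shown are hypothetical):

  ```
  package org.apache.spark.deploy.master

  private[deploy] class Master {
    // visible anywhere under org.apache.spark.deploy, but not to user code
    private[master] var workers = Set.empty[String]
    // visible to all of Spark's own packages
    private[spark] def boundPort: Int = 7077
  }
  ```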
* SPARK-6044 [CORE] RDD.aggregate() should not use the closure serializer on the zero value (Sean Owen, 2015-03-16; 1 file, -1/+1)
  Use configured serializer in RDD.aggregate, to match PairRDDFunctions.aggregateByKey, instead of closure serializer. Compare with https://github.com/apache/spark/blob/e60ad2f4c47b011be7a3198689ac2b82ee317d96/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L127
  Author: Sean Owen <sowen@cloudera.com>
  Closes #5028 from srowen/SPARK-6044 and squashes the following commits: a4040a7 [Sean Owen] Use configured serializer in RDD.aggregate, to match PairRDDFunctions.aggregateByKey, instead of closure serializer
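  For reference, `aggregate` is the call whose zero value gets serialized and shipped to each partition; a minimal usage sketch (values are illustrative):

  ```
  val nums = sc.parallelize(1 to 100, numSlices = 4)

  // zeroValue is shipped to each partition; this commit makes that use the
  // configured serializer (e.g. Kryo), matching aggregateByKey.
  val (sum, count) = nums.aggregate((0, 0))(
    (acc, x) => (acc._1 + x, acc._2 + 1),    // seqOp: fold one value in
    (a, b) => (a._1 + b._1, a._2 + b._2))    // combOp: merge partition results
  println(sum.toDouble / count)              // mean = 50.5
  ```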
* [SPARK-6357][GraphX] Add unapply in EdgeContext (Takeshi YAMAMURO, 2015-03-16; 1 file, -0/+17)
  This extractor is mainly used for Graph#aggregateMessages*.
  Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
  Closes #5047 from maropu/AddUnapplyInEdgeContext and squashes the following commits: 87e04df [Takeshi YAMAMURO] Add unapply in EdgeContext
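  With the extractor, the five fields of an `EdgeContext` can be pattern-matched inside `aggregateMessages`; a hedged usage sketch (the in-degree computation is illustrative):

  ```
  import org.apache.spark.graphx._

  def inDegrees[VD, ED](graph: Graph[VD, ED]): VertexRDD[Int] =
    graph.aggregateMessages[Int](
      ctx => ctx match {
        // destructure the context via the new extractor
        case EdgeContext(srcId, dstId, srcAttr, dstAttr, attr) => ctx.sendToDst(1)
      },
      _ + _)  // merge messages arriving at the same vertex
  ```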
* [SQL][docs][minor] Fixed sample code in SQLContext scaladoc (Lomig Mégard, 2015-03-16; 1 file, -2/+2)
  Error in the code sample of the `implicits` object in `SQLContext`.
  Author: Lomig Mégard <lomig.megard@gmail.com>
  Closes #5051 from tarfaa/simple and squashes the following commits: 5a88acc [Lomig Mégard] [docs][minor] Fixed sample code in SQLContext scaladoc
* [SPARK-6299][CORE] ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL (Kevin (Sangwoo) Kim, 2015-03-16; 3 files, -39/+63)
  ```
  case class ClassA(value: String)
  val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2")) ))
  rdd.groupByKey.collect
  ```
  This code used to throw an exception in spark-shell, because while shuffling, `JavaSerializer` uses `defaultClassLoader`, which was defined like `env.serializer.setDefaultClassLoader(urlClassLoader)`. It should be `env.serializer.setDefaultClassLoader(replClassLoader)`, like
  ```
  override def run() {
    val deserializeStartTime = System.currentTimeMillis()
    Thread.currentThread.setContextClassLoader(replClassLoader)
  ```
  in TaskRunner. When `replClassLoader` cannot be defined, it's identical to `urlClassLoader`.
  Author: Kevin (Sangwoo) Kim <sangwookim.me@gmail.com>
  Closes #5046 from swkimme/master and squashes the following commits: fa2b9ee [Kevin (Sangwoo) Kim] stylish test codes ( collect -> collect() ) 6e9620b [Kevin (Sangwoo) Kim] stylish test codes ( collect -> collect() ) d23e4e2 [Kevin (Sangwoo) Kim] stylish test codes ( collect -> collect() ) a4a3c8a [Kevin (Sangwoo) Kim] add 'class defined in repl - shuffle' test to ReplSuite bd00da5 [Kevin (Sangwoo) Kim] add 'class defined in repl - shuffle' test to ReplSuite c1b1fc7 [Kevin (Sangwoo) Kim] use REPL class loader for executor's serializer
* [SPARK-5712] [SQL] fix comment with semicolon at end (Daoyuan Wang, 2015-03-17; 4 files, -13/+19)
  ---- comment;
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #4500 from adrian-wang/semicolon and squashes the following commits: 70b8abb [Daoyuan Wang] use mkstring instead of reduce 2d49738 [Daoyuan Wang] remove outdated golden file 317346e [Daoyuan Wang] only skip comment with semicolon at end of line, to avoid golden file outdated d3ae01e [Daoyuan Wang] fix error a11602d [Daoyuan Wang] fix comment with semicolon at end
* [SPARK-6327] [PySpark] fix launch spark-submit from python (Davies Liu, 2015-03-16; 2 files, -5/+2)
  SparkSubmit should be launched without setting PYSPARK_SUBMIT_ARGS. cc JoshRosen, this mode is actually used by the Python unit tests, so I will not add more tests for it.
  Author: Davies Liu <davies@databricks.com>
  Closes #5019 from davies/fix_submit and squashes the following commits: 2c20b0c [Davies Liu] fix launch spark-submit from python
* [SPARK-6077] Remove streaming tab while stopping StreamingContext (lisurprise, 2015-03-16; 9 files, -101/+179)
  Currently we would create a new streaming tab for each StreamingContext even if there's already one on the same SparkContext, which would cause duplicate StreamingTabs to be created with none of them taking effect.
  snapshot: https://www.dropbox.com/s/t4gd6hqyqo0nivz/bad%20multiple%20streamings.png?dl=0
  How to reproduce:
  1)
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.storage.StorageLevel
      val ssc = new StreamingContext(sc, Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
      val words = lines.flatMap(_.split(" "))
      val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()
      .....
  2)
      ssc.stop(false)
      val ssc = new StreamingContext(sc, Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
      val words = lines.flatMap(_.split(" "))
      val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()
  Author: lisurprise <zhichao.li@intel.com>
  Closes #4828 from zhichao-li/master and squashes the following commits: c329806 [lisurprise] add test for attaching/detaching streaming tab 51e6c7f [lisurprise] move detach method into StreamingTab 31a44fa [lisurprise] add unit test for attaching and detaching new tab db25ed2 [lisurprise] clean code 8281bcb [lisurprise] clean code 193c542 [lisurprise] remove streaming tab while closing streaming context
* [SPARK-6330] Fix filesystem bug in newParquet relation (Volodymyr Lyubinets, 2015-03-16; 1 file, -2/+3)
  If I'm running this locally and my path points to S3, this would currently error out because of an incorrect FS. I tested this in a scenario that previously didn't work; this change seemed to fix the issue.
  Author: Volodymyr Lyubinets <vlyubin@gmail.com>
  Closes #5020 from vlyubin/parquertbug and squashes the following commits: a645ad5 [Volodymyr Lyubinets] Fix filesystem bug in newParquet relation
* [SPARK-2087] [SQL] Multiple thriftserver sessions with single HiveContext instance (Cheng Hao, 2015-03-17; 8 files, -105/+353)
  Still, we keep only a single HiveContext within ThriftServer, and we also create an object called `SQLSession` for isolating the different user states. Developers can obtain/release a new user session via `openSession` and `closeSession`, and `SQLContext` and `HiveContext` will also provide a default session if no `openSession` is called, for backward compatibility.
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #4885 from chenghao-intel/multisessions_singlecontext and squashes the following commits: 1c47b2a [Cheng Hao] rename the tss => tlSession 815b27a [Cheng Hao] code style issue 57e3fa0 [Cheng Hao] openSession is not compatible between Hive0.12 & 0.13.1 4665b0d [Cheng Hao] thriftservice with single context
* [SPARK-6300][Spark Core] sc.addFile(path) does not support the relative path. (DoingDone9, 2015-03-16; 2 files, -15/+38)
  When running a command like `sc.addFile("../test.txt")`, it did not work and threw an exception:
      java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:../test.txt
        at org.apache.hadoop.fs.Path.initialize(Path.java:206)
        at org.apache.hadoop.fs.Path.<init>(Path.java:172)
        ........
        .......
      Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:../test.txt
        at java.net.URI.checkPath(URI.java:1804)
        at java.net.URI.<init>(URI.java:752)
        at org.apache.hadoop.fs.Path.initialize(Path.java:203)
  Author: DoingDone9 <799203320@qq.com>
  Closes #4993 from DoingDone9/relativePath and squashes the following commits: ee375cd [DoingDone9] Update SparkContextSuite.scala d594e16 [DoingDone9] Update SparkContext.scala 0ff3fa8 [DoingDone9] test for add file dced8eb [DoingDone9] Update SparkContext.scala e4a13fe [DoingDone9] getCanonicalPath 161cae3 [DoingDone9] Merge pull request #4 from apache/master c87e8b6 [DoingDone9] Merge pull request #3 from apache/master cb1852d [DoingDone9] Merge pull request #2 from apache/master c3f046f [DoingDone9] Merge pull request #1 from apache/master
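  After the fix, relative paths are canonicalized before being turned into a URI (the squashed commits mention `getCanonicalPath`); a usage sketch:

  ```
  // Both forms should now work; the relative path is resolved against the
  // driver's working directory first.
  sc.addFile("../test.txt")
  sc.addFile("/absolute/path/test.txt")  // unchanged behavior

  // On executors, retrieve the distributed file by name:
  val local = org.apache.spark.SparkFiles.get("test.txt")
  ```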
* [SPARK-5922][GraphX]: Add diff(other: RDD[VertexId, VD]) in VertexRDD (Brennon York, 2015-03-16; 4 files, -0/+29)
  Changed method invocation of 'diff' to match that of 'innerJoin' and 'leftJoin' from VertexRDD[VD] to RDD[(VertexId, VD)]. This change maintains backwards compatibility and better unifies the VertexRDD methods to match each other.
  Author: Brennon York <brennon.york@capitalone.com>
  Closes #4733 from brennonyork/SPARK-5922 and squashes the following commits: e800f08 [Brennon York] fixed merge conflicts b9274af [Brennon York] fixed merge conflicts f86375c [Brennon York] fixed minor include line 398ddb4 [Brennon York] fixed merge conflicts aac1810 [Brennon York] updated to aggregateUsingIndex and added test to ensure that method works properly 2af0b88 [Brennon York] removed deprecation line 753c963 [Brennon York] fixed merge conflicts and set preference to use the diff(other: VertexRDD[VD]) method 2c678c6 [Brennon York] added mima exclude to exclude new public diff method from VertexRDD 93186f3 [Brennon York] added back the original diff method to sustain binary compatibility f18356e [Brennon York] changed method invocation of 'diff' to match that of 'innerJoin' and 'leftJoin' from VertexRDD[VD] to RDD[(VertexId, VD)]
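  A hedged usage sketch of the widened signature (vertex data is illustrative): `diff` returns, for vertices present in both sets, only those whose values differ, and its argument can now be any pair RDD rather than only a `VertexRDD`:

  ```
  import org.apache.spark.graphx._
  import org.apache.spark.rdd.RDD

  def changedVertices(current: VertexRDD[String],
                      updated: RDD[(VertexId, String)]): VertexRDD[String] = {
    // Previously diff required a VertexRDD[VD]; it now also accepts
    // RDD[(VertexId, VD)], mirroring innerJoin and leftJoin.
    current.diff(updated)
  }
  ```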