Commit message | Author | Age | Files | Lines
* [SPARK-4050][SQL] Fix caching of temporary tables with projections. | Michael Armbrust | 2014-10-24 | 4 | -12/+48

Previously cached data was found by `sameResult` plan matching on optimized plans. This technique however fails to locate the cached data when a temporary table with a projection is queried with a further reduced projection. The failure is due to the fact that optimization will collapse the projections, producing a plan that no longer produces the sameResult as the cached data (though the cached data still subsumes the desired data). For example consider the following previously failing test case.

```scala
sql("CACHE TABLE tempTable AS SELECT key FROM testData")
assertCached(sql("SELECT COUNT(*) FROM tempTable"))
```

In this PR I change the matching to occur after analysis instead of optimization, so that in the case of temporary tables, the plans will always match. I think this should work generally, however, this error does raise questions about the need to do more thorough subsumption checking when locating cached data.

Another question is what sort of semantics we want to provide when uncaching data from temporary tables. For example consider the following sequence of commands:

```scala
testData.select('key).registerTempTable("tempTable1")
testData.select('key).registerTempTable("tempTable2")
cacheTable("tempTable1")

// This obviously works.
assertCached(sql("SELECT COUNT(*) FROM tempTable1"))

// It seems good that this works ...
assertCached(sql("SELECT COUNT(*) FROM tempTable2"))

// ... but is this valid?
uncacheTable("tempTable2")

// Should this still be cached?
assertCached(sql("SELECT COUNT(*) FROM tempTable1"), 0)
```

Author: Michael Armbrust <michael@databricks.com>

Closes #2912 from marmbrus/cachingBug and squashes the following commits:

9c822d4 [Michael Armbrust] remove commented out code
5c72fb7 [Michael Armbrust] Add a test case / question about uncaching semantics.
63a23e4 [Michael Armbrust] Perform caching on analyzed instead of optimized plan.
03f1cfe [Michael Armbrust] Clean-up / add tests to SameResult suite.
* [SPARK-4051] [SQL] [PySpark] Convert Row into dictionary | Davies Liu | 2014-10-24 | 2 | -0/+21

Added a method to Row to turn a row into a dict:

```
>>> row = Row(a=1)
>>> row.asDict()
{'a': 1}
```

Author: Davies Liu <davies@databricks.com>

Closes #2896 from davies/dict and squashes the following commits:

8d97366 [Davies Liu] convert Row into dict
* [SPARK-3900][YARN] ApplicationMaster's shutdown hook fails and IllegalStateException is thrown. | Kousuke Saruta | 2014-10-24 | 1 | -3/+3

ApplicationMaster registers a shutdown hook that calls ApplicationMaster#cleanupStagingDir. cleanupStagingDir invokes FileSystem.get(yarnConf), which invokes FileSystem.getInternal. FileSystem.getInternal also registers a shutdown hook. In the FileSystem of Hadoop 0.23, the shutdown hook registration does not consider whether shutdown is in progress (in 2.2, it does).

    // 0.23
    if (map.isEmpty() ) {
      ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY);
    }

    // 2.2
    if (map.isEmpty() && !ShutdownHookManager.get().isShutdownInProgress()) {
      ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY);
    }

Thus, in 0.23, another shutdown hook can be registered while ApplicationMaster's shutdown hook runs. This causes an IllegalStateException as follows.

    java.lang.IllegalStateException: Shutdown in progress, cannot add a shutdownHook
      at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:152)
      at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2306)
      at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278)
      at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316)
      at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162)
      at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:307)
      at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:118)
      at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2924 from sarutak/SPARK-3900-2 and squashes the following commits:

9112817 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900-2
97018fa [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
2c2850e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
ee52db2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
a7d6c9b [Kousuke Saruta] Merge branch 'SPARK-3900' of github.com:sarutak/spark into SPARK-3900
1cdf03c [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
a5f6443 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
57b397d [Kousuke Saruta] Fixed IllegalStateException caused by shutdown hook registration in another shutdown hook
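
The difference between the two Hadoop versions described in SPARK-3900 can be sketched with a toy model in Python. This is only an illustration under stated assumptions: `ShutdownHookManager`, `get_filesystem`, and `run_cleanup_during_shutdown` below are hypothetical stand-ins, not the Hadoop or Spark classes, and the sketch shows the failure mode rather than the change made in this commit.

```python
class ShutdownHookManager:
    """Toy stand-in for a shutdown hook registry (hypothetical, simplified)."""

    def __init__(self):
        self.hooks = []
        self.shutdown_in_progress = False

    def add_shutdown_hook(self, hook):
        # Registering a hook while shutdown is running is an error,
        # matching the message in the stack trace above.
        if self.shutdown_in_progress:
            raise RuntimeError("Shutdown in progress, cannot add a shutdownHook")
        self.hooks.append(hook)

    def run_hooks(self):
        self.shutdown_in_progress = True
        for hook in list(self.hooks):
            hook()


def get_filesystem(manager, cache, guarded):
    """Toy FileSystem.get: registers a finalizer hook when the cache is empty.

    guarded=False models the 0.23 check, guarded=True models the 2.2 check.
    """
    if not cache and (not guarded or not manager.shutdown_in_progress):
        manager.add_shutdown_hook(cache.clear)
    cache["fs"] = "filesystem"
    return cache["fs"]


def run_cleanup_during_shutdown(guarded):
    """Run a cleanup hook that needs a filesystem and report what happened."""
    manager = ShutdownHookManager()
    cache = {}
    outcome = []

    def cleanup_staging_dir():
        try:
            get_filesystem(manager, cache, guarded)
            outcome.append("ok")
        except RuntimeError as err:
            outcome.append(str(err))

    manager.add_shutdown_hook(cleanup_staging_dir)
    manager.run_hooks()
    return outcome[0]
```

With `guarded=False` the cleanup hook fails with the "Shutdown in progress" error; with `guarded=True` it completes, because the nested registration is skipped.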
* [SPARK-2652] [PySpark] do not use KryoSerializer as default serializer | Davies Liu | 2014-10-23 | 1 | -1/+0

KryoSerializer cannot serialize custom classes unless they are registered explicitly, so using it as the default serializer in PySpark would introduce regressions in MLlib.

cc mengxr

Author: Davies Liu <davies@databricks.com>

Closes #2916 from davies/revert and squashes the following commits:

43eb6d3 [Davies Liu] donot use KyroSerializer as default serializer
* SPARK-3812 Build changes to publish effective pom. | Prashant Sharma | 2014-10-23 | 1 | -1/+32

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2921 from ScrapCodes/build-changes-effective-pom and squashes the following commits:

8841491 [Prashant Sharma] Fixed broken maven build.
aa7b91d [Prashant Sharma] used an unused dep.
0300dac [Prashant Sharma] improved comment messages..
28f891e [Prashant Sharma] Added a useless dependency, so that we can shade it. And realized fake shading works for us.
553d96b [Prashant Sharma] Shaded some unused class of an unused dep, to generate effective pom(s)
* [SPARK-4000][BUILD] Sends archived unit tests logs to Jenkins master | Cheng Lian | 2014-10-23 | 1 | -12/+41

This PR sends archived unit test logs to the build history directory on the Jenkins master, so that we can serve them via HTTP later to help debug Jenkins build failures.

pwendell JoshRosen Please help review, thanks!

Author: Cheng Lian <lian@databricks.com>

Closes #2845 from liancheng/log-archive and squashes the following commits:

ac8d9d4 [Cheng Lian] Includes build number in messages posted to GitHub
68c7010 [Cheng Lian] Logs backup should be implemented in dev/run-tests-jenkins
4b912f7 [Cheng Lian] Sends archived unit tests logs to Jenkins master
* [SPARK-3993] [PySpark] fix bug while reuse worker after take() | Davies Liu | 2014-10-23 | 6 | -5/+44

After take(), some garbage may be left in the socket, and the next task assigned to this worker will then hang because of the corrupted data.

We should make sure the socket is clean before reusing it: write END_OF_STREAM at the end, and check for it after reading all results from Python.

Author: Davies Liu <davies.liu@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #2838 from davies/fix_reuse and squashes the following commits:

8872914 [Davies Liu] fix tests
660875b [Davies Liu] fix bug while reuse worker after take()
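
The idea in SPARK-3993 of ending the stream with a marker and checking for it before reuse can be sketched as follows. This is a minimal illustration, not PySpark's worker protocol: the length-prefixed framing, the helper names, and the marker value used for `END_OF_STREAM` here are assumptions made for the example.

```python
import io
import struct

END_OF_STREAM = -4  # hypothetical marker value chosen for this sketch


def write_results(stream, records):
    """Write length-prefixed records followed by the end marker."""
    for rec in records:
        stream.write(struct.pack("!i", len(rec)))
        stream.write(rec)
    stream.write(struct.pack("!i", END_OF_STREAM))


def read_some(stream, n):
    """Read up to n records, as take(n) would, leaving the rest unread."""
    out = []
    for _ in range(n):
        (length,) = struct.unpack("!i", stream.read(4))
        if length == END_OF_STREAM:
            break
        out.append(stream.read(length))
    return out


def is_clean_for_reuse(stream):
    """A channel is reusable only if the end marker is next and nothing follows."""
    header = stream.read(4)
    if len(header) < 4:
        return False
    (length,) = struct.unpack("!i", header)
    return length == END_OF_STREAM and stream.read(1) == b""
```

In this sketch, a reader that stops early leaves unread records behind, so `is_clean_for_reuse` returns `False` and the channel would not be handed to the next task.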
* [SPARK-4019] [SPARK-3740] Fix MapStatus compression bug that could lead to empty results or Snappy errors | Josh Rosen | 2014-10-23 | 5 | -34/+109

This commit fixes a bug in MapStatus that could cause jobs to wrongly return empty results if those jobs contained stages with more than 2000 partitions where most of those partitions were empty.

For jobs with > 2000 partitions, MapStatus uses HighlyCompressedMapStatus, which only stores the average size of blocks. If the average block size is zero, then this will cause all blocks to be reported as empty, causing BlockFetcherIterator to mistakenly skip them.

For example, this would return an empty result:

    sc.makeRDD(0 until 10, 1000).repartition(2001).collect()

This can also lead to deserialization errors (e.g. Snappy decoding errors) for jobs with > 2000 partitions where the average block size is non-zero but there is at least one empty block. In this case, the BlockFetcher attempts to fetch empty blocks and fails when trying to deserialize them.

The root problem here is that MapStatus has a (previously undocumented) correctness property that was violated by HighlyCompressedMapStatus:

    If a block is non-empty, then getSizeForBlock must be non-zero.

I fixed this by modifying HighlyCompressedMapStatus to store the average size of _non-empty_ blocks and to use a compressed bitmap to track which blocks are empty.

I also removed a test which was broken as originally written: it attempted to check that HighlyCompressedMapStatus's size estimation error was < 10%, but this was broken because HighlyCompressedMapStatus is only used for map statuses with > 2000 partitions, but the test only created 50.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2866 from JoshRosen/spark-4019 and squashes the following commits:

fc8b490 [Josh Rosen] Roll back hashset change, which didn't improve performance.
5faa0a4 [Josh Rosen] Incorporate review feedback
c8b8cae [Josh Rosen] Two performance fixes:
3b892dd [Josh Rosen] Address Reynold's review comments
ba2e71c [Josh Rosen] Add missing newline
609407d [Josh Rosen] Use Roaring Bitmap to track non-empty blocks.
c23897a [Josh Rosen] Use sets when comparing collect() results
91276a3 [Josh Rosen] [SPARK-4019] Fix MapStatus compression bug that could lead to empty results.
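
The correctness property from SPARK-4019 and the effect of the fix can be sketched in a few lines of Python. This is an illustration only: `compress_old` and `compress_new` are hypothetical names, a plain `set` stands in for the compressed bitmap, and integer averaging is an assumption of the sketch.

```python
def compress_old(sizes):
    """Behavior as described before the fix: keep one average over all blocks."""
    avg = sum(sizes) // len(sizes)
    return lambda block_id: avg


def compress_new(sizes):
    """Behavior as described after the fix: average over non-empty blocks,
    plus a record of which blocks are empty (a set here, not a bitmap)."""
    empty = {i for i, s in enumerate(sizes) if s == 0}
    non_empty = [s for s in sizes if s > 0]
    avg = sum(non_empty) // len(non_empty) if non_empty else 0
    return lambda block_id: 0 if block_id in empty else avg


def violates_property(sizes, get_size):
    """True if some non-empty block is reported with size zero."""
    return any(s > 0 and get_size(i) == 0 for i, s in enumerate(sizes))
```

With 2000 empty blocks and one block of size 10, the single average rounds to zero and the non-empty block would be skipped; tracking empty blocks separately keeps its reported size non-zero.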
* Revert "[SPARK-3812] [BUILD] Adapt maven build to publish effective pom." | Patrick Wendell | 2014-10-23 | 1 | -32/+1

This reverts commit c5882c663e054adcd3ecd9f11e91a1929dbc14a3.

I am reverting this because it appears to cause the maven tests to hang.
* specify unidocGenjavadocVersion of 0.8 | Holden Karau | 2014-10-23 | 1 | -0/+2

Fixes an issue where overly strict javadoc generation caused errors.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #2893 from holdenk/SPARK-3359-sbtunidoc-java8 and squashes the following commits:

9379a70 [Holden Karau] specify unidocGenjavadocVersion of 0.8
* [SPARK-4006] In long running contexts, we encountered the situation of double register without a remove in between. | Tal Sliwowicz | 2014-10-23 | 1 | -12/+13

The cause is unknown and is assumed to be a temporary network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix: simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts() by doing local cleanup of the maps before trying to add new ones.

Also added some logging for register and unregister.

This is just like https://github.com/apache/spark/pull/2854 except it's on master.

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
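
The cleanup described in SPARK-4006 can be sketched with two dictionaries standing in for the two maps. This is a simplified illustration, not the BlockManagerMasterActor code: the `register` function, its parameters, and the tuple used as a block manager id are assumptions of the sketch.

```python
def register(block_manager_info, id_by_executor, executor_id, block_manager_id):
    """Register a block manager, first dropping any stale id for the executor.

    block_manager_info: dict keyed by block manager id
    id_by_executor:     dict mapping executor id -> block manager id
    """
    old_id = id_by_executor.get(executor_id)
    if old_id is not None and old_id != block_manager_id:
        # A second register arrived without a remove in between:
        # clean both maps locally instead of treating it as fatal.
        block_manager_info.pop(old_id, None)
        del id_by_executor[executor_id]
    id_by_executor[executor_id] = block_manager_id
    block_manager_info[block_manager_id] = {"executor": executor_id}
```

After two registrations of the same executor on different ports, both maps hold only the newer id, so they stay consistent with each other.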
* [SPARK-4055][MLlib] Inconsistent spelling 'MLlib' and 'MLLib' | Kousuke Saruta | 2014-10-23 | 5 | -5/+5

There are some inconsistent spellings, 'MLlib' and 'MLLib', in some documents and source code.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2903 from sarutak/SPARK-4055 and squashes the following commits:

b031640 [Kousuke Saruta] Fixed inconsistent spelling "MLlib and MLLib"
* [BUILD] Fixed resolver for scalastyle plugin and upgrade sbt version. | Prashant Sharma | 2014-10-22 | 2 | -1/+3

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2877 from ScrapCodes/scalastyle-fix and squashes the following commits:

a17b9fe [Prashant Sharma] [BUILD] Fixed resolver for scalastyle plugin.
* [SPARK-3812] [BUILD] Adapt maven build to publish effective pom. | Prashant Sharma | 2014-10-22 | 1 | -1/+32

I tried the maven help plugin first, but that published all projects in the top-level pom. So I was left with no choice but to roll my own trivial plugin. This patch basically installs an effective pom after maven install is finished.

The problem it fixes is described as follows: if you install using maven `mvn install -DskipTests -Dhadoop.version=2.2.0 -Phadoop-2.2`, then without this patch the published pom(s) will have the hadoop version as 1.0.4. This can be a problem at some point.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2673 from ScrapCodes/build-changes-effective-pom and squashes the following commits:

aa7b91d [Prashant Sharma] used an unused dep.
0300dac [Prashant Sharma] improved comment messages..
28f891e [Prashant Sharma] Added a useless dependency, so that we can shade it. And realized fake shading works for us.
553d96b [Prashant Sharma] Shaded some unused class of an unused dep, to generate effective pom(s)
* [SPARK-3877][YARN] Throw an exception when application is not successful so that the exit code will be set to 1 | zsxwing | 2014-10-22 | 5 | -35/+44

When a YARN application fails (yarn-cluster mode), the exit code of spark-submit is still 0. This makes it hard to write automated scripts that run Spark jobs on YARN, because the scripts cannot detect the failure.

This PR adds a status check after `monitorApplication`. If an application is not successful, `run()` throws a `SparkException`, so that Client.scala exits with code 1. Scripts can therefore rely on the exit code of `spark-submit`.

Author: zsxwing <zsxwing@gmail.com>

Closes #2732 from zsxwing/SPARK-3877 and squashes the following commits:

1f89fa5 [zsxwing] Fix the unit test
a0498e1 [zsxwing] Update the docs and the error message
e1cb9ef [zsxwing] Fix the hacky way of calling Client
ff16fec [zsxwing] Remove System.exit in Client.scala and add a test
6a2c103 [zsxwing] [SPARK-3877] Throw an exception when application is not successful so that the exit code wil be set to 1
* [SPARK-3426] Fix sort-based shuffle error when spark.shuffle.compress and spark.shuffle.spill.compress settings are different | Josh Rosen | 2014-10-22 | 6 | -11/+61

This PR fixes SPARK-3426, an issue where sort-based shuffle crashes if the `spark.shuffle.spill.compress` and `spark.shuffle.compress` settings have different values.

The problem is that sort-based shuffle's read and write paths use different settings for determining whether to apply compression. ExternalSorter writes runs to files using `TempBlockId` ids, which causes `spark.shuffle.spill.compress` to be used for enabling compression, but these spilled files end up being shuffled over the network and read as shuffle files using `ShuffleBlockId` by BlockStoreShuffleFetcher, which causes `spark.shuffle.compress` to be used for enabling decompression. As a result, this leads to errors when these settings disagree.

Based on the discussions in #2247 and #2178, it sounds like we don't want to remove the `spark.shuffle.spill.compress` setting. Therefore, I've tried to come up with a fix where `spark.shuffle.spill.compress` is used to compress data that's read and written locally and `spark.shuffle.compress` is used to compress any data that will be fetched / read as shuffle blocks.

To do this, I split `TempBlockId` into two new id types, `TempLocalBlockId` and `TempShuffleBlockId`, which map to `spark.shuffle.spill.compress` and `spark.shuffle.compress`, respectively. ExternalAppendOnlyMap also used temp blocks for spilling data. It looks like ExternalSorter was designed to be a generic sorter but its configuration already happens to be tied to sort-based shuffle, so I think it's fine if we use `spark.shuffle.compress` to compress its spills; we can move the compression configuration to the constructor in a later commit if we find that ExternalSorter is being used in other contexts where we want different configuration options to control compression.

To summarize:

**Before:**

|       | ExternalAppendOnlyMap        | ExternalSorter               |
|-------|------------------------------|------------------------------|
| Read  | spark.shuffle.spill.compress | spark.shuffle.compress       |
| Write | spark.shuffle.spill.compress | spark.shuffle.spill.compress |

**After:**

|       | ExternalAppendOnlyMap        | ExternalSorter         |
|-------|------------------------------|------------------------|
| Read  | spark.shuffle.spill.compress | spark.shuffle.compress |
| Write | spark.shuffle.spill.compress | spark.shuffle.compress |

Thanks to andrewor14 for debugging this with me!

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2890 from JoshRosen/SPARK-3426 and squashes the following commits:

1921cf6 [Josh Rosen] Minor edit for clarity.
c8dd8f2 [Josh Rosen] Add comment explaining use of createTempShuffleBlock().
2c687b9 [Josh Rosen] Fix SPARK-3426.
91e7e40 [Josh Rosen] Combine tests into single test of all combinations
76ca65e [Josh Rosen] Add regression test for SPARK-3426.
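
The mapping from block id type to compression setting in SPARK-3426 can be sketched as a lookup table. This is an illustration under stated assumptions: the dictionary, the `should_compress` helper, and the use of class names as string keys are hypothetical and do not reflect how BlockManager is implemented.

```python
# Which setting governs compression for each block id type,
# following the description in the commit message above.
CONF_KEY_BY_BLOCK_TYPE = {
    "TempLocalBlockId": "spark.shuffle.spill.compress",
    "TempShuffleBlockId": "spark.shuffle.compress",
    "ShuffleBlockId": "spark.shuffle.compress",
}


def should_compress(block_type, conf):
    """Look up whether blocks of this type are compressed under conf."""
    return conf[CONF_KEY_BY_BLOCK_TYPE[block_type]]
```

Because spills written as `TempShuffleBlockId` and later read as `ShuffleBlockId` resolve to the same setting, the write and read sides agree even when the two settings differ.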
* Fix for sampling error in NumPy v1.9 [SPARK-3995][PYSPARK] | freeman | 2014-10-22 | 2 | -2/+8

Change maximum value for default seed during RDD sampling so that it is strictly less than 2 ** 32. This prevents a bug in the most recent version of NumPy, which cannot accept random seeds above this bound.

Adds an extra test that uses the default seed (instead of setting it manually, as in the docstrings).

mengxr

Author: freeman <the.freeman.lab@gmail.com>

Closes #2889 from freeman-lab/pyspark-sampling and squashes the following commits:

dc385ef [freeman] Change maximum value for default seed
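
The bound from SPARK-3995 can be illustrated with a small seed helper. This is a sketch only: `default_seed` and `MAX_SEED_EXCLUSIVE` are hypothetical names, and the code does not reproduce how PySpark draws its default seed.

```python
import random

MAX_SEED_EXCLUSIVE = 2 ** 32  # bound stated in the commit message


def default_seed(rng=random):
    """Draw a default seed that is strictly less than 2 ** 32."""
    # randrange excludes the upper bound, so the result is at most 2 ** 32 - 1.
    return rng.randrange(MAX_SEED_EXCLUSIVE)
```

Using an exclusive upper bound makes the "strictly less than" requirement hold by construction rather than by a later range check.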
* use isRunningLocally rather than runningLocally | CrazyJvm | 2014-10-22 | 1 | -1/+1

runningLocally is deprecated now

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #2879 from CrazyJvm/runningLocally and squashes the following commits:

bec0b3e [CrazyJvm] use isRunningLocally rather than runningLocally
* Update JavaCustomReceiver.java | Karthik | 2014-10-22 | 1 | -1/+1

Changed the usage string to correctly reflect the file name.

Author: Karthik <karthik.gomadam@gmail.com>

Closes #2699 from namelessnerd/patch-1 and squashes the following commits:

8570e33 [Karthik] Update JavaCustomReceiver.java
* SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy | Sandy Ryza | 2014-10-21 | 17 | -87/+195

Author: Sandy Ryza <sandy@cloudera.com>

Closes #789 from sryza/sandy-spark-1813 and squashes the following commits:

48b05e9 [Sandy Ryza] Simplify
b824932 [Sandy Ryza] Allow both spark.kryo.classesToRegister and spark.kryo.registrator at the same time
6a15bb7 [Sandy Ryza] Small fix
a2278c0 [Sandy Ryza] Respond to review comments
6ef592e [Sandy Ryza] SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
* [SQL] redundant methods for broadcast | wangfei | 2014-10-21 | 1 | -4/+0

Removes redundant methods for broadcast in `TableReader`.

Author: wangfei <wangfei1@huawei.com>

Closes #2862 from scwf/TableReader and squashes the following commits:

414cc24 [wangfei] unnecessary methods for broadcast
* SPARK-3568 [mllib] add ranking metrics | coderxiang | 2014-10-21 | 2 | -0/+206

Add common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including:

- Mean Average Precision
- Precision@n: top-n precision
- Discounted cumulative gain (DCG) and NDCG

The following methods and the corresponding tests are implemented:

```
class RankingMetrics[T](predictionAndLabels: RDD[(Array[T], Array[T])]) {
  /* Returns the precision@k for each query */
  lazy val precAtK: RDD[Array[Double]]

  /**
   * @param k the position to compute the truncated precision
   * @return the average precision at the first k ranking positions
   */
  def precision(k: Int): Double

  /* Returns the average precision for each query */
  lazy val avePrec: RDD[Double]

  /* Returns the mean average precision (MAP) of all the queries */
  lazy val meanAvePrec: Double

  /* Returns the normalized discounted cumulative gain for each query */
  lazy val ndcgAtK: RDD[Array[Double]]

  /**
   * @param k the position to compute the truncated ndcg
   * @return the average ndcg at the first k ranking positions
   */
  def ndcg(k: Int): Double
}
```

Author: coderxiang <shuoxiangpub@gmail.com>

Closes #2667 from coderxiang/rankingmetrics and squashes the following commits:

d881097 [coderxiang] update doc
14d9cd9 [coderxiang] remove unexpected files
d7fb93f [coderxiang] style change and remove ignored files
f113ee1 [coderxiang] modify doc for displaying superscript and subscript
f626896 [coderxiang] improve doc and remove unnecessary computation while labSet is empty
be6645e [coderxiang] set the precision of empty labset to 0.0
d64c120 [coderxiang] add logWarning for empty ground truth set
dfae292 [coderxiang] handle empty labSet for map. add test
62047c4 [coderxiang] style change and add documentation
f66612d [coderxiang] add additional test of precisionAt
b794cb2 [coderxiang] move private members precAtK, ndcgAtK into public methods. style change
77c9e5d [coderxiang] set precAtK and ndcgAtK as private member. Improve documentation
5f87bce [coderxiang] add API to calculate precision and ndcg at each ranking position
b7851cc [coderxiang] Use generic type to represent IDs
e443fee [coderxiang] change style and use alternative builtin methods
3a5a6ff [coderxiang] add ranking metrics
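
The three metrics listed in SPARK-3568 can be sketched for a single query in plain Python, using the textbook definitions with binary relevance. This is not the MLlib implementation: the function names are hypothetical, and returning 0.0 for an empty label set follows the squashed commit notes rather than any verified behavior.

```python
import math


def precision_at(pred, labels, k):
    """Fraction of the first k predictions that appear in the label set."""
    label_set = set(labels)
    if not label_set:
        return 0.0
    hits = sum(1 for item in pred[:k] if item in label_set)
    return hits / k


def average_precision(pred, labels):
    """Mean of precision values taken at each position holding a relevant item."""
    label_set = set(labels)
    if not label_set:
        return 0.0
    hits = 0
    total = 0.0
    for i, item in enumerate(pred):
        if item in label_set:
            hits += 1
            total += hits / (i + 1)
    return total / len(label_set)


def ndcg_at(pred, labels, k):
    """DCG of the first k predictions divided by the best achievable DCG."""
    label_set = set(labels)
    if not label_set:
        return 0.0
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(pred[:k]) if item in label_set)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(label_set))))
    return dcg / ideal
```

Averaging `average_precision` over all queries would give the mean average precision (MAP) named in the list above.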
* [SPARK-3994] Use standard Aggregator code path for countByKey and countByValue | Aaron Davidson | 2014-10-21 | 2 | -26/+16

See [JIRA](https://issues.apache.org/jira/browse/SPARK-3994) for more information. Also adds a note which warns against using these methods.

Author: Aaron Davidson <aaron@databricks.com>

Closes #2839 from aarondav/countByKey and squashes the following commits:

d6fdb2a [Aaron Davidson] Respond to comments
e1f06d3 [Aaron Davidson] [SPARK-3994] Use standard Aggregator code path for countByKey and countByValue
* SPARK-3770: Make userFeatures accessible from python | Michelangelo D'Agostino | 2014-10-21 | 2 | -0/+36

https://issues.apache.org/jira/browse/SPARK-3770

We need access to the underlying latent user features from Python. However, the userFeatures RDD from the MatrixFactorizationModel isn't accessible from the Python bindings. I've added a method to the underlying Scala class to turn the RDD[(Int, Array[Double])] into an RDD[String]. This is then accessed from the Python recommendation.py.

Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>

Closes #2636 from mdagost/mf_user_features and squashes the following commits:

c98f9e2 [Michelangelo D'Agostino] Added unit tests for userFeatures and productFeatures and merged master.
d5eadf8 [Michelangelo D'Agostino] Merge branch 'master' into mf_user_features
2481a2a [Michelangelo D'Agostino] Merged master and resolved conflict.
a6ffb96 [Michelangelo D'Agostino] Eliminated a function from our first approach to this problem that is no longer needed now that we added the fromTuple2RDD function.
2aa1bf8 [Michelangelo D'Agostino] Implemented a function called fromTuple2RDD in PythonMLLibAPI and used it to expose the MF userFeatures and productFeatures in python.
34cb2a2 [Michelangelo D'Agostino] A couple of lint cleanups and a comment.
cdd98e3 [Michelangelo D'Agostino] It's working now.
e1fbe5e [Michelangelo D'Agostino] Added scala function to stringify userFeatures for access in python.
* [SPARK-4020] Do not rely on timeouts to remove failed block managers | Andrew Or | 2014-10-21 | 1 | -1/+1

If an executor fails without being scheduled to run any tasks, then `DAGScheduler` won't notify `BlockManagerMasterActor` that the associated block manager should be removed. Instead, the associated block manager will be expired only after a few rounds of heartbeat timeouts. In terms of removal treatment, there should really be no distinction between executors that have been scheduled tasks and those that have not.

The fix, then, is to add all known executors to `TaskSchedulerImpl`'s `activeExecutorIds` whether or not it has been scheduled a task. In fact, the existing comment above `activeExecutorIds` is

```
// Which executor IDs we have executors on
val activeExecutorIds = new HashSet[String]
```

not "Which executors have been scheduled tasks thus far."

Author: Andrew Or <andrewor14@gmail.com>

Closes #2865 from andrewor14/active-executors and squashes the following commits:

ff3172b [Andrew Or] Add all known executors to `activeExecutorIds`
* [SPARK-4035] Fix a wrong format specifier | zsxwing | 2014-10-21 | 1 | -1/+1

Just found a typo. Should not use "%f" for Long.

Author: zsxwing <zsxwing@gmail.com>

Closes #2875 from zsxwing/SPARK-4035 and squashes the following commits:

ce347e2 [zsxwing] Fix a wrong format specifier
* replace awaitTransformation with awaitTermination in scaladoc/javadoc | Holden Karau | 2014-10-21 | 3 | -3/+3

Author: Holden Karau <holden@pigscanfly.ca>

Closes #2861 from holdenk/SPARK-4015-Documentation-in-the-streaming-context-references-non-existent-function and squashes the following commits:

081db8a [Holden Karau] fix pyspark streaming doc too
0e03863 [Holden Karau] replace awaitTransformation with awaitTermination
* [SPARK-4023] [MLlib] [PySpark] convert rdd into RDD of Vector | Davies Liu | 2014-10-21 | 2 | -4/+24

Convert the input rdd to RDD of Vector.

cc mengxr

Author: Davies Liu <davies@databricks.com>

Closes #2870 from davies/fix4023 and squashes the following commits:

1eac767 [Davies Liu] address comments
0871576 [Davies Liu] convert rdd into RDD of Vector
* [SPARK-3958] TorrentBroadcast cleanup / debugging improvements. | Josh Rosen | 2014-10-21 | 3 | -92/+97

This PR makes several changes to TorrentBroadcast in order to make it easier to reason about, which should help when debugging SPARK-3958. The key changes:

- Remove all state from the global TorrentBroadcast object. This state consisted mainly of configuration options, like the block size and compression codec, and was read by the blockify / unblockify methods. Unfortunately, the use of `lazy val` for `BLOCK_SIZE` meant that the block size was always determined by the first SparkConf that TorrentBroadcast was initialized with; as a result, unit tests could not properly test TorrentBroadcast with different block sizes.

  Instead, blockifyObject and unBlockifyObject now accept compression codecs and blockSizes as arguments. These arguments are supplied at the call sites inside of TorrentBroadcast instances. Each TorrentBroadcast instance determines these values from SparkEnv's SparkConf. I was careful to ensure that we do not accidentally serialize CompressionCodec or SparkConf objects as part of the TorrentBroadcast object.

- Remove special-case handling of local-mode in TorrentBroadcast. I don't think that broadcast implementations should know about whether we're running in local mode. If we want to optimize the performance of broadcast in local mode, then we should detect this at a higher level and use a dummy LocalBroadcastFactory implementation instead.

  Removing this code fixes a subtle error condition: in the old local mode code, a failure to find the broadcast in the local BlockManager would lead to an attempt to deblockify zero blocks, which could lead to confusing deserialization or decompression errors when we attempted to decompress an empty byte array. This should never have happened, though: a failure to find the block in local mode is evidence of some other error. The changes here will make it easier to debug those errors if they ever happen.

- Add a check that throws an exception when attempting to deblockify an empty array.

- Use ScalaCheck to add a test to check that TorrentBroadcast's blockifyObject and unBlockifyObject methods are inverses.

- Misc. cleanup and logging improvements.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2844 from JoshRosen/torrentbroadcast-bugfix and squashes the following commits:

1e8268d [Josh Rosen] Address Reynold's review comments
2a9fdfd [Josh Rosen] Address Reynold's review comments.
c3b08f9 [Josh Rosen] Update TorrentBroadcast tests to reflect removal of special local-mode optimizations.
5c22782 [Josh Rosen] Store broadcast variable's value in the driver.
33fc754 [Josh Rosen] Change blockify/unblockifyObject to accept serializer as argument.
618a872 [Josh Rosen] [SPARK-3958] TorrentBroadcast cleanup / debugging improvements.
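
The design described in SPARK-3958, where block size and codec are passed as arguments and the two functions are inverses, can be sketched in Python. This is an analogy under stated assumptions: `blockify` and `unblockify` are hypothetical helpers, with `pickle` and `zlib` standing in for Spark's serializer and compression codec.

```python
import pickle
import zlib


def blockify(obj, block_size, compress=zlib.compress):
    """Serialize, compress, and split into chunks of at most block_size bytes."""
    data = compress(pickle.dumps(obj))
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def unblockify(blocks, decompress=zlib.decompress):
    """Inverse of blockify; refuses an empty block list instead of failing later."""
    if not blocks:
        raise ValueError("Cannot unblockify an empty array of blocks")
    return pickle.loads(decompress(b"".join(blocks)))
```

Because the block size is an argument rather than global state, a round-trip test can exercise several block sizes in the same process, which is the kind of test the commit message says was previously not possible.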
* Update Building Spark link. | Reynold Xin | 2014-10-20 | 1 | -1/+1
* [SPARK-3940][SQL] Avoid console printing error messages three times | wangxiaojing | 2014-10-20 | 1 | -1/+1

If the SQL is wrong, the console prints the error only once. For example:

<pre>
spark-sql> show tabless;
show tabless;
14/10/13 21:03:48 INFO ParseDriver: Parsing command: show tabless
............
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:274)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:209)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
Caused by: org.apache.hadoop.hive.ql.parse.ParseException: line 1:5 cannot recognize input near 'show' 'tabless' '<EOF>' in ddl statement
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:193)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:161)
    at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:218)
    at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:226)
    ... 47 more
Time taken: 4.35 seconds
14/10/13 21:03:51 INFO CliDriver: Time taken: 4.35 seconds
</pre>

Author: wangxiaojing <u9jing@gmail.com>

Closes #2790 from wangxiaojing/spark-3940 and squashes the following commits:

e2e5c14 [wangxiaojing] sql Print the error code three times
* [SPARK-3969][SQL] Optimizer should have a super class as an interface. | Takuya UESHIN | 2014-10-20 | 3 | -5/+7

Some developers want to replace `Optimizer` to fit their projects but can't do so because currently `Optimizer` is an `object`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #2825 from ueshin/issues/SPARK-3969 and squashes the following commits:

abbc53c [Takuya UESHIN] Re-rename Optimizer object.
4d2e1bc [Takuya UESHIN] Rename Optimizer object.
9547a23 [Takuya UESHIN] Extract abstract class from Optimizer for developers to be able to replace Optimizer.
* [SPARK-3945] Properties of hive-site.xml have no effect when running the Thrift JDBC server | luogankun | 2014-10-20 | 1 | -1/+5

Write the properties of hive-site.xml to HiveContext when initializing the session state in SparkSQLEnv.scala.

The SparkSQLEnv.init() call in HiveThriftServer2.scala does not write the properties of hive-site.xml to HiveContext. An example is a spark.sql.shuffle.partitions property added to hive-site.xml.

Author: luogankun <luogankun@gmail.com>

Closes #2800 from luogankun/SPARK-3945 and squashes the following commits:

3679efc [luogankun] [SPARK-3945]Write properties of hive-site.xml to HiveContext when initilize session state In SparkSQLEnv.scala
* [SPARK-3966][SQL] Fix nullabilities of Cast related to DateType.Takuya UESHIN2014-10-201-0/+4
| | | | | | | | Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #2820 from ueshin/issues/SPARK-3966 and squashes the following commits: ca4a745 [Takuya UESHIN] Fix nullabilities of Cast related to DateType.
* [SPARK-3800][SQL] Clean aliases from grouping expressionsMichael Armbrust2014-10-202-0/+35
| | | | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #2658 from marmbrus/nestedAggs and squashes the following commits: 862b763 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into nestedAggs 3234521 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into nestedAggs 8b06fdc [Michael Armbrust] possible fix for grouping on nested fields
* [SPARK-3906][SQL] Adds multiple join support for SQLContextCheng Lian2014-10-202-3/+16
| | | | | | | | Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2767 from liancheng/multi-join and squashes the following commits: 9dc0d18 [Cheng Lian] Adds multiple join support for SQLContext
* [SPARK-3207][MLLIB]Choose splits for continuous features in DecisionTree ↵Qiping Li2014-10-205-16/+176
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | more adaptively DecisionTree splits on continuous features by choosing an array of values from a subsample of the data. Currently, it does not check for identical values in the subsample, so it could end up having multiple copies of the same split. In this PR, we choose splits for a continuous feature in 3 steps: 1. Sort the sample values for this feature 2. Count the number of occurrences of each distinct value 3. Iterate over the value-count array computed in step 2 to choose splits. After the splits are found, `numSplits` and `numBins` in the metadata are updated. CC: mengxr manishamde jkbradley, please help me review this, thanks. Author: Qiping Li <liqiping1991@gmail.com> Author: chouqin <liqiping1991@gmail.com> Author: liqi <liqiping1991@gmail.com> Author: qiping.lqp <qiping.lqp@alibaba-inc.com> Closes #2780 from chouqin/dt-findsplits and squashes the following commits: 18d0301 [Qiping Li] check explicitly findsplits return distinct splits 8dc28ab [chouqin] remove blank lines ffc920f [chouqin] adjust code based on comments and add more test cases 9857039 [chouqin] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits d353596 [qiping.lqp] fix pyspark doc test 9e64699 [Qiping Li] fix random forest unit test 3c72913 [Qiping Li] fix random forest unit test 092efcb [Qiping Li] fix bug f69f47f [Qiping Li] fix bug ab303a4 [Qiping Li] fix bug af6dc97 [Qiping Li] fix bug 2a8267a [Qiping Li] fix bug c339a61 [Qiping Li] fix bug 369f812 [Qiping Li] fix style 8f46af6 [Qiping Li] add comments and unit test 9e7138e [Qiping Li] Merge branch 'dt-findsplits' of https://github.com/chouqin/spark into dt-findsplits 1b25a35 [Qiping Li] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits 0cd744a [liqi] fix bug 3652823 [Qiping Li] fix bug af7cb79 [Qiping Li] Choose splits for continuous features in DecisionTree more adaptively
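The three steps can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `chooseSplits` and `maxSplits` are hypothetical names, and the stride-based rule used in step 3 (emit a split whenever the running count passes the next evenly spaced target) is one plausible way to iterate the value counts.

```scala
object SplitSketch {
  /** Picks distinct split candidates for one continuous feature (illustrative only). */
  def chooseSplits(samples: Seq[Double], maxSplits: Int): Seq[Double] = {
    require(maxSplits > 0, "maxSplits must be positive")

    // Step 1: sort the sampled values.
    val sorted = samples.sorted

    // Step 2: count occurrences of each distinct value, keeping ascending order.
    val valueCounts: Vector[(Double, Int)] =
      sorted.foldLeft(Vector.empty[(Double, Int)]) { (acc, x) =>
        acc.lastOption match {
          case Some((v, c)) if v == x => acc.init :+ ((v, c + 1))
          case _                      => acc :+ ((x, 1))
        }
      }

    // Step 3: walk the value counts and choose splits.
    if (valueCounts.length <= maxSplits) {
      // Few distinct values: every one of them is a candidate.
      valueCounts.map(_._1)
    } else {
      // Assumed rule: aim for splits that are `stride` samples apart.
      val stride = sorted.length.toDouble / (maxSplits + 1)
      val splits = Vector.newBuilder[Double]
      var cumulative = valueCounts.head._2
      var target = stride
      for (i <- 1 until valueCounts.length) {
        val previous = cumulative
        cumulative += valueCounts(i)._2
        // If the previous running count was closer to the target, split there.
        if (math.abs(previous - target) < math.abs(cumulative - target)) {
          splits += valueCounts(i - 1)._1
          target += stride
        }
      }
      splits.result()
    }
  }
}
```

Because each distinct value is visited once and appended at most once, the returned candidates never contain duplicates, which is the property the commit message is after.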
* [SPARK-3736] Workers reconnect when disassociated from the master.mcheah2014-10-203-20/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before, if the master node was killed and restarted, the worker nodes would not attempt to reconnect to the Master. Therefore, when the Master node was restarted, the worker nodes needed to be restarted as well. Now, when the Master node is disconnected, the worker nodes will continuously ping the master node in attempts to reconnect to it. Once the master node restarts, it will detect one of the registration requests from its former workers. The result is that the cluster re-enters a healthy state. In addition, when the master did not receive a heartbeat from a worker, the worker was removed; however, when that worker later sent a heartbeat, the master used to ignore it. Now, a master that receives a heartbeat from a worker that had been disconnected will request the worker to re-attempt the registration process, at which point the worker will send a RegisterWorker request and be re-connected accordingly. Re-connection attempts per worker are submitted every N seconds, where N is configured by the property spark.worker.reconnect.interval; this currently defaults to 60 seconds. Author: mcheah <mcheah@palantir.com> Closes #2828 from mccheah/reconnect-dead-workers and squashes the following commits: 83f8bc9 [mcheah] [SPARK-3736] More informative log message, and fixing some indentation. fe0e02f [mcheah] [SPARK-3736] Moving reconnection logic to registerWithMaster(). 94ddeca [mcheah] [SPARK-3736] Changing a log warning to a log info. a698e35 [mcheah] [SPARK-3736] Addressing PR comment to make some defs private. b9a3077 [mcheah] [SPARK-3736] Addressing PR comments related to reconnection. 2ad5ed5 [mcheah] [SPARK-3736] Cancel attempts to reconnect if the master changes. b5b34af [mcheah] [SPARK-3736] Workers reconnect when disassociated from the master.
* [SPARK-3986][SQL] Fix package names to fit their directory names.Takuya UESHIN2014-10-202-4/+2
| | | | | | | | | | | | Package names of 2 test suites are different from their directory names. - `GeneratedEvaluationSuite` - `GeneratedMutableEvaluationSuite` Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #2835 from ueshin/issues/SPARK-3986 and squashes the following commits: fa2cc05 [Takuya UESHIN] Fix package names to fit their directory names.
* [SPARK-4010][Web UI]Spark UI returns 500 in yarn-client modeGuoQiang Li2014-10-202-5/+5
| | | | | | | | | | | The problem was caused by #1966. CC YanTangZhai andrewor14 Author: GuoQiang Li <witgo@qq.com> Closes #2858 from witgo/SPARK-4010 and squashes the following commits: 9866fbf [GuoQiang Li] Spark UI returns 500 in yarn-client mode
* [SPARK-3948][Shuffle]Fix stream corruption bug in sort-based shufflejerryshao2014-10-202-6/+28
| | | | | | | | | | | | A bug in kernel 2.6.32 leads to unexpected behavior of transferTo in copyStream, which corrupts the shuffle output file in sort-based shuffle and can surface as PARSING_ERROR(2), a deserialization error, or an offset out of range. This patch fixes it by adding the append flag and also adds some position-checking code. Details can be seen in [SPARK-3948](https://issues.apache.org/jira/browse/SPARK-3948). Author: jerryshao <saisai.shao@intel.com> Closes #2824 from jerryshao/SPARK-3948 and squashes the following commits: be0533a [jerryshao] Address the comments a82b184 [jerryshao] add configuration to control the NIO way of copying stream e17ada2 [jerryshao] Fix kernel 2.6.32 bug led unexpected behavior of transferTo
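A rough sketch of the two ideas named above, the append flag and the position check, using only `java.io` and `java.nio`. It is not the patched `copyStream`: `CopySketch` and `appendWithTransferTo` are hypothetical names, and the error handling is an assumption.

```scala
import java.io.{File, FileInputStream, FileOutputStream}

object CopySketch {
  /** Appends all bytes of `src` to `dst`; returns the number of bytes copied. */
  def appendWithTransferTo(src: File, dst: File): Long = {
    val in = new FileInputStream(src).getChannel
    // The second argument `true` opens the destination in append mode.
    val out = new FileOutputStream(dst, true).getChannel
    try {
      val initialPos = out.position()
      val size = in.size()
      var copied = 0L
      // transferTo may move fewer bytes than requested, so loop until done.
      while (copied < size) {
        copied += in.transferTo(copied, size - copied, out)
      }
      // Position check: the output must have advanced by exactly `size` bytes.
      val finalPos = out.position()
      if (finalPos != initialPos + size) {
        throw new IllegalStateException(
          s"Expected position ${initialPos + size} after copy, found $finalPos")
      }
      copied
    } finally {
      in.close()
      out.close()
    }
  }
}
```

Failing loudly when the final position is unexpected turns silent file corruption into an immediate, diagnosable error.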
* [SPARK-3902] [SPARK-3590] Stabilize AsyncRDDActions and add Java APIJosh Rosen2014-10-196-35/+246
| | | | | | | | | | | | | | | | | | This PR adds a Java API for AsyncRDDActions and promotes the API from `Experimental` to stable. Author: Josh Rosen <joshrosen@apache.org> Author: Josh Rosen <joshrosen@databricks.com> Closes #2760 from JoshRosen/async-rdd-actions-in-java and squashes the following commits: 0d45fbc [Josh Rosen] Whitespace fix. ad3ae53 [Josh Rosen] Merge remote-tracking branch 'origin/master' into async-rdd-actions-in-java c0153a5 [Josh Rosen] Remove unused variable. e8e2867 [Josh Rosen] Updates based on Marcelo's review feedback 7a1417f [Josh Rosen] Removed unnecessary java.util import. 6f8f6ac [Josh Rosen] Fix import ordering. ff28e49 [Josh Rosen] Add MiMa excludes and fix a scalastyle error. 346e46e [Josh Rosen] [SPARK-3902] Stabilize AsyncRDDActions; add Java API.
* [SPARK-2546] Clone JobConf for each task (branch-1.0 / 1.1 backport)Josh Rosen2014-10-192-15/+47
| | | | | | | | | | | | | | | | | | | | This patch attempts to fix SPARK-2546 in `branch-1.0` and `branch-1.1`. The underlying problem is that thread-safety issues in Hadoop Configuration objects may cause Spark tasks to get stuck in infinite loops. The approach taken here is to clone a new copy of the JobConf for each task rather than sharing a single copy between tasks. Note that there are still Configuration thread-safety issues that may affect the driver, but these seem much less likely to occur in practice and will be more complex to fix (see discussion on the SPARK-2546 ticket). This cloning is guarded by a new configuration option (`spark.hadoop.cloneConf`) and is disabled by default in order to avoid unexpected performance regressions for workloads that are unaffected by the Configuration thread-safety issues. Author: Josh Rosen <joshrosen@apache.org> Closes #2684 from JoshRosen/jobconf-fix-backport and squashes the following commits: f14f259 [Josh Rosen] Add configuration option to control cloning of Hadoop JobConf. b562451 [Josh Rosen] Remove unused jobConfCacheKey field. dd25697 [Josh Rosen] [SPARK-2546] [1.0 / 1.1 backport] Clone JobConf for each task. (cherry picked from commit 2cd40db2b3ab5ddcb323fd05c171dbd9025f9e71) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
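The idea of a guarded per-task copy can be sketched without Hadoop. In this illustration a mutable map stands in for the `JobConf`, and `confForTask` is a hypothetical helper; only the option name `spark.hadoop.cloneConf` comes from the commit message. The sketch shows isolation between copies, not the thread-safety problem itself.

```scala
import scala.collection.mutable

object CloneConfSketch {
  // Hypothetical stand-in for a Hadoop JobConf: a mutable key/value store.
  type Conf = mutable.Map[String, String]

  /**
   * Returns the configuration a task should use. `cloneConf` plays the role
   * of the `spark.hadoop.cloneConf` option described above.
   */
  def confForTask(shared: Conf, cloneConf: Boolean): Conf =
    if (cloneConf) {
      // Each task gets its own copy, so its mutations stay private.
      mutable.Map.empty[String, String] ++= shared
    } else {
      // Option disabled: tasks keep sharing one instance and pay no copy cost.
      shared
    }
}
```

Keeping the copy behind a flag mirrors the trade-off in the message: cloning costs time and memory per task, so workloads that never hit the issue can leave it off.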
* [SPARK-3952] [Streaming] [PySpark] add Python examples in Streaming ↵Davies Liu2014-10-185-14/+391
| | | | | | | | | | | | | | | | | | Programming Guide Having Python examples in Streaming Programming Guide. Also add RecoverableNetworkWordCount example. Author: Davies Liu <davies.liu@gmail.com> Author: Davies Liu <davies@databricks.com> Closes #2808 from davies/pyguide and squashes the following commits: 8d4bec4 [Davies Liu] update readme 26a7e37 [Davies Liu] fix format 3821c4d [Davies Liu] address comments, add missing file 7e4bb8a [Davies Liu] add Python examples in Streaming Programming Guide
* SPARK-3926 [CORE] Result of JavaRDD.collectAsMap() is not SerializableSean Owen2014-10-184-9/+23
| | | | | | | | | | | | | Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are Author: Sean Owen <sowen@cloudera.com> Closes #2805 from srowen/SPARK-3926 and squashes the following commits: ecb78ee [Sean Owen] Fix conflict between java.io.Serializable and use of Scala's Serializable f4717f9 [Sean Owen] Oops, fix compile problem ae1b36f [Sean Owen] Expand to cover Maps returned from other Java API methods as well 51c26c2 [Sean Owen] Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are
* [SPARK-3934] [SPARK-3918] [mllib] Bug fixes for RandomForest, DecisionTreeJoseph K. Bradley2014-10-176-20/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SPARK-3934: When run with a mix of unordered categorical and continuous features, on multiclass classification, RandomForest fails. The bug is in the sanity checks in getFeatureOffset and getLeftRightFeatureOffsets, which use the wrong indices for checking whether features are unordered. Fix: Remove the sanity checks since they are not really needed, and since they would require DTStatsAggregator to keep track of an extra set of indices (for the feature subset). Added test to RandomForestSuite which failed with old version but now works. SPARK-3918: Added baggedInput.unpersist at end of training. Also: * I removed DTStatsAggregator.isUnordered since it is no longer used. * DecisionTreeMetadata: Added logWarning when maxBins is automatically reduced. * Updated DecisionTreeRunner to explicitly fix the test data to have the same number of features as the training data. This is a temporary fix which should eventually be replaced by pre-indexing both datasets. * RandomForestModel: Updated toString to print total number of nodes in forest. * Changed Predict class to be public DeveloperApi. This was necessary to allow users to create their own trees by hand (for testing). CC: mengxr manishamde chouqin codedeft Just notifying you of these small bug fixes. Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes #2785 from jkbradley/dtrunner-update and squashes the following commits: 9132321 [Joseph K. Bradley] merged with master, fixed imports 9dbd000 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update e116473 [Joseph K. Bradley] Changed Predict class to be public DeveloperApi. f502e65 [Joseph K. Bradley] bug fix for SPARK-3934 7f3d60f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update ba567ab [Joseph K. Bradley] Changed DTRunner to load test data using same number of features as in training data. 4e88c1f [Joseph K. Bradley] changed RF toString to print total number of nodes
* [SPARK-3985] [Examples] fix file path using os.path.joinDaoyuan Wang2014-10-171-1/+1
| | | | | | | | Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #2834 from adrian-wang/sqlpypath and squashes the following commits: da7aa95 [Daoyuan Wang] fix file path using path.join
* [SPARK-3855][SQL] Preserve the result attribute of python UDFs though ↵Michael Armbrust2014-10-173-3/+17
| | | | | | | | | | | | | | transformations In the current implementation it was possible for the reference to change after analysis. Author: Michael Armbrust <michael@databricks.com> Closes #2717 from marmbrus/pythonUdfResults and squashes the following commits: da14879 [Michael Armbrust] Fix test 6343bcb [Michael Armbrust] add test 9533286 [Michael Armbrust] Correctly preserve the result attribute of python UDFs though transformations
* [SPARK-3979] [yarn] Use fs's default replication.Marcelo Vanzin2014-10-171-1/+2
| | | | | | | | | | | | | | | This avoids issues when HDFS is configured in a way that would not allow the hardcoded default replication of "3". Note: getDefaultReplication(Path) was added in 0.23.3, and the oldest one available on Maven Central is 0.23.7, so I chose to not add code to access that method via reflection. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2831 from vanzin/SPARK-3979 and squashes the following commits: b0e3a97 [Marcelo Vanzin] [SPARK-3979] [yarn] Use fs's default replication.
* [SPARK-3935][Core] log the number of records that has been writtenlikun2014-10-171-2/+0
| | | | | | | | | | | | There is an unused variable (count) in saveAsHadoopDataset in PairRDDFunctions.scala. The variable seems to have been intended to count the number of records, so I am adding a log statement that logs the number of records written to the writer. Author: likun <jacky.likun@huawei.com> Author: jackylk <jacky.likun@huawei.com> Closes #2791 from jackylk/SPARK-3935 and squashes the following commits: a874047 [jackylk] removing the unused variable in PairRddFunctions.scala 3bf43c7 [likun] log the number of records has been written