Commit message | Author | Age | Files | Lines
* SPARK-3926 [CORE] Reopened: result of JavaRDD collectAsMap() is not serializableSean Owen2014-12-083-3/+75
| | | | | | | | | | | My original 'fix' didn't fix at all. Now, there's a unit test to check whether it works. Of the two options to really fix it -- copy the `Map` to a `java.util.HashMap`, or copy and modify Scala's implementation in `Wrappers.MapWrapper`, I went with the latter. Author: Sean Owen <sowen@cloudera.com> Closes #3587 from srowen/SPARK-3926 and squashes the following commits: 8586bb9 [Sean Owen] Remove unneeded no-arg constructor, and add additional note about copied code in LICENSE 7bb0e66 [Sean Owen] Make SerializableMapWrapper actually serialize, and add unit test
* [SPARK-4750] Dynamic allocation - synchronize killsAndrew Or2014-12-081-1/+1
| | | | | | | | | | Simple omission on my part. Author: Andrew Or <andrew@databricks.com> Closes #3612 from andrewor14/dynamic-allocation-synchronization and squashes the following commits: 1f03b60 [Andrew Or] Synchronize kills
* [SPARK-4774] [SQL] Makes HiveFromSpark more portableKostas Sakellis2014-12-081-2/+11
| | | | | | | | | | | | HiveFromSpark read the kv1.txt file from SPARK_HOME/examples/src/main/resources/kv1.txt, which assumed you had a source tree checked out. Now we copy the kv1.txt file to a temporary file and delete it when the JVM shuts down. This allows us to run this example outside of a Spark source tree. Author: Kostas Sakellis <kostas@cloudera.com> Closes #3628 from ksakellis/kostas-spark-4774 and squashes the following commits: 6770f83 [Kostas Sakellis] [SPARK-4774] [SQL] Makes HiveFromSpark more portable
* [SPARK-4764] Ensure that files are fetched atomicallyChristophe Préaud2014-12-081-2/+1
| | | | | | | | | | | | | | | tempFile is created in the same directory as targetFile, so that the move from tempFile to targetFile is always atomic. Author: Christophe Préaud <christophe.preaud@kelkoo.com> Closes #2855 from preaudc/master and squashes the following commits: 9ba89ca [Christophe Préaud] Ensure that files are fetched atomically 54419ae [Christophe Préaud] Merge remote-tracking branch 'upstream/master' c6a5590 [Christophe Préaud] Revert commit 8ea871f8130b2490f1bad7374a819bf56f0ccbbd 7456a33 [Christophe Préaud] Merge remote-tracking branch 'upstream/master' 8ea871f [Christophe Préaud] Ensure that files are fetched atomically
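A minimal Python sketch of the same-directory temp file technique follows. It is not the Spark code: `write_atomically` and the `fetchFileTemp` prefix are names chosen for this illustration, and the atomicity relies on the rename staying within one filesystem.

```python
import os
import tempfile


def write_atomically(target_path, data):
    """Write data (bytes) to target_path through a temp file in the same directory."""
    target_dir = os.path.dirname(os.path.abspath(target_path))
    # Creating the temp file next to the target keeps both on one filesystem,
    # so the final rename does not degrade into a copy across devices.
    fd, temp_path = tempfile.mkstemp(dir=target_dir, prefix="fetchFileTemp")
    try:
        with os.fdopen(fd, "wb") as temp_file:
            temp_file.write(data)
        # os.replace renames the temp file over the target in one step.
        os.replace(temp_path, target_path)
    except BaseException:
        # Remove the partial temp file if anything failed before the rename.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

Readers of `target_path` see either the old content or the complete new content, never a partially written file.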
* [SPARK-4620] Add unpersist in Graph and GraphImplTakeshi Yamamuro2014-12-072-0/+12
| | | | | | | | | | | | | | Add an interface to uncache both vertices and edges of Graph/GraphImpl. This interface is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed for following iterations. Author: Takeshi Yamamuro <linguin.m.s@gmail.com> This patch had conflicts when merged, resolved by Committer: Ankur Dave <ankurdave@gmail.com> Closes #3476 from maropu/UnpersistInGraphSpike and squashes the following commits: 77a006a [Takeshi Yamamuro] Add unpersist in Graph and GraphImpl
* [SPARK-4646] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in SparkTakeshi Yamamuro2014-12-072-5/+64
| | | | | | | | | | | | This patch just replaces a native quick sorter with Sorter(TimSort) in Spark. It could get performance gains by ~8% in my quick experiments. Author: Takeshi Yamamuro <linguin.m.s@gmail.com> Closes #3507 from maropu/TimSortInEdgePartitionBuilderSpike and squashes the following commits: 8d4e5d2 [Takeshi Yamamuro] Remove a wildcard import 3527e00 [Takeshi Yamamuro] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
* [SPARK-3623][GraphX] GraphX should support the checkpoint operationGuoQiang Li2014-12-063-0/+34
| | | | | | | | | | Author: GuoQiang Li <witgo@qq.com> Closes #2631 from witgo/SPARK-3623 and squashes the following commits: a70c500 [GuoQiang Li] Remove java related 4d1e249 [GuoQiang Li] Add comments e682724 [GuoQiang Li] Graph should support the checkpoint operation
* Streaming doc : do you mean inadvertently?CrazyJvm2014-12-051-1/+1
| | | | | | | | Author: CrazyJvm <crazyjvm@gmail.com> Closes #3620 from CrazyJvm/streaming-foreachRDD and squashes the following commits: b72886b [CrazyJvm] do you mean inadvertently?
* [SPARK-4005][CORE] handle message replies in receive instead of in the ↵Zhang, Liye2014-12-051-10/+7
| | | | | | | | | | | | | | | individual private methods In BlockManagerMasterActor, when handling the UpdateBlockInfo message type, the reply is sent from individual private methods; it should instead be handled in the Akka receive method. Author: Zhang, Liye <liye.zhang@intel.com> Closes #2853 from liyezhang556520/akkaRecv and squashes the following commits: 9b06f0a [Zhang, Liye] remove the unreachable code bf518cd [Zhang, Liye] change the indent 242166b [Zhang, Liye] modified accroding to the comments d4b929b [Zhang, Liye] [SPARK-4005][CORE] handle message replies in receive instead of in the individual private methods
* [SPARK-4761][SQL] Enables Kryo by default in Spark SQL Thrift serverCheng Lian2014-12-051-2/+12
| | | | | | | | | | | | | | Enables Kryo and disables reference tracking by default in Spark SQL Thrift server. Configurations explicitly defined by users in `spark-defaults.conf` are respected (the Thrift server is started by `spark-submit`, which handles configuration properties properly). Author: Cheng Lian <lian@databricks.com> Closes #3621 from liancheng/kryo-by-default and squashes the following commits: 70c2775 [Cheng Lian] Enables Kryo by default in Spark SQL Thrift server
* [SPARK-4753][SQL] Use catalyst for partition pruning in newParquet.Michael Armbrust2014-12-041-30/+28
| | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #3613 from marmbrus/parquetPartitionPruning and squashes the following commits: 4f138f8 [Michael Armbrust] Use catalyst for partition pruning in newParquet.
* Revert "SPARK-2624 add datanucleus jars to the container in yarn-cluster"Andrew Or2014-12-043-157/+0
| | | | This reverts commit a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53.
* Revert "[HOT FIX] [YARN] Check whether `/lib` exists before listing its files"Andrew Or2014-12-041-15/+12
| | | | This reverts commit 90ec643e9af4c8bbb9000edca08c07afb17939c7.
* [SPARK-4464] Description about configuration options need to be modified in ↵Masayoshi TSUZUKI2014-12-041-2/+10
| | | | | | | | | | | | | | docs. Added description about -h and -host. Modified description about -i and -ip which are now deprecated. Added description about --properties-file. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #3329 from tsudukim/feature/SPARK-4464 and squashes the following commits: 6c07caf [Masayoshi TSUZUKI] [SPARK-4464] Description about configuration options need to be modified in docs.
* Fix typo in Spark SQL docs.Andy Konwinski2014-12-041-1/+1
| | | | | | | | Author: Andy Konwinski <andykonwinski@gmail.com> Closes #3611 from andyk/patch-3 and squashes the following commits: 7bab333 [Andy Konwinski] Fix typo in Spark SQL docs.
* [SPARK-4421] Wrong link in spark-standalone.htmlMasayoshi TSUZUKI2014-12-041-1/+1
| | | | | | | | | | Modified the link of building Spark. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #3279 from tsudukim/feature/SPARK-4421 and squashes the following commits: 56e31c1 [Masayoshi TSUZUKI] Modified the link of building Spark.
* [SPARK-4397] Move object RDD to the front of RDD.scala.Reynold Xin2014-12-042-11/+26
| | | | | | | | | | | I ran into multiple cases where the SBT/Scala compiler was confused by the implicits in continuous compilation mode. Adding explicit return types fixes the problem. Author: Reynold Xin <rxin@databricks.com> Closes #3580 from rxin/rdd-implicit and squashes the following commits: ee32fcd [Reynold Xin] Move object RDD to the end of the file. b8562c9 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into rdd-implicit d4e9f85 [Reynold Xin] Code review. a836a37 [Reynold Xin] Move object RDD to the front of RDD.scala.
* [SPARK-4652][DOCS] Add docs about spark-git-repo optionlewuathe2014-12-041-0/+5
| | | | | | | | | | | | | | | | There might be some cases when a work-in-progress (WIP) Spark version needs to be run on an EC2 cluster. To set up this type of cluster more easily, add a description of the --spark-git-repo option to the EC2 documentation. Author: lewuathe <lewuathe@me.com> Author: Josh Rosen <joshrosen@databricks.com> Closes #3513 from Lewuathe/doc-for-development-spark-cluster and squashes the following commits: 6dae8ee [lewuathe] Wrap consistent with other descriptions cfaf9be [lewuathe] Add docs about spark-git-repo option (Editing / cleanup by Josh Rosen)
* [SPARK-4459] Change groupBy type parameter from K to USaldanha2014-12-042-7/+51
| | | | | | | | | | | | Please see https://issues.apache.org/jira/browse/SPARK-4459 Author: Saldanha <saldaal1@phusca-l24858.wlan.na.novartis.net> Closes #3327 from alokito/master and squashes the following commits: 54b1095 [Saldanha] [SPARK-4459] changed type parameter for keyBy from K to U d5f73c3 [Saldanha] [SPARK-4459] added keyBy test 316ad77 [Saldanha] SPARK-4459 changed type parameter for groupBy from K to U. 62ddd4b [Saldanha] SPARK-4459 added failing unit test
* [SPARK-4745] Fix get_existing_cluster() function with multiple security groupsalexdebrie2014-12-041-2/+2
| | | | | | | | | | The current get_existing_cluster() function would only find an instance belonging to a cluster if the instance's security groups == cluster_name + "-master" (or "-slaves"). This fix allows for multiple security groups by checking if the cluster_name + "-master" security group is in the list of groups for a particular instance. Author: alexdebrie <alexdebrie1@gmail.com> Closes #3596 from alexdebrie/master and squashes the following commits: 9d51232 [alexdebrie] Fix get_existing_cluster() function with multiple security groups
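The difference between the equality check and the membership check can be illustrated with a short Python sketch. The function names and the plain list of group names are assumptions made for this example; the real script works with EC2 instance objects.

```python
def matches_exactly(instance_group_names, cluster_name, role):
    # Behaviour described as the old one: the instance matches only when its
    # groups are exactly the single expected group.
    return instance_group_names == [cluster_name + "-" + role]


def matches_by_membership(instance_group_names, cluster_name, role):
    # Behaviour described as the fix: the expected group only has to appear
    # somewhere in the instance's list of groups.
    return (cluster_name + "-" + role) in instance_group_names


# An instance that belongs to an extra security group besides the cluster's.
groups = ["default", "mycluster-master"]
old_result = matches_exactly(groups, "mycluster", "master")
new_result = matches_by_membership(groups, "mycluster", "master")
```

With the extra `default` group, only the membership check still recognizes the instance as the cluster's master.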
* [HOTFIX] Fixing two issues with the release script.Patrick Wendell2014-12-041-11/+20
| | | | | | | | | | | 1. The version replacement was still producing some false changes. 2. Uploads to the staging repo specifically. Author: Patrick Wendell <pwendell@gmail.com> Closes #3608 from pwendell/release-script and squashes the following commits: 3c63294 [Patrick Wendell] Fixing two issues with the release script:
* [SPARK-4253] Ignore spark.driver.host in yarn-cluster and standalone-cluster ↵WangTaoTheTonic2014-12-042-1/+6
| | | | | | | | | | | | | | | | | | | | modes In yarn-cluster and standalone-cluster modes, we don't know where the driver will run until it is launched. If the `spark.driver.host` property is set on the submitting machine and propagated to the driver through SparkConf, then this will lead to errors when the driver launches. This patch fixes this issue by dropping the `spark.driver.host` property in SparkSubmit when running in a cluster deploy mode. Author: WangTaoTheTonic <barneystinson@aliyun.com> Author: WangTao <barneystinson@aliyun.com> Closes #3112 from WangTaoTheTonic/SPARK4253 and squashes the following commits: ed1a25c [WangTaoTheTonic] revert unrelated formatting issue 02c4e49 [WangTao] add comment 32a3f3f [WangTaoTheTonic] ingore it in SparkSubmit instead of SparkContext 667cf24 [WangTaoTheTonic] document fix ff8d5f7 [WangTaoTheTonic] also ignore it in standalone cluster mode 2286e6b [WangTao] ignore spark.driver.host in yarn-cluster mode
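As a rough illustration of dropping the property at submit time, consider the Python sketch below. It is not SparkSubmit's code: `effective_properties` and the `"cluster"`/`"client"` strings are hypothetical stand-ins for how the deploy mode might be represented.

```python
def effective_properties(properties, deploy_mode):
    """Return the properties to pass on to the driver.

    In a cluster deploy mode the driver's host is not known at submit time,
    so a spark.driver.host value set on the submitting machine is dropped.
    """
    result = dict(properties)  # never mutate the caller's configuration
    if deploy_mode == "cluster":
        result.pop("spark.driver.host", None)
    return result


submitted = {"spark.driver.host": "laptop.local", "spark.app.name": "demo"}
for_cluster = effective_properties(submitted, "cluster")
for_client = effective_properties(submitted, "client")
```

In client mode the property is kept, since the driver really does run on the submitting machine.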
* [SPARK-4683][SQL] Add a beeline.cmd to run on WindowsCheng Lian2014-12-041-0/+21
| | | | | | | | | | | | | | | | | | Tested locally with a Win7 VM. Connected to a Spark SQL Thrift server instance running on Mac OS X with the following command line: ``` bin\beeline.cmd -u jdbc:hive2://10.0.2.2:10000 -n lian ``` Author: Cheng Lian <lian@databricks.com> Closes #3599 from liancheng/beeline.cmd and squashes the following commits: 79092e7 [Cheng Lian] Windows script for BeeLine
* [FIX][DOC] Fix broken links in ml-guide.mdXiangrui Meng2014-12-044-7/+5
| | | | | | | | | | | | and some minor changes in ScalaDoc. Author: Xiangrui Meng <meng@databricks.com> Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits: c559768 [Xiangrui Meng] minor code update ce94da8 [Xiangrui Meng] Java Bean -> JavaBean 0b5c182 [Xiangrui Meng] fix links in ml-guide
* [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixesJoseph K. Bradley2014-12-0417-24/+1205
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Documentation: * Added ml-guide.md, linked from mllib-guide.md * Updated mllib-guide.md with small section pointing to ml-guide.md Examples: * CrossValidatorExample * SimpleParamsExample * (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md) Bug fixes: * PipelineModel: did not use ParamMaps correctly * UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!) CC: mengxr shivaram etrain Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete. Author: Joseph K. Bradley <joseph@databricks.com> Author: jkbradley <joseph.kurata.bradley@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #3588 from jkbradley/ml-package-docs and squashes the following commits: d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml c38469c [Joseph K. Bradley] Updated ml-guide with CV examples 99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold. ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs 3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype 41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
* [docs] Fix outdated comment in tuning guideJoseph K. Bradley2014-12-041-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | When you use the SPARK_JAVA_OPTS env variable, Spark complains: ``` SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps '). This is deprecated in Spark 1.0+. Please instead use: - ./spark-submit with conf/spark-defaults.conf to set defaults for an application - ./spark-submit with --driver-java-options to set -X options for a driver - spark.executor.extraJavaOptions to set -X options for executors - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker) ``` This updates the docs to redirect the user to the relevant part of the configuration docs. CC: mengxr but please CC someone else as needed Author: Joseph K. Bradley <joseph@databricks.com> Closes #3592 from jkbradley/tuning-doc and squashes the following commits: 0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide
* [SQL] Minor: Avoid calling Seq#size in a loopAaron Davidson2014-12-041-3/+3
| | | | | | | | | | Just found this instance while doing some jstack-based profiling of a Spark SQL job. It is very unlikely that this is causing much of a perf issue anywhere, but it is unnecessarily suboptimal. Author: Aaron Davidson <aaron@databricks.com> Closes #3593 from aarondav/seq-opt and squashes the following commits: 962cdfc [Aaron Davidson] [SQL] Minor: Avoid calling Seq#size in a loop
* [SPARK-4685] Include all spark.ml and spark.mllib packages in JavaDoc's ↵lewuathe2014-12-041-1/+4
| | | | | | | | | | | | | | | | | | MLlib group This is #3554 from Lewuathe except that I put both `spark.ml` and `spark.mllib` in the group `MLlib`. Closes #3554 jkbradley Author: lewuathe <lewuathe@me.com> Author: Xiangrui Meng <meng@databricks.com> Closes #3598 from mengxr/Lewuathe-modify-javadoc-setting and squashes the following commits: 184609a [Xiangrui Meng] merge spark.ml and spark.mllib into the same group in javadoc f7535e6 [lewuathe] [SPARK-4685] Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
* [SPARK-4719][API] Consolidate various narrow dep RDD classes with ↵Reynold Xin2014-12-0410-245/+55
| | | | | | | | | | | | | MapPartitionsRDD MappedRDD, MappedValuesRDD, FlatMappedValuesRDD, FilteredRDD, GlommedRDD, FlatMappedRDD are not necessary. They can be implemented trivially using MapPartitionsRDD. Author: Reynold Xin <rxin@databricks.com> Closes #3578 from rxin/SPARK-4719 and squashes the following commits: eed9853 [Reynold Xin] Preserve partitioning for filter. eb1a89b [Reynold Xin] [SPARK-4719][API] Consolidate various narrow dep RDD classes with MapPartitionsRDD.
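The claim that these operations reduce to a per-partition transformation can be sketched in plain Python, with lists of lists standing in for partitioned data. All function names here are hypothetical; this is not Spark's API, only an illustration of the idea.

```python
def map_partitions(partitions, f):
    """Apply f (iterator -> iterator) to each partition independently."""
    return [list(f(iter(partition))) for partition in partitions]


def rdd_map(partitions, g):
    return map_partitions(partitions, lambda it: (g(x) for x in it))


def rdd_filter(partitions, predicate):
    return map_partitions(partitions, lambda it: (x for x in it if predicate(x)))


def rdd_flat_map(partitions, g):
    return map_partitions(partitions, lambda it: (y for x in it for y in g(x)))


def rdd_map_values(partitions, g):
    return map_partitions(partitions, lambda it: ((k, g(v)) for k, v in it))


def rdd_glom(partitions):
    # Each partition becomes a single element holding all of its items.
    return map_partitions(partitions, lambda it: iter([list(it)]))
```

Each operation only supplies a different iterator-to-iterator function, which is why a single class parameterized by that function is enough.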
* [SQL] remove unnecessary importJacky Li2014-12-041-1/+0
| | | | | | | | Author: Jacky Li <jacky.likun@huawei.com> Closes #3585 from jackylk/remove and squashes the following commits: 045423d [Jacky Li] remove unnecessary import
* MAINTENANCE: Automated closing of pull requests.Patrick Wendell2014-12-030-0/+0
| | | | | | | | This commit exists to close the following pull requests on Github: Closes #1875 (close requested by 'marmbrus') Closes #3566 (close requested by 'andrewor14') Closes #3487 (close requested by 'pwendell')
* [Release] Correctly translate contributors name in release notesAndrew Or2014-12-034-55/+229
| | | | | | | | | | | | | | | | | | | | | | | | | | This commit involves three main changes: (1) It separates the translation of contributor names from the generation of the contributors list. This is largely motivated by the Github API limit; even if we exceed this limit, we should at least be able to proceed manually as before. This is why the translation logic is abstracted into its own script translate-contributors.py. (2) When we look for candidate replacements for invalid author names, we should look for the assignees of the associated JIRAs too. As a result, the intermediate file must keep track of these. (3) This provides an interactive mode with which the user can sit at the terminal and manually pick the candidate replacement that he/she thinks makes the most sense. As before, there is a non-interactive mode that picks the first candidate that the script considers "valid." TODO: We should have a known_contributors file that stores known mappings so we don't have to go through all of this translation every time. This is also valuable because some contributors simply cannot be automatically translated.
* [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + ↵Joseph K. Bradley2014-12-0419-182/+1140
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DecisionTree API fix Major changes: * Added programming guide sections for tree ensembles * Added examples for tree ensembles * Updated DecisionTree programming guide with more info on parameters * **API change**: Standardized the tree parameter for the number of classes (for classification) Minor changes: * Updated decision tree documentation * Updated existing tree and tree ensemble examples * Use train/test split, and compute test error instead of training error. * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix) Note: I know this is a lot of lines, but most is covered by: * Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.) * New examples (which were copied from the programming guide) * The "numClasses" renaming I have run all examples and relevant unit tests. CC: mengxr manishamde codedeft Author: Joseph K. Bradley <joseph@databricks.com> Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes #3461 from jkbradley/ensemble-docs and squashes the following commits: 70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide 8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide 6fab846 [Joseph K. Bradley] small fixes based on review b9f8576 [Joseph K. Bradley] updated decision tree doc 375204c [Joseph K. Bradley] fixed python style 2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide. 706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small c76c823 [Joseph K. Bradley] added migration guide for mllib abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder 07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification). cdfdfbc [Joseph K. Bradley] added examples for GBT 6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them. ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples
* [SPARK-4711] [mllib] [docs] Programming guide advice on choosing optimizerJoseph K. Bradley2014-12-042-9/+18
| | | | | | | | | | | | | | I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #3569 from jkbradley/lr-doc and squashes the following commits: 654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization 5035ad0 [Joseph K. Bradley] updated based on review 94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method
* [SPARK-4085] Propagate FetchFailedException when Spark fails to read local ↵Reynold Xin2014-12-033-13/+40
| | | | | | | | | | | | | | | | shuffle file. cc aarondav kayousterhout pwendell This should go into 1.2? Author: Reynold Xin <rxin@databricks.com> Closes #3579 from rxin/SPARK-4085 and squashes the following commits: 255b4fd [Reynold Xin] Updated test. f9814d9 [Reynold Xin] Code review feedback. 2afaf35 [Reynold Xin] [SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file.
* [SPARK-4498][core] Don't transition ExecutorInfo to RUNNING until Driver ↵Mark Hamstra2014-12-032-2/+1
| | | | | | | | | | | | | | adds Executor The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master. Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting. JoshRosen Author: Mark Hamstra <markhamstra@gmail.com> Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits: 8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver
* [SPARK-4552][SQL] Avoid exception when reading empty parquet data through HiveMichael Armbrust2014-12-033-45/+62
| | | | | | | | | | | This is a very small fix that catches one specific exception and returns an empty table. #3441 will address this in a more principled way. Author: Michael Armbrust <michael@databricks.com> Closes #3586 from marmbrus/fixEmptyParquet and squashes the following commits: 2781d9f [Michael Armbrust] Handle empty lists for newParquet 04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive
* [HOT FIX] [YARN] Check whether `/lib` exists before listing its filesAndrew Or2014-12-031-12/+15
| | | | | | | | | | This is caused by a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53 Author: Andrew Or <andrew@databricks.com> Closes #3589 from andrewor14/yarn-hot-fix and squashes the following commits: a4fad5f [Andrew Or] Check whether lib directory exists before listing its files
* [SPARK-4642] Add description about spark.yarn.queue to running-on-YARN document.Masayoshi TSUZUKI2014-12-031-1/+8
| | | | | | | | | | | | | | | | Added descriptions about these parameters. - spark.yarn.queue Modified description about the default value of this parameter. - spark.yarn.submit.file.replication Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #3500 from tsudukim/feature/SPARK-4642 and squashes the following commits: ce99655 [Masayoshi TSUZUKI] better gramatically. 21cf624 [Masayoshi TSUZUKI] Removed intentionally undocumented properties. 88cac9b [Masayoshi TSUZUKI] [SPARK-4642] Documents about running-on-YARN needs update
* [SPARK-4715][Core] Make sure tryToAcquire won't return a negative valuezsxwing2014-12-032-3/+19
| | | | | | | | | | ShuffleMemoryManager.tryToAcquire may return a negative value. The unit test demonstrates this bug. It will output `0 did not equal -200 granted is negative`. Author: zsxwing <zsxwing@gmail.com> Closes #3575 from zsxwing/SPARK-4715 and squashes the following commits: a193ae6 [zsxwing] Make sure tryToAcquire won't return a negative value
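One way such a negative grant can arise, and how clamping avoids it, is sketched below in Python. The fair-share rule used here (each active thread may hold at most an equal slice of the pool) is an assumption made for illustration and is not taken from the commit; the function names are hypothetical as well.

```python
def unclamped_grant(requested, max_memory, num_active_threads, held_by_thread):
    # Hypothetical fair-share rule: a thread may hold up to an equal slice.
    fair_share = max_memory // num_active_threads
    # If more threads became active after this thread acquired memory, it may
    # already hold more than its slice, making the headroom negative.
    headroom = fair_share - held_by_thread
    return min(requested, headroom)


def bytes_to_grant(requested, max_memory, num_active_threads, held_by_thread):
    fair_share = max_memory // num_active_threads
    headroom = fair_share - held_by_thread
    # Clamp at zero so a caller is never "granted" a negative amount.
    return min(requested, max(0, headroom))
```

A thread holding 400 bytes of a 1000-byte pool shared by 5 threads has a slice of 200, so the unclamped version yields a negative number while the clamped version grants nothing.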
* [SPARK-4701] Typo in sbt/sbtMasayoshi TSUZUKI2014-12-031-2/+2
| | | | | | | | | | | Modified typo. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #3560 from tsudukim/feature/SPARK-4701 and squashes the following commits: ed2a3f1 [Masayoshi TSUZUKI] Another whitespace position error. 1af3a35 [Masayoshi TSUZUKI] [SPARK-4701] Typo in sbt/sbt
* SPARK-2624 add datanucleus jars to the container in yarn-clusterJim Lim2014-12-033-0/+157
| | | | | | | | | | | | | | | | | If `spark-submit` finds the datanucleus jars, it adds them to the driver's classpath, but does not add it to the container. This patch modifies the yarn deployment class to copy all `datanucleus-*` jars found in `[spark-home]/libs` to the container. Author: Jim Lim <jim@quixey.com> Closes #3238 from jimjh/SPARK-2624 and squashes the following commits: 3633071 [Jim Lim] SPARK-2624 update documentation and comments fe95125 [Jim Lim] SPARK-2624 keep java imports together 6c31fe0 [Jim Lim] SPARK-2624 update documentation 6690fbf [Jim Lim] SPARK-2624 add tests d28d8e9 [Jim Lim] SPARK-2624 add spark.yarn.datanucleus.dir option 84e6cba [Jim Lim] SPARK-2624 add datanucleus jars to the container in yarn-cluster
* [SPARK-4717][MLlib] Optimize BLAS library to avoid de-reference multiple ↵DB Tsai2014-12-031-39/+60
| | | | | | | | | | | | | | | times in loop Have a local reference to `values` and `indices` array in the `Vector` object so JVM can locate the value with one operation call. See `SPARK-4581` for similar optimization, and the bytecode analysis. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3577 from dbtsai/blasopt and squashes the following commits: 62d38c4 [DB Tsai] formating 0316cef [DB Tsai] first commit
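The local-reference technique can be shown with a small Python analogue. The `SparseVector` class and `dot_sparse_dense` function are hypothetical stand-ins, not MLlib code; in Python the saving comes from skipping repeated attribute lookups, which is only loosely comparable to the JVM bytecode effect the commit refers to.

```python
class SparseVector:
    """Minimal sparse vector: parallel arrays of indices and non-zero values."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = indices
        self.values = values


def dot_sparse_dense(x, y):
    """Dot product of a SparseVector x with a dense list y of the same size."""
    # Bind the arrays to local names once, instead of looking up x.indices
    # and x.values on the object in every loop iteration.
    x_indices = x.indices
    x_values = x.values
    nnz = len(x_indices)
    total = 0.0
    k = 0
    while k < nnz:
        total += x_values[k] * y[x_indices[k]]
        k += 1
    return total
```

The loop body then touches only local variables and the arrays themselves.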
* [SPARK-4708][MLLib] Make k-mean runs two/three times faster with ↵DB Tsai2014-12-035-68/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | dense/sparse sample Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical path, and `breezeSquaredDistance` is slow. We should replace it with our own implementation. Here is the benchmark against the mnist8m dataset. Before: DenseVector 70.04 secs, SparseVector 59.05 secs. With this PR: DenseVector 30.58 secs, SparseVector 21.14 secs. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3565 from dbtsai/kmean and squashes the following commits: 08bc068 [DB Tsai] restyle de24662 [DB Tsai] address feedback b185a77 [DB Tsai] cleanup 4554ddd [DB Tsai] first commit
* [SPARK-4710] [mllib] Eliminate MLlib compilation warningsJoseph K. Bradley2014-12-032-8/+10
| | | | | | | | | | | | | | Renamed StreamingKMeans to StreamingKMeansExample to avoid warning about name conflict with StreamingKMeans class. Added import to DecisionTreeRunner to eliminate warning. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #3568 from jkbradley/ml-compilation-warnings and squashes the following commits: 64d6bc4 [Joseph K. Bradley] Updated DecisionTreeRunner.scala and StreamingKMeans.scala to eliminate compilation warnings, including renaming StreamingKMeans to StreamingKMeansExample.
* [SPARK-4397][Core] Change the 'since' value of '@deprecated' to '1.3.0'zsxwing2014-12-031-18/+18
| | | | | | | | | | As #3262 wasn't merged to branch 1.2, the `since` value of `deprecated` should be '1.3.0'. Author: zsxwing <zsxwing@gmail.com> Closes #3573 from zsxwing/SPARK-4397-version and squashes the following commits: 1daa03c [zsxwing] Change the 'since' value to '1.3.0'
* [SPARK-4672][Core]Checkpoint() should clear f to shorten the serialization chainJerryLead2014-12-021-3/+6
| | | | | | | | | | | | | | | The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672 The f closure of `PartitionsRDD(ZippedPartitionsRDD2)` contains a `$outer` that references EdgeRDD/VertexRDD, which causes the task's serialization chain to become very long in iterative GraphX applications. As a result, a StackOverflow error will occur. If we set "f = null" in `clearDependencies()`, checkpoint() can cut off the long serialization chain. More details and explanation can be found in the JIRA. Author: JerryLead <JerryLead@163.com> Author: Lijie Xu <csxulijie@gmail.com> Closes #3545 from JerryLead/my_core and squashes the following commits: f7faea5 [JerryLead] checkpoint() should clear the f to avoid StackOverflow error c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark 52799e3 [Lijie Xu] Merge pull request #1 from apache/master
* [SPARK-4672][GraphX]Non-transient PartitionsRDDs will lead to StackOverflow ↵JerryLead2014-12-022-2/+2
| | | | | | | | | | | | | | | | | error The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672 In a nutshell, if `val partitionsRDD` in EdgeRDDImpl and VertexRDDImpl are non-transient, the serialization chain can become very long in iterative algorithms and finally lead to the StackOverflow error. More details and explanation can be found in the JIRA. Author: JerryLead <JerryLead@163.com> Author: Lijie Xu <csxulijie@gmail.com> Closes #3544 from JerryLead/my_graphX and squashes the following commits: 628f33c [JerryLead] set PartitionsRDD to be transient in EdgeRDDImpl and VertexRDDImpl c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark 52799e3 [Lijie Xu] Merge pull request #1 from apache/master
* [SPARK-4672][GraphX]Perform checkpoint() on PartitionsRDD to shorten the lineageJerryLead2014-12-022-0/+8
| | | | | | | | | | | | | | | | The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672 Iterative GraphX applications always have long lineage, while checkpoint() on EdgeRDD and VertexRDD themselves cannot shorten the lineage. In contrast, if we perform checkpoint() on their PartitionsRDD, the long lineage can be cut off. Moreover, the existing operations such as cache() in this code are performed on the PartitionsRDD, so checkpoint() should work the same way. More details and explanation can be found in the JIRA. Author: JerryLead <JerryLead@163.com> Author: Lijie Xu <csxulijie@gmail.com> Closes #3549 from JerryLead/my_graphX_checkpoint and squashes the following commits: d1aa8d8 [JerryLead] Perform checkpoint() on PartitionsRDD not VertexRDD and EdgeRDD themselves ff08ed4 [JerryLead] Merge branch 'master' of https://github.com/apache/spark c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark 52799e3 [Lijie Xu] Merge pull request #1 from apache/master
* [Release] Translate unknown author names automaticallyAndrew Or2014-12-022-18/+111