Commit message  Author  Age  Files  Lines
* Revert "HOTFIX: Rolling back incorrect version change"Patrick Wendell2014-12-041-1/+1
| | | | This reverts commit 3a4609eada2ee0bfbcce0f4127b6a5363ae528e5.
* [SPARK-4683][SQL] Add a beeline.cmd to run on WindowsCheng Lian2014-12-041-0/+21
  Tested locally with a Win7 VM. Connected to a Spark SQL Thrift server instance running on Mac OS X with the following command line:

  ```
  bin\beeline.cmd -u jdbc:hive2://10.0.2.2:10000 -n lian
  ```

  Author: Cheng Lian <lian@databricks.com>

  Closes #3599 from liancheng/beeline.cmd and squashes the following commits:

  79092e7 [Cheng Lian] Windows script for BeeLine

  (cherry picked from commit 28c7acacef974fdabd2b9ecc20d0d6cf6c58728f)
  Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [FIX][DOC] Fix broken links in ml-guide.mdXiangrui Meng2014-12-044-7/+5
| | | | | | | | | | | | | | | and some minor changes in ScalaDoc. Author: Xiangrui Meng <meng@databricks.com> Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits: c559768 [Xiangrui Meng] minor code update ce94da8 [Xiangrui Meng] Java Bean -> JavaBean 0b5c182 [Xiangrui Meng] fix links in ml-guide (cherry picked from commit 7e758d709286e73d2c878d4a2d2b4606386142c7) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixesJoseph K. Bradley2014-12-0417-24/+1205
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Documentation: * Added ml-guide.md, linked from mllib-guide.md * Updated mllib-guide.md with small section pointing to ml-guide.md Examples: * CrossValidatorExample * SimpleParamsExample * (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md) Bug fixes: * PipelineModel: did not use ParamMaps correctly * UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!) CC: mengxr shivaram etrain Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete. Author: Joseph K. Bradley <joseph@databricks.com> Author: jkbradley <joseph.kurata.bradley@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #3588 from jkbradley/ml-package-docs and squashes the following commits: d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml c38469c [Joseph K. Bradley] Updated ml-guide with CV examples 99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold. ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs 3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype 41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works. (cherry picked from commit 469a6e5f3bdd5593b3254bc916be8236e7c6cb74) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [docs] Fix outdated comment in tuning guideJoseph K. Bradley2014-12-041-2/+1
  When you use the SPARK_JAVA_OPTS env variable, Spark complains:

  ```
  SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ').
  This is deprecated in Spark 1.0+.

  Please instead use:
   - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
   - ./spark-submit with --driver-java-options to set -X options for a driver
   - spark.executor.extraJavaOptions to set -X options for executors
   - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
  ```

  This updates the docs to redirect the user to the relevant part of the configuration docs.

  CC: mengxr but please CC someone else as needed

  Author: Joseph K. Bradley <joseph@databricks.com>

  Closes #3592 from jkbradley/tuning-doc and squashes the following commits:

  0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide

  (cherry picked from commit 529439bd506949f272a2b6f099ea549b097428f3)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SQL] Minor: Avoid calling Seq#size in a loopAaron Davidson2014-12-041-3/+3
| | | | | | | | | | | | | Just found this instance while doing some jstack-based profiling of a Spark SQL job. It is very unlikely that this is causing much of a perf issue anywhere, but it is unnecessarily suboptimal. Author: Aaron Davidson <aaron@databricks.com> Closes #3593 from aarondav/seq-opt and squashes the following commits: 962cdfc [Aaron Davidson] [SQL] Minor: Avoid calling Seq#size in a loop (cherry picked from commit c6c7165e7ecf1690027d6bd4e0620012cd0d2310) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-4685] Include all spark.ml and spark.mllib packages in JavaDoc's MLlib group  lewuathe  2014-12-04  1  -1/+4
  This is #3554 from Lewuathe except that I put both `spark.ml` and `spark.mllib` in the group `MLlib`.

  Closes #3554

  jkbradley

  Author: lewuathe <lewuathe@me.com>
  Author: Xiangrui Meng <meng@databricks.com>

  Closes #3598 from mengxr/Lewuathe-modify-javadoc-setting and squashes the following commits:

  184609a [Xiangrui Meng] merge spark.ml and spark.mllib into the same group in javadoc
  f7535e6 [lewuathe] [SPARK-4685] Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

  (cherry picked from commit 20bfea4ab7c0923e8d3f039d0c5098669db4d5b0)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [Release] Correctly translate contributors name in release notesAndrew Or2014-12-034-56/+230
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit involves three main changes: (1) It separates the translation of contributor names from the generation of the contributors list. This is largely motivated by the Github API limit; even if we exceed this limit, we should at least be able to proceed manually as before. This is why the translation logic is abstracted into its own script translate-contributors.py. (2) When we look for candidate replacements for invalid author names, we should look for the assignees of the associated JIRAs too. As a result, the intermediate file must keep track of these. (3) This provides an interactive mode with which the user can sit at the terminal and manually pick the candidate replacement that he/she thinks makes the most sense. As before, there is a non-interactive mode that picks the first candidate that the script considers "valid." TODO: We should have a known_contributors file that stores known mappings so we don't have to go through all of this translation every time. This is also valuable because some contributors simply cannot be automatically translated. Conflicts: .gitignore
* [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix  Joseph K. Bradley  2014-12-04  19  -182/+1140
  Major changes:
  * Added programming guide sections for tree ensembles
  * Added examples for tree ensembles
  * Updated DecisionTree programming guide with more info on parameters
  * **API change**: Standardized the tree parameter for the number of classes (for classification)

  Minor changes:
  * Updated decision tree documentation
  * Updated existing tree and tree ensemble examples
  * Use train/test split, and compute test error instead of training error.
  * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)

  Note: I know this is a lot of lines, but most is covered by:
  * Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
  * New examples (which were copied from the programming guide)
  * The "numClasses" renaming

  I have run all examples and relevant unit tests.

  CC: mengxr manishamde codedeft

  Author: Joseph K. Bradley <joseph@databricks.com>
  Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

  Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

  70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
  d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
  8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
  6fab846 [Joseph K. Bradley] small fixes based on review
  b9f8576 [Joseph K. Bradley] updated decision tree doc
  375204c [Joseph K. Bradley] fixed python style
  2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide.
  706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
  c76c823 [Joseph K. Bradley] added migration guide for mllib
  abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
  07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
  cdfdfbc [Joseph K. Bradley] added examples for GBT
  6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them.
  ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples

  (cherry picked from commit 657a88835d8bf22488b53d50f75281d7dc32442e)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-4711] [mllib] [docs] Programming guide advice on choosing optimizerJoseph K. Bradley2014-12-042-9/+18
| | | | | | | | | | | | | | | | | I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #3569 from jkbradley/lr-doc and squashes the following commits: 654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization 5035ad0 [Joseph K. Bradley] updated based on review 94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method (cherry picked from commit 27ab0b8a03b711e8d86b6167df833f012205ccc7) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-4085] Propagate FetchFailedException when Spark fails to read local ↵Reynold Xin2014-12-033-13/+40
| | | | | | | | | | | | | | | | | | | shuffle file. cc aarondav kayousterhout pwendell This should go into 1.2? Author: Reynold Xin <rxin@databricks.com> Closes #3579 from rxin/SPARK-4085 and squashes the following commits: 255b4fd [Reynold Xin] Updated test. f9814d9 [Reynold Xin] Code review feedback. 2afaf35 [Reynold Xin] [SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file. (cherry picked from commit 1826372d0a1bc80db9015106dd5d2d155ada33f5) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-4498][core] Don't transition ExecutorInfo to RUNNING until Driver ↵Mark Hamstra2014-12-032-2/+1
| | | | | | | | | | | | adds Executor The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master. Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting. Author: Mark Hamstra <markhamstra@gmail.com> Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits: 8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver
* [SPARK-4552][SQL] Avoid exception when reading empty parquet data through HiveMichael Armbrust2014-12-033-45/+62
| | | | | | | | | | | | | | This is a very small fix that catches one specific exception and returns an empty table. #3441 will address this in a more principled way. Author: Michael Armbrust <michael@databricks.com> Closes #3586 from marmbrus/fixEmptyParquet and squashes the following commits: 2781d9f [Michael Armbrust] Handle empty lists for newParquet 04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive (cherry picked from commit 513ef82e85661552e596d0b483b645ac24e86d4d) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [HOT FIX] [YARN] Check whether `/lib` exists before listing its filesAndrew Or2014-12-031-12/+15
| | | | | | | | | | | | | This is caused by a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53 Author: Andrew Or <andrew@databricks.com> Closes #3589 from andrewor14/yarn-hot-fix and squashes the following commits: a4fad5f [Andrew Or] Check whether lib directory exists before listing its files (cherry picked from commit 90ec643e9af4c8bbb9000edca08c07afb17939c7) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-4642] Add description about spark.yarn.queue to running-on-YARN document.Masayoshi TSUZUKI2014-12-031-1/+8
  Added descriptions about these parameters.
  - spark.yarn.queue

  Modified description about the default value of this parameter.
  - spark.yarn.submit.file.replication

  Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

  Closes #3500 from tsudukim/feature/SPARK-4642 and squashes the following commits:

  ce99655 [Masayoshi TSUZUKI] better gramatically.
  21cf624 [Masayoshi TSUZUKI] Removed intentionally undocumented properties.
  88cac9b [Masayoshi TSUZUKI] [SPARK-4642] Documents about running-on-YARN needs update

  (cherry picked from commit 692f49378f7d384d5c9c5ab7451a1c1e66f91c50)
  Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-4715][Core] Make sure tryToAcquire won't return a negative valuezsxwing2014-12-032-3/+19
| | | | | | | | | | | | | ShuffleMemoryManager.tryToAcquire may return a negative value. The unit test demonstrates this bug. It will output `0 did not equal -200 granted is negative`. Author: zsxwing <zsxwing@gmail.com> Closes #3575 from zsxwing/SPARK-4715 and squashes the following commits: a193ae6 [zsxwing] Make sure tryToAcquire won't return a negative value (cherry picked from commit edd3cd477c9d6016bd977c2fa692fdeff5a6e198) Signed-off-by: Andrew Or <andrew@databricks.com>
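The invariant behind the SPARK-4715 fix can be sketched in a few lines. This is a minimal illustration in Python, not Spark's `ShuffleMemoryManager`: the `MemoryPool` class, its fields, and the equal-share policy are hypothetical. It only shows how a raw difference can go negative once a caller already holds more than its share, and how clamping at zero avoids that.

```python
class MemoryPool:
    """Toy memory pool (hypothetical; not Spark's ShuffleMemoryManager)."""

    def __init__(self, max_memory):
        self.max_memory = max_memory
        self.held = {}  # bytes currently held, keyed by thread id

    def try_to_acquire(self, thread_id, num_bytes):
        """Grant up to num_bytes to thread_id; the result is never negative."""
        self.held.setdefault(thread_id, 0)
        # Assumed policy: each active thread may hold at most an equal share.
        fair_share = self.max_memory // len(self.held)
        headroom = fair_share - self.held[thread_id]
        free = self.max_memory - sum(self.held.values())
        # headroom is negative when the thread already exceeds its share,
        # so clamp at zero instead of returning the raw difference.
        granted = max(0, min(num_bytes, headroom, free))
        self.held[thread_id] += granted
        return granted


pool = MemoryPool(max_memory=1000)
first = pool.try_to_acquire("t1", 700)   # t1 is alone, so it may take 700
second = pool.try_to_acquire("t2", 100)  # the share is now 500 per thread
third = pool.try_to_acquire("t1", 100)   # t1 holds 700 > 500, grant is 0
```

Without the `max(0, ...)` clamp, the third call in this toy model would return -200, the same kind of negative grant the unit test in the commit message reports.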
* [SPARK-4701] Typo in sbt/sbtMasayoshi TSUZUKI2014-12-031-2/+2
| | | | | | | | | | | | | | Modified typo. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #3560 from tsudukim/feature/SPARK-4701 and squashes the following commits: ed2a3f1 [Masayoshi TSUZUKI] Another whitespace position error. 1af3a35 [Masayoshi TSUZUKI] [SPARK-4701] Typo in sbt/sbt (cherry picked from commit 96786e3ee53a13a57463b74bec0e77b172f719a3) Signed-off-by: Andrew Or <andrew@databricks.com>
* SPARK-2624 add datanucleus jars to the container in yarn-clusterJim Lim2014-12-033-0/+157
| | | | | | | | | | | | | | | | | If `spark-submit` finds the datanucleus jars, it adds them to the driver's classpath, but does not add it to the container. This patch modifies the yarn deployment class to copy all `datanucleus-*` jars found in `[spark-home]/libs` to the container. Author: Jim Lim <jim@quixey.com> Closes #3238 from jimjh/SPARK-2624 and squashes the following commits: 3633071 [Jim Lim] SPARK-2624 update documentation and comments fe95125 [Jim Lim] SPARK-2624 keep java imports together 6c31fe0 [Jim Lim] SPARK-2624 update documentation 6690fbf [Jim Lim] SPARK-2624 add tests d28d8e9 [Jim Lim] SPARK-2624 add spark.yarn.datanucleus.dir option 84e6cba [Jim Lim] SPARK-2624 add datanucleus jars to the container in yarn-cluster
* [SPARK-4717][MLlib] Optimize BLAS library to avoid de-reference multiple ↵DB Tsai2014-12-031-39/+60
| | | | | | | | | | | | | | | | | | times in loop Have a local reference to `values` and `indices` array in the `Vector` object so JVM can locate the value with one operation call. See `SPARK-4581` for similar optimization, and the bytecode analysis. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3577 from dbtsai/blasopt and squashes the following commits: 62d38c4 [DB Tsai] formating 0316cef [DB Tsai] first commit (cherry picked from commit d00542987ed80635782dcc826fc0bdbf434fff10) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-4708][MLLib] Make k-mean runs two/three times faster with dense/sparse sample  DB Tsai  2014-12-03  5  -68/+70
  Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical path, and `breezeSquaredDistance` is slow. We should replace it with our own implementation.

  Here is the benchmark against mnist8m dataset.

  Before
  DenseVector: 70.04secs
  SparseVector: 59.05secs

  With this PR
  DenseVector: 30.58secs
  SparseVector: 21.14secs

  Author: DB Tsai <dbtsai@alpinenow.com>

  Closes #3565 from dbtsai/kmean and squashes the following commits:

  08bc068 [DB Tsai] restyle
  de24662 [DB Tsai] address feedback
  b185a77 [DB Tsai] cleanup
  4554ddd [DB Tsai] first commit

  (cherry picked from commit 7fc49ed91168999d24ae7b4cc46fbb4ec87febc1)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
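To make the idea of a hand-written squared distance concrete, here is a small sketch in Python. It is not the code from this PR and not MLlib's `fastSquaredDistance`. The function name and the representation of a sparse vector as parallel sorted index and value lists are assumptions for illustration. It shows the general technique of walking the stored entries of two sparse vectors directly instead of going through a generic iterator.

```python
def sparse_squared_distance(idx_a, val_a, idx_b, val_b):
    """Squared Euclidean distance between two sparse vectors.

    Each vector is given as a sorted list of indices and a parallel
    list of the values stored at those indices (hypothetical layout).
    """
    i, j, total = 0, 0, 0.0
    # Merge the two sorted index lists, visiting each stored entry once.
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            diff = val_a[i] - val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            diff = val_a[i]  # b is implicitly zero at this index
            i += 1
        else:
            diff = val_b[j]  # a is implicitly zero at this index
            j += 1
        total += diff * diff
    # Entries left over in either vector face implicit zeros in the other.
    for k in range(i, len(idx_a)):
        total += val_a[k] * val_a[k]
    for k in range(j, len(idx_b)):
        total += val_b[k] * val_b[k]
    return total


# a = (1, 0, 0, 2, 0, 0) and b = (0, 0, 0, 4, 0, 3)
d = sparse_squared_distance([0, 3], [1.0, 2.0], [3, 5], [4.0, 3.0])
```

The loop touches only the stored entries, so its cost depends on the number of non-zeros rather than on the vector dimension.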
* [SPARK-4710] [mllib] Eliminate MLlib compilation warningsJoseph K. Bradley2014-12-032-8/+10
| | | | | | | | | | | | | | | | | Renamed StreamingKMeans to StreamingKMeansExample to avoid warning about name conflict with StreamingKMeans class. Added import to DecisionTreeRunner to eliminate warning. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #3568 from jkbradley/ml-compilation-warnings and squashes the following commits: 64d6bc4 [Joseph K. Bradley] Updated DecisionTreeRunner.scala and StreamingKMeans.scala to eliminate compilation warnings, including renaming StreamingKMeans to StreamingKMeansExample. (cherry picked from commit 4ac21511547dc6227d05bf61821cd2d9ab5ede74) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-4672][Core]Checkpoint() should clear f to shorten the serialization chainJerryLead2014-12-021-3/+6
  The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

  The f closure of `PartitionsRDD(ZippedPartitionsRDD2)` contains a `$outer` that references EdgeRDD/VertexRDD, which causes the task's serialization chain to become very long in iterative GraphX applications. As a result, a StackOverflow error will occur. If we set "f = null" in `clearDependencies()`, checkpoint() can cut off the long serialization chain. More details and explanation can be found in the JIRA.

  Author: JerryLead <JerryLead@163.com>
  Author: Lijie Xu <csxulijie@gmail.com>

  Closes #3545 from JerryLead/my_core and squashes the following commits:

  f7faea5 [JerryLead] checkpoint() should clear the f to avoid StackOverflow error
  c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
  52799e3 [Lijie Xu] Merge pull request #1 from apache/master

  (cherry picked from commit 77be8b986fd21b7bbe28aa8db1042cb22bc74fe7)
  Signed-off-by: Ankur Dave <ankurdave@gmail.com>
* [SPARK-4672][GraphX]Non-transient PartitionsRDDs will lead to StackOverflow ↵JerryLead2014-12-022-2/+2
| | | | | | | | | | | | | | | | | | | | error The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672 In a nutshell, if `val partitionsRDD` in EdgeRDDImpl and VertexRDDImpl are non-transient, the serialization chain can become very long in iterative algorithms and finally lead to the StackOverflow error. More details and explanation can be found in the JIRA. Author: JerryLead <JerryLead@163.com> Author: Lijie Xu <csxulijie@gmail.com> Closes #3544 from JerryLead/my_graphX and squashes the following commits: 628f33c [JerryLead] set PartitionsRDD to be transient in EdgeRDDImpl and VertexRDDImpl c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark 52799e3 [Lijie Xu] Merge pull request #1 from apache/master (cherry picked from commit 17c162f6682520e6e2790626e37da3a074471793) Signed-off-by: Ankur Dave <ankurdave@gmail.com>
* [SPARK-4672][GraphX]Perform checkpoint() on PartitionsRDD to shorten the lineageJerryLead2014-12-022-0/+8
  The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

  Iterative GraphX applications always have long lineage, while checkpoint() on EdgeRDD and VertexRDD themselves cannot shorten the lineage. In contrast, if we perform checkpoint() on their PartitionsRDD, the long lineage can be cut off. Moreover, the existing operations such as cache() in this code are performed on the PartitionsRDD, so checkpoint() should work the same way. More details and explanation can be found in the JIRA.

  Author: JerryLead <JerryLead@163.com>
  Author: Lijie Xu <csxulijie@gmail.com>

  Closes #3549 from JerryLead/my_graphX_checkpoint and squashes the following commits:

  d1aa8d8 [JerryLead] Perform checkpoint() on PartitionsRDD not VertexRDD and EdgeRDD themselves
  ff08ed4 [JerryLead] Merge branch 'master' of https://github.com/apache/spark
  c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
  52799e3 [Lijie Xu] Merge pull request #1 from apache/master

  (cherry picked from commit fc0a1475ef7c8b33363d88adfe8e8f28def5afc7)
  Signed-off-by: Ankur Dave <ankurdave@gmail.com>
* [Release] Translate unknown author names automaticallyAndrew Or2014-12-022-18/+111
* [SPARK-4695][SQL] Get result using executeCollectwangfei2014-12-021-1/+3
  Use `executeCollect` to collect the result, because executeCollect is a custom implementation of collect in Spark SQL that is better than the RDD's collect.

  Author: wangfei <wangfei1@huawei.com>

  Closes #3547 from scwf/executeCollect and squashes the following commits:

  a5ab68e [wangfei] Revert "adding debug info"
  a60d680 [wangfei] fix test failure
  0db7ce8 [wangfei] adding debug info
  184c594 [wangfei] using executeCollect instead collect

  (cherry picked from commit 3ae0cda83c5106136e90d59c20e61db345a5085f)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4670] [SQL] wrong symbol for bitwise notDaoyuan Wang2014-12-022-10/+25
| | | | | | | | | | | | | | | We should use `~` instead of `-` for bitwise NOT. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3528 from adrian-wang/symbol and squashes the following commits: affd4ad [Daoyuan Wang] fix code gen test case 56efb79 [Daoyuan Wang] ensure bitwise NOT over byte and short persist data type f55fbae [Daoyuan Wang] wrong symbol for bitwise not (cherry picked from commit 1f5ddf17e831ad9717f0f4b60a727a3381fad4f9) Signed-off-by: Michael Armbrust <michael@databricks.com>
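The difference between the two symbols is easy to check with plain integers. The snippet below is an illustration in Python, not Spark SQL code. It relies only on the two's-complement identity `~x == -x - 1`, which is why using `-` where `~` is meant gives results that are off by one.

```python
values = [0, 1, 5, -1]

# Bitwise NOT flips every bit; arithmetic negation only changes the sign.
inverted = [~v for v in values]
negated = [-v for v in values]

# Two's-complement identity relating the two operators.
identity_holds = all(~v == -v - 1 for v in values)
```

For the inputs above, `inverted` is `[-1, -2, -6, 0]` while `negated` is `[0, -1, -5, 1]`.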
* [SPARK-4593][SQL] Return null when denominator is 0Daoyuan Wang2014-12-024-5/+83
  SELECT max(1/0) FROM src would return a very large number, which is obviously not right. For hive-0.12, hive would return `Infinity` for 1/0, while for hive-0.13.1, it is `NULL` for 1/0. I think it is better to keep our behavior consistent with the newer Hive version. This PR ensures that when the divisor is 0, the result of the expression is NULL, the same as in hive-0.13.1.

  Author: Daoyuan Wang <daoyuan.wang@intel.com>

  Closes #3443 from adrian-wang/div and squashes the following commits:

  2e98677 [Daoyuan Wang] fix code gen for divide 0
  85c28ba [Daoyuan Wang] temp
  36236a5 [Daoyuan Wang] add test cases
  6f5716f [Daoyuan Wang] fix comments
  cee92bd [Daoyuan Wang] avoid evaluation 2 times
  22ecd9a [Daoyuan Wang] fix style
  cf28c58 [Daoyuan Wang] divide fix
  2dfe50f [Daoyuan Wang] return null when divider is 0 of Double type

  (cherry picked from commit f6df609dcc4f4a18c0f1c74b1ae0800cf09fa7ae)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
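The semantics described for SPARK-4593 can be modelled in a few lines. The following is a sketch in Python with hypothetical helper names (`sql_divide`, `sql_max`), using `None` to stand in for SQL `NULL`. It is not the Catalyst implementation. It shows why `max(1/0)` yields `NULL` once division by zero produces `NULL`, given that SQL aggregates skip `NULL` inputs.

```python
def sql_divide(numerator, denominator):
    """Division with SQL-style NULL semantics (None models NULL)."""
    # A NULL operand or a zero denominator yields NULL, not Infinity.
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def sql_max(values):
    """MAX aggregate: NULL inputs are skipped; no non-NULL input gives NULL."""
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None


only_zero_division = sql_max([sql_divide(1, 0)])
mixed = sql_max([sql_divide(1, 0), sql_divide(6, 3)])
```

Here `only_zero_division` is `None`, and `mixed` is `2.0` because the `NULL` from `1/0` is ignored by the aggregate.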
* [SPARK-4676][SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null  YanTangZhai  2014-12-02  5  -0/+59
      val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
      val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
      val nrdd = jhc.hql("select null from spark_test.for_test")
      println(nrdd.schema)

  Then the error is thrown as follows:

      scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
          at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)

  Author: YanTangZhai <hakeemzhai@tencent.com>
  Author: yantangzhai <tyz0303@163.com>
  Author: Michael Armbrust <michael@databricks.com>

  Closes #3538 from YanTangZhai/MatchNullType and squashes the following commits:

  e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
  4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
  896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
  6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
  e249846 [YanTangZhai] Merge pull request #10 from apache/master
  d26d982 [YanTangZhai] Merge pull request #9 from apache/master
  76d4027 [YanTangZhai] Merge pull request #8 from apache/master
  03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
  8a00106 [YanTangZhai] Merge pull request #6 from apache/master
  cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
  cdef539 [YanTangZhai] Merge pull request #1 from apache/master

  (cherry picked from commit 10664276007beca3843638e558f504cad44b1fb3)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4663][sql]add finally to avoid resource leakbaishuo2014-12-021-4/+7
| | | | | | | | | | | | | Author: baishuo <vc_java@hotmail.com> Closes #3526 from baishuo/master-trycatch and squashes the following commits: d446e14 [baishuo] correct the code style b36bf96 [baishuo] correct the code style ae0e447 [baishuo] add finally to avoid resource leak (cherry picked from commit 69b6fed206565ecb0173d3757bcb5110422887c3) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4536][SQL] Add sqrt and abs to Spark SQL DSLKousuke Saruta2014-12-024-1/+74
  Spark SQL has embedded sqrt and abs, but the DSL doesn't support those functions.

  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

  Closes #3401 from sarutak/dsl-missing-operator and squashes the following commits:

  07700cf [Kousuke Saruta] Modified Literal(null, NullType) to Literal(null) in DslQuerySuite
  8f366f8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
  1b88e2e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
  0396f89 [Kousuke Saruta] Added sqrt and abs to Spark SQL DSL

  (cherry picked from commit e75e04f980281389b881df76f59ba1adc6338629)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4686] Link to allowed master URLs is brokenKay Ousterhout2014-12-021-1/+1
| | | | | | | | | | | | | | | The link points to the old scala programming guide; it should point to the submitting applications page. This should be backported to 1.1.2 (it's been broken as of 1.0). Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #3542 from kayousterhout/SPARK-4686 and squashes the following commits: a8fc43b [Kay Ousterhout] [SPARK-4686] Link to allowed master URLs is broken (cherry picked from commit d9a148ba6a67a01e4bf77c35c41dd4cbc8918c82) Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
* [SPARK-4611][MLlib] Implement the efficient vector normDB Tsai2014-12-024-6/+79
  The vector norm in breeze is implemented by `activeIterator`, which is known to be very slow. In this PR, an efficient vector norm is implemented, and with this API, `Normalizer` and `k-means` see a big performance improvement.

  Here is the benchmark against mnist8m dataset.

  a) `Normalizer`
  Before
  DenseVector: 68.25secs
  SparseVector: 17.01secs
  With this PR
  DenseVector: 12.71secs
  SparseVector: 2.73secs

  b) `k-means`
  Before
  DenseVector: 83.46secs
  SparseVector: 61.60secs
  With this PR
  DenseVector: 70.04secs
  SparseVector: 59.05secs

  Author: DB Tsai <dbtsai@alpinenow.com>

  Closes #3462 from dbtsai/norm and squashes the following commits:

  63c7165 [DB Tsai] typo
  0c3637f [DB Tsai] add import org.apache.spark.SparkContext._ back
  6fa616c [DB Tsai] address feedback
  9b7cb56 [DB Tsai] move norm to static method
  0b632e6 [DB Tsai] kmeans
  dbed124 [DB Tsai] style
  c1a877c [DB Tsai] first commit

  (cherry picked from commit 64f3175bf976f5a28e691cedc7a4b333709e0c58)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
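As a rough picture of what a direct p-norm looks like, here is a sketch in Python. It is not the MLlib code from this PR. The function name `norm` and its signature are assumptions. The point it illustrates is that the norm only needs one pass over the stored values, so for a sparse vector the implicit zeros can be skipped because they contribute nothing.

```python
import math


def norm(values, p):
    """p-norm of a vector given its stored values (p >= 1).

    For a sparse vector, pass only the non-zero values.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if p == 1:
        return sum(abs(v) for v in values)
    if p == 2:
        return math.sqrt(sum(v * v for v in values))
    if p == math.inf:
        return max((abs(v) for v in values), default=0.0)
    # General case: (sum |v|^p)^(1/p)
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


l1 = norm([3.0, -4.0], 1)
l2 = norm([3.0, -4.0], 2)
l_inf = norm([3.0, -4.0], math.inf)
```

The special cases for p = 1, 2 and infinity avoid the general power computation, which is a common choice when those norms dominate in practice.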
* [SPARK-4529] [SQL] support view with column aliasDaoyuan Wang2014-12-012-3/+3
| | | | | | | | | | | | | | | | | | | Support view definition like CREATE VIEW view3(valoo) TBLPROPERTIES ("fear" = "factor") AS SELECT upper(value) FROM src WHERE key=86; [valoo as the alias of upper(value)]. This is missing part of SPARK-4239, for a fully view support. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3396 from adrian-wang/viewcolumn and squashes the following commits: 4d001d0 [Daoyuan Wang] support view with column alias (cherry picked from commit 4df60a8cbc58f2877787245c2a83b2de85579c82) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SQL][DOC] Date type in SQL programming guideDaoyuan Wang2014-12-011-0/+23
| | | | | | | | | | | Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3535 from adrian-wang/datedoc and squashes the following commits: 18ff1ed [Daoyuan Wang] [DOC] Date type (cherry picked from commit 5edbcbfb61703398a24ce5162a74aba04e365b0c) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SQL] Minor fix for doc and commentwangfei2014-12-013-5/+7
| | | | | | | | | | | Author: wangfei <wangfei1@huawei.com> Closes #3533 from scwf/sql-doc1 and squashes the following commits: 962910b [wangfei] doc and comment fix (cherry picked from commit 7b79957879db4dfcc7c3601cb40ac4fd576259a5) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4658][SQL] Code documentation issue in DDL of datasource APIravipesala2014-12-012-3/+3
| | | | | | | | | | | | Author: ravipesala <ravindra.pesala@huawei.com> Closes #3516 from ravipesala/ddl_doc and squashes the following commits: d101fdf [ravipesala] Style issues fixed d2238cd [ravipesala] Corrected documentation (cherry picked from commit bc353819cc86c3b0ad75caf81b47744bfc2aeeb3) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4650][SQL] Supporting multi column support in countDistinct function ↵ravipesala2014-12-012-1/+9
| | | | | | | | | | | | | | | | | like count(distinct c1,c2..) in Spark SQL Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL Author: ravipesala <ravindra.pesala@huawei.com> Author: Michael Armbrust <michael@databricks.com> Closes #3511 from ravipesala/countdistinct and squashes the following commits: cc4dbb1 [ravipesala] style 070e12a [ravipesala] Supporting multi column support in count(distinct c1,c2..) in Spark SQL (cherry picked from commit 6a9ff19dc06745144d5b311d4f87073c81d53a8f) Signed-off-by: Michael Armbrust <michael@databricks.com>
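What `count(distinct c1, c2)` computes can be sketched as counting distinct tuples. The Python below is an illustration only, with a hypothetical `count_distinct` helper and rows modelled as dictionaries. Skipping rows where any counted column is `NULL` (`None` here) is an assumption modelled on how several SQL engines treat this case, not something the commit message states.

```python
def count_distinct(rows, columns):
    """Count distinct combinations of the given columns across rows."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        # Assumption: a NULL in any counted column excludes the row.
        if any(v is None for v in key):
            continue
        seen.add(key)
    return len(seen)


rows = [
    {"c1": 1, "c2": "a"},
    {"c1": 1, "c2": "a"},   # duplicate of the first combination
    {"c1": 1, "c2": "b"},
    {"c1": 2, "c2": None},  # skipped under the NULL assumption
]
distinct_pairs = count_distinct(rows, ["c1", "c2"])
```

With these rows, `distinct_pairs` is 2: the combinations `(1, "a")` and `(1, "b")`.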
* [SPARK-4358][SQL] Let BigDecimal do checking type compatibilityLiang-Chi Hsieh2014-12-011-8/+3
| | | | | | | | | | | | | | | | | Remove hardcoding max and min values for types. Let BigDecimal do checking type compatibility. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3208 from viirya/more_numericLit and squashes the following commits: e9834b4 [Liang-Chi Hsieh] Remove byte and short types for number literal. 1bd1825 [Liang-Chi Hsieh] Fix Indentation and make the modification clearer. cf1a997 [Liang-Chi Hsieh] Modified for comment to add a rule of analysis that adds a cast. 91fe489 [Liang-Chi Hsieh] add Byte and Short. 1bdc69d [Liang-Chi Hsieh] Let BigDecimal do checking type compatibility. (cherry picked from commit b57365a1ec89e31470f424ff37d5ebc7c90a39d8) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SQL] add @group tab in limit() and count() (Jacky Li, 2014-12-01, 1 file, -0/+4)
    group tab is missing for scaladoc

    Author: Jacky Li <jacky.likun@gmail.com>

    Closes #3458 from jackylk/patch-7 and squashes the following commits:

    0121a70 [Jacky Li] add @group tab in limit() and count()

    (cherry picked from commit bafee67ebad01f7aea2cd393a70b57eb8345eeb0)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4258][SQL][DOC] Documents spark.sql.parquet.filterPushdown (Cheng Lian, 2014-12-01, 1 file, -6/+16)
    Documents `spark.sql.parquet.filterPushdown`, explains why it's turned off by default and when it's safe to be turned on.

    Author: Cheng Lian <lian@databricks.com>

    Closes #3440 from liancheng/parquet-filter-pushdown-doc and squashes the following commits:

    2104311 [Cheng Lian] Documents spark.sql.parquet.filterPushdown

    (cherry picked from commit 5db8dcaf494e0dffed4fc22f19b0334d95ab6bfb)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* Documentation: add description for repartitionAndSortWithinPartitions (Madhu Siddalingaiah, 2014-12-01, 1 file, -0/+6)
    Author: Madhu Siddalingaiah <madhu@madhu.com>

    Closes #3390 from msiddalingaiah/master and squashes the following commits:

    cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again)
    332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code>
    cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
    0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions

    (cherry picked from commit 2b233f5fc4beb2c6ed4bc142e923e96f8bad3ec4)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [SPARK-4661][Core] Minor code and docs cleanup (zsxwing, 2014-12-01, 3 files, -3/+2)
    Author: zsxwing <zsxwing@gmail.com>

    Closes #3521 from zsxwing/SPARK-4661 and squashes the following commits:

    03cbe3f [zsxwing] Minor code and docs cleanup

    (cherry picked from commit 30a86acdefd5428af6d6264f59a037e0eefd74b4)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* SPARK-2192 [BUILD] Examples Data Not in Binary Distribution (Sean Owen, 2014-12-01, 1 file, -0/+3)
    Simply, add data/ to distributions. This adds about 291KB (compressed) to the tarball, FYI.

    Author: Sean Owen <sowen@cloudera.com>

    Closes #3480 from srowen/SPARK-2192 and squashes the following commits:

    47688f1 [Sean Owen] Add data/ to distributions

    (cherry picked from commit 6384f42ab2e5c2b3e767ab4a428cda20a8ddcbe1)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [DOC] Fixes formatting typo in SQL programming guide (Cheng Lian, 2014-11-30, 1 file, -2/+0)
    Author: Cheng Lian <lian@databricks.com>

    Closes #3498 from liancheng/fix-sql-doc-typo and squashes the following commits:

    865ecd7 [Cheng Lian] Fixes formatting typo in SQL programming guide

    (cherry picked from commit 2a4d389f70b2066b1ac32b081bef44e61fefb03c)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [SPARK-4656][Doc] Typo in Programming Guide markdown (lewuathe, 2014-11-30, 1 file, -1/+1)
    Grammatical error in Programming Guide document

    Author: lewuathe <lewuathe@me.com>

    Closes #3412 from Lewuathe/typo-programming-guide and squashes the following commits:

    a3e2f00 [lewuathe] Typo in Programming Guide markdown

    (cherry picked from commit a217ec5fd5cd7addc69e538d6ec6dd64956cc8ed)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* SPARK-2143 [WEB UI] Add Spark version to UI footer (Sean Owen, 2014-11-30, 1 file, -0/+10)
    This PR adds the Spark version number to the UI footer; this is how it looks:

    ![screen shot 2014-11-21 at 22 58 40](https://cloud.githubusercontent.com/assets/822522/5157738/f4822094-7316-11e4-98f1-333a535fdcfa.png)

    Author: Sean Owen <sowen@cloudera.com>

    Closes #3410 from srowen/SPARK-2143 and squashes the following commits:

    e9b3a7a [Sean Owen] Add Spark version to footer
* [DOCS][BUILD] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'. (Takuya UESHIN, 2014-11-30, 1 file, -0/+1)
    To build with Scala 2.11, we have to execute `change-version-to-2.11.sh` before Maven execute, otherwise inter-module dependencies are broken.

    Author: Takuya UESHIN <ueshin@happy-camper.st>

    Closes #3361 from ueshin/docs/building-spark_2.11 and squashes the following commits:

    1d29126 [Takuya UESHIN] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.

    (cherry picked from commit 0fcd24cc542040ff3555290eec7b021062e7e6ac)
    Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-4597] Use proper exception and reset variable in Utils.createTempDir() (Liang-Chi Hsieh, 2014-11-28, 1 file, -1/+1)
    `File.exists()` and `File.mkdirs()` only throw `SecurityException` instead of `IOException`. Then, when an exception is thrown, `dir` should be reset too.

    Author: Liang-Chi Hsieh <viirya@gmail.com>

    Closes #3449 from viirya/fix_createtempdir and squashes the following commits:

    36cacbd [Liang-Chi Hsieh] Use proper exception and reset variable.

    (cherry picked from commit 49fe8797e64f10c574e0790b32a8c3fdc7e594a0)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
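The pattern this message describes, catching `SecurityException` around `File.exists()`/`File.mkdirs()` and resetting `dir` so the loop retries, could be sketched as below. This is an illustration rather than the actual `Utils.createTempDir()` source: the class name, the `maxAttempts` parameter, the `spark-` prefix, and the unchecked exception used on exhaustion are all assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.UUID;

// Hypothetical helper, for illustration only; not the Spark implementation.
class TempDirSketch {
    // Tries up to maxAttempts times to create a fresh directory under root.
    static File createTempDir(String root, int maxAttempts) {
        File dir = null;
        int attempts = 0;
        while (dir == null) {
            attempts++;
            if (attempts > maxAttempts) {
                throw new UncheckedIOException(new IOException(
                    "Failed to create a temp directory under " + root
                        + " after " + maxAttempts + " attempts"));
            }
            try {
                dir = new File(root, "spark-" + UUID.randomUUID());
                if (dir.exists() || !dir.mkdirs()) {
                    dir = null; // name taken or creation failed: retry
                }
            } catch (SecurityException e) {
                // exists() and mkdirs() report denied access with
                // SecurityException, not IOException. Reset dir so the
                // loop does not exit with a directory that was never created.
                dir = null;
            }
        }
        return dir;
    }
}
```

A call such as `TempDirSketch.createTempDir(System.getProperty("java.io.tmpdir"), 10)` returns the newly created directory; without the reset in the `catch` block, the loop condition would see a non-null `dir` and return a path that does not exist.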
* HOTFIX: Rolling back incorrect version change (Patrick Wendell, 2014-11-28, 1 file, -1/+1)