Commit message | Author | Age | Files | Lines
* [SPARK-9899] [SQL] Disables customized output committer when speculation is on | Cheng Lian | 2015-08-19 | 2 | -1/+49

  Speculation hates direct output committer, as there are multiple corner cases that may cause data corruption and/or data loss.

  Please see this [PR comment] [1] for more details.

  [1]: https://github.com/apache/spark/pull/8191#issuecomment-131598385

  Author: Cheng Lian <lian@databricks.com>
  Closes #8317 from liancheng/spark-9899/speculation-hates-direct-output-committer.
  (cherry picked from commit f3ff4c41d2e32bd0f2419d1c9c68fcd0c2593e41)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
  Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala
* [SPARK-10090] [SQL] fix decimal scale of division | Davies Liu | 2015-08-19 | 6 | -31/+157

  We should round the result of decimal multiplication/division to the expected precision/scale, and also check for overflow.

  Author: Davies Liu <davies@databricks.com>
  Closes #8287 from davies/decimal_division.
  (cherry picked from commit 1f4c4fe6dfd8cc52b5fddfd67a31a77edbb1a036)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
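The rounding and overflow check described in SPARK-10090 can be sketched with plain `java.math.BigDecimal`. This is only an illustration, not the code from the patch: the object name `DecimalDivisionSketch`, the `divide` helper, and the choice of `HALF_UP` rounding are assumptions made for the example.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Hypothetical helper, not Spark's Decimal implementation.
object DecimalDivisionSketch {
  /** Divides `a` by `b`, rounds the quotient to `scale` fractional digits,
    * and returns None when the divisor is zero or the rounded value needs
    * more than `precision` total digits (overflow). */
  def divide(a: JBigDecimal, b: JBigDecimal, precision: Int, scale: Int): Option[JBigDecimal] = {
    if (b.signum() == 0) {
      None
    } else {
      // Assumption: HALF_UP rounding; the commit message does not name a mode.
      val rounded = a.divide(b, scale, RoundingMode.HALF_UP)
      if (rounded.precision() > precision) None else Some(rounded)
    }
  }
}
```

A caller would pass the target type's precision and scale, for example `DecimalDivisionSketch.divide(new JBigDecimal("2"), new JBigDecimal("3"), 10, 2)`, and treat `None` as an overflow.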
* [SPARK-9627] [SQL] Stops using Scala runtime reflection in DictionaryEncoding | Cheng Lian | 2015-08-19 | 2 | -12/+4

  `DictionaryEncoding` uses Scala runtime reflection to avoid boxing costs while building the directory array. However, this code path may hit [SI-6240] [1] and throw an exception.

  [1]: https://issues.scala-lang.org/browse/SI-6240

  Author: Cheng Lian <lian@databricks.com>
  Closes #8306 from liancheng/spark-9627/in-memory-cache-scala-reflection.
  (cherry picked from commit 21bdbe9fe69be47be562de24216a469e5ee64c7b)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-10073] [SQL] Python withColumn should replace the old column | Davies Liu | 2015-08-19 | 3 | -7/+12

  DataFrame.withColumn in Python should be consistent with the Scala one (replacing the existing column that has the same name).

  cc marmbrus

  Author: Davies Liu <davies@databricks.com>
  Closes #8300 from davies/with_column.
  (cherry picked from commit 08887369c890e0dd87eb8b34e8c32bb03307bf24)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-10087] [CORE] [BRANCH-1.5] Disable spark.shuffle.reduceLocality.enabled by default. | Yin Huai | 2015-08-19 | 2 | -3/+3

  https://issues.apache.org/jira/browse/SPARK-10087

  In some cases, when spark.shuffle.reduceLocality.enabled is enabled, we are scheduling all reducers to the same executor (the cluster has plenty of resources). Changing spark.shuffle.reduceLocality.enabled to false resolves the problem.

  The comments on https://github.com/apache/spark/pull/8280 provide more details on the symptoms of this issue.

  This PR changes the default setting of `spark.shuffle.reduceLocality.enabled` to `false` for branch 1.5.

  Author: Yin Huai <yhuai@databricks.com>
  Closes #8296 from yhuai/setNumPartitionsCorrectly-branch1.5.
* [SPARK-10107] [SQL] fix NPE in format_number | Davies Liu | 2015-08-19 | 2 | -3/+3

  Author: Davies Liu <davies@databricks.com>
  Closes #8305 from davies/format_number.
  (cherry picked from commit e05da5cb5ea253e6372f648fc8203204f2a8df8d)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering | Xiangrui Meng | 2015-08-19 | 9 | -52/+338

  This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659).

  MechCoder

  Closes #8256

  Author: Xiangrui Meng <meng@databricks.com>
  Author: Xiaoqing Wang <spark445@126.com>
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #8288 from mengxr/SPARK-8918.
  (cherry picked from commit 5b62bef8cbf73f910513ef3b1f557aa94b384854)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-10106] [SPARKR] Add `ifelse` Column function to SparkR | Yu ISHIKAWA | 2015-08-19 | 3 | -1/+22

  ### JIRA
  [[SPARK-10106] Add `ifelse` Column function to SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10106)

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8303 from yu-iskw/SPARK-10106.
  (cherry picked from commit d898c33f774b9a3db2fb6aa8f0cb2c2ac6004b58)
  Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
* [SPARK-10097] Adds `shouldMaximize` flag to `ml.evaluation.Evaluator` | Feynman Liang | 2015-08-19 | 10 | -22/+52

  Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user.

  This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized.

  CC jkbradley

  Author: Feynman Liang <fliang@databricks.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8290 from feynmanliang/SPARK-10097.
  (cherry picked from commit 28a98464ea65aa7b35e24fca5ddaa60c2e5d53ee)
  Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
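The effect of an `isLargerBetter` attribute, as described in SPARK-10097, can be sketched without any Spark dependency. The trait and objects below are hypothetical stand-ins rather than the real `ml.evaluation.Evaluator` API; only the attribute name `isLargerBetter` comes from the commit message.

```scala
// Hypothetical miniature of an evaluator hierarchy, for illustration only.
object EvaluatorSketch {
  trait Evaluator {
    /** True when a larger metric value means a better model. */
    def isLargerBetter: Boolean = true
  }

  // An accuracy-like metric: larger is better (the default).
  object AccuracyLike extends Evaluator

  // An error-like metric such as RMSE: smaller is better.
  object ErrorLike extends Evaluator {
    override def isLargerBetter: Boolean = false
  }

  /** Picks the index of the best metric without negating any values. */
  def bestIndex(metrics: Seq[Double], evaluator: Evaluator): Int =
    if (evaluator.isLargerBetter) metrics.indexOf(metrics.max)
    else metrics.indexOf(metrics.min)
}
```

With a flag like this, a model selector can report error metrics with their true sign instead of negating them to fit a maximize-only loop.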
* [SPARK-9856] [SPARKR] Add expression functions into SparkR whose params are complicated | Yu ISHIKAWA | 2015-08-19 | 5 | -6/+649

  I added lots of Column functions into SparkR. I also added `rand(seed: Int)` and `randn(seed: Int)` in Scala, since we need such APIs for the R integer type.

  ### JIRA
  [[SPARK-9856] Add expression functions into SparkR whose params are complicated - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9856)

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8264 from yu-iskw/SPARK-9856-3.
  (cherry picked from commit 2fcb9cb9552dac1d78dcca5d4d5032b4fa6c985c)
  Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
* [SPARK-10084] [MLLIB] [DOC] Add Python example for mllib FP-growth user guide | Yanbo Liang | 2015-08-19 | 1 | -23/+50

  1. Add Python example for mllib FP-growth user guide.
  2. Correct mistakes in the Scala and Java examples.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8279 from yanboliang/spark-10084.
  (cherry picked from commit 802b5b8791fc2c892810981b2479a04175aa3dcd)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-10060] [ML] [DOC] spark.ml DecisionTree user guide | Joseph K. Bradley | 2015-08-19 | 5 | -13/+519

  New user guide section ml-decision-tree.md, including code examples.

  I have run all examples, including the Java ones.

  CC: manishamde yanboliang mengxr

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8244 from jkbradley/ml-dt-docs.
  (cherry picked from commit 39e4ebd521defdb68a0787bcd3bde6bc855f5198)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-8949] Print warnings when using preferred locations feature | Han JU | 2015-08-19 | 1 | -0/+5

  Add warnings according to SPARK-8949 in `SparkContext`
  - warnings in scaladoc
  - log warnings when preferred locations feature is used through `SparkContext`'s constructor

  However, I didn't find any documentation reference for this feature. Please direct me if you know of any reference to it.

  Author: Han JU <ju.han.felix@gmail.com>
  Closes #7874 from darkjh/SPARK-8949.
  (cherry picked from commit 3d16a545007922ee6fa36e5f5c3959406cb46484)
  Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-9977] [DOCS] Update documentation for StringIndexer | lewuathe | 2015-08-19 | 1 | -1/+5

  By using `StringIndexer`, we can obtain the indexed label in a new column. A following estimator should therefore use this new column through the pipeline if it wants to use the string-indexed label. I think it is better to make this explicit in the documentation.

  Author: lewuathe <lewuathe@me.com>
  Closes #8205 from Lewuathe/SPARK-9977.
  (cherry picked from commit ba2a07e2b6c5a39597b64041cd5bf342ef9631f5)
  Signed-off-by: Sean Owen <sowen@cloudera.com>
* [DOCS] [SQL] [PYSPARK] Fix typo in ntile function | Moussa Taifi | 2015-08-19 | 1 | -1/+1

  Fix typo in ntile function.

  Author: Moussa Taifi <moutai10@gmail.com>
  Closes #8261 from moutai/patch-2.
  (cherry picked from commit 865a3df3d578c0442c97d749c81f554b560da406)
  Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-10070] [DOCS] Remove Guava dependencies in user guides | Sean Owen | 2015-08-19 | 2 | -35/+38

  `Lists.newArrayList` -> `Arrays.asList`

  CC jkbradley feynmanliang

  Anybody into replacing usages of `Lists.newArrayList` in the examples / source code too? This method isn't useful in Java 7 and beyond.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #8272 from srowen/SPARK-10070.
  (cherry picked from commit f141efeafb42b14b5fcfd9aa8c5275162042349f)
  Signed-off-by: Sean Owen <sowen@cloudera.com>
* Fix Broken Link | Bill Chambers | 2015-08-19 | 1 | -1/+1

  Link was broken because it included tick marks.

  Author: Bill Chambers <wchambers@ischool.berkeley.edu>
  Closes #8302 from anabranch/patch-1.
  (cherry picked from commit b23c4d3ffc36e47c057360c611d8ab1a13877699)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-9967] [SPARK-10099] [STREAMING] Renamed conf spark.streaming.backpressure.{enable-->enabled} and fixed deprecated annotations | Tathagata Das | 2015-08-18 | 4 | -8/+8

  Small changes
  - Renamed conf spark.streaming.backpressure.{enable --> enabled}
  - Change Java Deprecated annotations to Scala deprecated annotation with more information.

  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #8299 from tdas/SPARK-9967.
  (cherry picked from commit bc9a0e03235865d2ec33372f6400dec8c770778a)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs | Josh Rosen | 2015-08-18 | 4 | -16/+18

  In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine):

  ```scala
  val numItems = 100000
  val s = Seq.fill(numItems)(1)
  for (i <- 0 until numItems) s(i)
  ```

  It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput.

  This patch fixes this by replacing `Seq` with `Array`.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #8178 from JoshRosen/dagscheduler-perf.
  (cherry picked from commit 010b03ed52f35fd4d426d522f8a9927ddc579209)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
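The fix described in SPARK-9952 amounts to indexing into an `Array` instead of a `Seq` that may be a linked list. The sketch below is not the DAGScheduler code; `CacheLocsSketch` and its helpers are hypothetical names, and the per-partition values are placeholders.

```scala
// Hypothetical stand-in for a per-partition lookup table.
object CacheLocsSketch {
  def locsAsSeq(numPartitions: Int): Seq[Int] = Seq.fill(numPartitions)(1)
  def locsAsArray(numPartitions: Int): Array[Int] = Array.fill(numPartitions)(1)

  /** Indexed loop over a Seq: each locs(i) walks the list when the Seq is a
    * List, so the whole loop is quadratic in the number of partitions. */
  def sumSeq(locs: Seq[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < locs.length) { total += locs(i); i += 1 }
    total
  }

  /** Same loop over an Array: each locs(i) is a constant-time access. */
  def sumArray(locs: Array[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < locs.length) { total += locs(i); i += 1 }
    total
  }
}
```

Both functions return the same result; only the cost of each `locs(i)` differs, which is why the patch changes the collection type rather than the loop.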
* [SPARK-9508] GraphX Pregel docs update with new Pregel code | Alexander Ulanov | 2015-08-18 | 1 | -10/+8

  SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be modified accordingly since it lists the old Pregel code

  Author: Alexander Ulanov <nashb@yandex.ru>
  Closes #7831 from avulanov/SPARK-9508-pregel-doc2.
  (cherry picked from commit 1c843e284818004f16c3f1101c33b510f80722e3)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-9705] [DOC] fix docs about Python version | Davies Liu | 2015-08-18 | 2 | -3/+15

  cc JoshRosen

  Author: Davies Liu <davies@databricks.com>
  Closes #8245 from davies/python_doc.
  (cherry picked from commit de3223872a217c5224ba7136604f6b7753b29108)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors & fix UDFs on complex types | Reynold Xin | 2015-08-18 | 4 | -7/+68

  This is kind of a weird case, but given a sufficiently complex query plan (in this case a TungstenProject with an Exchange underneath), we could have NPEs on the executors due to the time when we were calling transformAllExpressions.

  In general we should ensure that all transformations occur on the driver and not on the executors. Some reasons to avoid executor-side transformations include:

  * (this case) Some operator constructors require state such as access to the Spark/SQL conf, so doing a makeCopy on the executor can fail.
  * (unrelated reason to avoid executor transformations) ExprIds are calculated using an atomic integer, so you can violate their uniqueness constraint by constructing them anywhere other than the driver.

  This subsumes #8285.

  Author: Reynold Xin <rxin@databricks.com>
  Author: Michael Armbrust <michael@databricks.com>
  Closes #8295 from rxin/SPARK-10096.
  (cherry picked from commit 1ff0580eda90f9247a5233809667f5cebaea290e)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-10095] [SQL] use public API of BigInteger | Davies Liu | 2015-08-18 | 3 | -45/+11

  In UnsafeRow, we use the private field of BigInteger for better performance, but it actually didn't contribute much (3% in one benchmark) to end-to-end runtime, and it makes the code non-portable (it may fail on other JVM implementations). So we should use the public API instead.

  cc rxin

  Author: Davies Liu <davies@databricks.com>
  Closes #8286 from davies/portable_decimal.
  (cherry picked from commit 270ee677750a1f2adaf24b5816857194e61782ff)
  Signed-off-by: Davies Liu <davies.liu@gmail.com>
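As a rough illustration of what "use the public API" can mean for SPARK-10095, a `BigInteger` can be serialized and restored with only documented JDK methods. The commit message does not say which methods the patch uses, so `toByteArray` and the byte-array constructor here are assumptions, and `PortableDecimalSketch` is a hypothetical name.

```scala
import java.math.BigInteger

// Hypothetical sketch: round-trips a BigInteger through its public,
// two's-complement big-endian byte representation instead of reading
// a private field via reflection or Unsafe.
object PortableDecimalSketch {
  def toBytes(value: BigInteger): Array[Byte] = value.toByteArray

  def fromBytes(bytes: Array[Byte]): BigInteger = new BigInteger(bytes)
}
```

Because both calls are part of the documented `java.math.BigInteger` contract, this approach does not depend on the internal layout of any particular JVM.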
* [SPARK-10075] [SPARKR] Add `when` expression function in SparkR | Yu ISHIKAWA | 2015-08-18 | 5 | -0/+45

  - Add `when` and `otherwise` as `Column` methods
  - Add `When` as an expression function
  - Add `%otherwise%` infix as an alias of `otherwise`

  Since R doesn't support a feature like method chaining, `otherwise(when(condition, value), value)` style is a little annoying for me. If `%otherwise%` looks strange for shivaram, I can remove it. What do you think?

  ### JIRA
  [[SPARK-10075] Add `when` expression function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075)

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8266 from yu-iskw/SPARK-10075.
  (cherry picked from commit bf32c1f7f47dd907d787469f979c5859e02ce5e6)
  Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
* [SPARK-9939] [SQL] Resorts to Java process API in CliSuite, HiveSparkSubmitSuite and HiveThriftServer2 test suites | Cheng Lian | 2015-08-19 | 5 | -91/+149

  Scala process API has a known bug ([SI-8768] [1]), which may be the reason why several test suites which fork sub-processes are flaky.

  This PR replaces Scala process API with Java process API in `CliSuite`, `HiveSparkSubmitSuite`, and `HiveThriftServer2` related test suites to see whether it fixes these flaky tests.

  [1]: https://issues.scala-lang.org/browse/SI-8768

  Author: Cheng Lian <lian@databricks.com>
  Closes #8168 from liancheng/spark-9939/use-java-process-api.
  (cherry picked from commit a5b5b936596ceb45f5f5b68bf1d6368534fb9470)
  Signed-off-by: Cheng Lian <lian@databricks.com>
* [SPARK-10102] [STREAMING] Fix a race condition that startReceiver may happen before setting trackerState to Started | zsxwing | 2015-08-18 | 1 | -3/+8

  Test failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3305/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/

  There is a race condition in which setting `trackerState` to `Started` can happen after calling `startReceiver`. In that case `startReceiver` won't start the receivers, because it uses `! isTrackerStarted` to check whether ReceiverTracker is stopping or stopped. But actually, `trackerState` is `Initialized` and will be changed to `Started` soon. Therefore, we should use `isTrackerStopping || isTrackerStopped`.

  Author: zsxwing <zsxwing@gmail.com>
  Closes #8294 from zsxwing/SPARK-9504.
  (cherry picked from commit 90273eff9604439a5a5853077e232d34555c67d7)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
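The difference between the two checks in SPARK-10102 shows up only in the `Initialized` state. The sketch below models the states with a plain enumeration; `TrackerStateSketch` and its two predicates are hypothetical and only mirror the conditions quoted in the commit message, not the real `ReceiverTracker`.

```scala
// Hypothetical model of the tracker lifecycle, for illustration only.
object TrackerStateSketch {
  object TrackerState extends Enumeration {
    val Initialized, Started, Stopping, Stopped = Value
  }
  import TrackerState._

  /** Old check (`! isTrackerStarted`): also skips receivers while the
    * tracker is still Initialized, which is the race described above. */
  def oldShouldSkip(state: TrackerState.Value): Boolean = state != Started

  /** New check (`isTrackerStopping || isTrackerStopped`): skips receivers
    * only when the tracker is actually shutting down or shut down. */
  def newShouldSkip(state: TrackerState.Value): Boolean =
    state == Stopping || state == Stopped
}
```

For `Started`, `Stopping`, and `Stopped` the two predicates agree; they differ only for `Initialized`, where the old check wrongly refuses to start the receivers.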
* [SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of generated blocks fills up to capacity | Tathagata Das | 2015-08-18 | 1 | -10/+19

  Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls stuff from the ArrayBlockingQueue and pushes it into BlockManager. Now if that queue fills up to capacity (default is 10 blocks), then the insertion into the queue (done in the function updateCurrentBuffer) gets blocked inside a synchronized block. However, the thread that is pulling blocks from the queue uses the same lock to check the current state (active or stopped) while pulling from the queue. Since the block-generating thread is blocked (as the queue is full) while holding the lock, the thread that is supposed to drain the queue gets blocked. Ergo, deadlock.

  Solution: Moved blocking call to ArrayBlockingQueue outside the synchronized to prevent deadlock.

  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #8257 from tdas/SPARK-10072.
  (cherry picked from commit 1aeae05bb20f01ab7ccaa62fe905a63e020074b5)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
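The pattern behind the SPARK-10072 fix is to swap the buffer while holding the lock and to make the blocking `put` call only after the lock is released. The class below is a simplified sketch under that assumption; `BlockGeneratorSketch`, its fields, and its method bodies are hypothetical and much smaller than the real `BlockGenerator`.

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified model of a block generator.
class BlockGeneratorSketch(capacity: Int) {
  private val queue = new ArrayBlockingQueue[Seq[Int]](capacity)
  private var currentBuffer = new ArrayBuffer[Int]
  private var active = true

  /** Appends a record to the current buffer under the lock. */
  def add(record: Int): Unit = synchronized {
    if (active) currentBuffer += record
  }

  /** Swaps out the current buffer under the lock, then enqueues the block
    * outside the lock so a full queue cannot block other lock users. */
  def updateCurrentBuffer(): Unit = {
    val block: Option[Seq[Int]] = synchronized {
      if (currentBuffer.nonEmpty) {
        val snapshot = currentBuffer.toList
        currentBuffer = new ArrayBuffer[Int]
        Some(snapshot)
      } else {
        None
      }
    }
    // Blocking call happens here, after the lock has been released.
    block.foreach(b => queue.put(b))
  }

  /** Draining side: reads the state under the lock, polls outside it. */
  def pollBlock(): Option[Seq[Int]] = {
    val isActive = synchronized { active }
    if (isActive) Option(queue.poll()) else None
  }
}
```

If `queue.put` were called inside the `synchronized` block, a full queue would hold the lock indefinitely and `pollBlock` could never acquire it to drain the queue, which is the deadlock the commit describes.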
* [SPARKR] [MINOR] Get rid of a long line warning | Yu ISHIKAWA | 2015-08-18 | 1 | -1/+3

  ```
  R/functions.R:74:1: style: lines should not be more than 100 characters.
  jc <- callJStatic("org.apache.spark.sql.functions", "lit", ifelse(class(x) == "Column", xjc, x))
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ```

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8297 from yu-iskw/minor-lint-r.
  (cherry picked from commit b4b35f133aecaf84f04e8e444b660a33c6b7894a)
  Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
* Bump SparkR version string to 1.5.0 | Hossein | 2015-08-18 | 1 | -1/+1

  This patch is against master, but we need to apply it to 1.5 branch as well.

  cc shivaram and rxin

  Author: Hossein <hossein@databricks.com>
  Closes #8291 from falaki/SparkRVersion1.5.
  (cherry picked from commit 04e0fea79b9acfa3a3cb81dbacb08f9d287b42c3)
  Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
* [SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCT | Feynman Liang | 2015-08-18 | 1 | -0/+71

  mengxr jkbradley

  Author: Feynman Liang <fliang@databricks.com>
  Closes #8184 from feynmanliang/SPARK-9889-DCT-docs.
  (cherry picked from commit badf7fa650f9801c70515907fcc26b58d7ec3143)
  Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in FailureSuite | Tathagata Das | 2015-08-18 | 1 | -10/+17

  Failures in streaming.FailureSuite can leak StreamingContext and SparkContext, which makes all subsequent tests fail.

  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #8289 from tdas/SPARK-10098.
  (cherry picked from commit 9108eff74a2815986fd067b273c2a344b6315405)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-10012] [ML] Missing test case for Params#arrayLengthGt | lewuathe | 2015-08-18 | 1 | -0/+3

  Currently there is no test case for `Params#arrayLengthGt`.

  Author: lewuathe <lewuathe@me.com>
  Closes #8223 from Lewuathe/SPARK-10012.
  (cherry picked from commit c635a16f64c939182196b46725ef2d00ed107cca)
  Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree | Bryan Cutler | 2015-08-18 | 24 | -1/+157

  Added since tags to mllib.tree

  Author: Bryan Cutler <bjcutler@us.ibm.com>
  Closes #7380 from BryanCutler/sinceTag-mllibTree-8924.
  (cherry picked from commit 1dbffba37a84c62202befd3911d25888f958191d)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-10088] [SQL] Add support for "stored as avro" in HiveQL parser. | Marcelo Vanzin | 2015-08-18 | 2 | -10/+13

  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #8282 from vanzin/SPARK-10088.
  (cherry picked from commit 492ac1facbc79ee251d45cff315598ec9935a0e2)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-10089] [SQL] Add missing golden files. | Marcelo Vanzin | 2015-08-18 | 2 | -0/+503

  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #8283 from vanzin/SPARK-10089.
  (cherry picked from commit fa41e0242f075843beff7dc600d1a6bac004bdc7)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation | Michael Armbrust | 2015-08-18 | 3 | -11/+22

  It turns out that inner classes of inner objects are referenced directly, and thus moving them will break binary compatibility.

  Author: Michael Armbrust <michael@databricks.com>
  Closes #8281 from marmbrus/binaryCompat.
  (cherry picked from commit 80cb25b228e821a80256546a2f03f73a45cf7645)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars | zsxwing | 2015-08-18 | 5 | -1/+249

  Removed contents already included in Spark assembly jar from spark-streaming-XXX-assembly jars.

  Author: zsxwing <zsxwing@gmail.com>
  Closes #8069 from zsxwing/SPARK-9574.
  (cherry picked from commit bf1d6614dcb8f5974e62e406d9c0f8aac52556d3)
  Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import | Piotr Migdal | 2015-08-18 | 1 | -2/+0

  See https://issues.apache.org/jira/browse/SPARK-10085

  Author: Piotr Migdal <pmigdal@gmail.com>
  Closes #8284 from stared/spark-10085.
  (cherry picked from commit 8bae9015b7e7b4528ca2bc5180771cb95d2aac13)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide | Yanbo Liang | 2015-08-18 | 1 | -0/+28

  Add Python example for mllib LDAModel user guide

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8227 from yanboliang/spark-10032.
  (cherry picked from commit 747c2ba8006d5b86f3be8dfa9ace639042a35628)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide | Yanbo Liang | 2015-08-18 | 1 | -0/+35

  Add Python examples for mllib IsotonicRegression user guide

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8225 from yanboliang/spark-10029.
  (cherry picked from commit f4fa61effe34dae2f0eab0bef57b2dee220cf92f)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-9900] [MLLIB] User guide for Association Rules | Feynman Liang | 2015-08-18 | 3 | -15/+118

  Updates FPM user guide to include Association Rules.

  Author: Feynman Liang <fliang@databricks.com>
  Closes #8207 from feynmanliang/SPARK-9900-arules.
  (cherry picked from commit f5ea3912900ccdf23e2eb419a342bfe3c0c0b61b)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel | Yuhao Yang | 2015-08-18 | 4 | -155/+402

  jira: https://issues.apache.org/jira/browse/SPARK-9028

  Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.

  I changed the meaning of minCount to be a filter across the corpus. This aligns with Word2Vec and the similar parameter in scikit-learn.

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #7388 from hhbyyh/cvEstimator.
  (cherry picked from commit 354f4582b637fa25d3892ec2b12869db50ed83c9)
  Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters functions | Yuu ISHIKAWA | 2015-08-18 | 1 | -3/+47

  ### JIRA
  [[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007)

  Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8277 from yu-iskw/SPARK-10007.
  (cherry picked from commit 1968276af0f681fe51328b7dd795bd21724a5441)
  Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
* [SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J | Cheng Lian | 2015-08-18 | 5 | -43/+47

  Parquet hard-codes a JUL logger which always writes to stdout. This PR redirects it via the SLF4J JUL bridge handler, so that we can control Parquet logs via `log4j.properties`.

  This solution is inspired by https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909.

  Author: Cheng Lian <lian@databricks.com>
  Closes #8196 from liancheng/spark-8118/redirect-parquet-jul.
  (cherry picked from commit 5723d26d7e677b89383de3fcf2c9a821b68a65b7)
  Signed-off-by: Cheng Lian <lian@databricks.com>
* [MINOR] fix the comments in IndexShuffleBlockResolver | CodingCat | 2015-08-18 | 1 | -1/+1

  It might be a typo introduced at the very beginning, or a leftover from some renaming.

  The method accessing the index file is now called `getBlockData` (not `getBlockLocation` as indicated in the comments).

  Author: CodingCat <zhunansjtu@gmail.com>
  Closes #8238 from CodingCat/minor_1.
  (cherry picked from commit c34e9ff0eac2032283b959fe63b47cc30f28d21c)
  Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public | Yanbo Liang | 2015-08-17 | 1 | -2/+2

  Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Users currently cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel`.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8263 from yanboliang/mlp-public.
  (cherry picked from commit dd0614fd618ad28cb77aecfbd49bb319b98fdba0)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary in ArrayData | Davies Liu | 2015-08-17 | 2 | -4/+29

  The type for an array of arrays in Java is slightly different from that of arrays of other element types.

  cc cloud-fan

  Author: Davies Liu <davies@databricks.com>
  Closes #8250 from davies/array_binary.
  (cherry picked from commit 5af3838d2e59ed83766f85634e26918baa53819f)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
* [MINOR] Format the comment of `translate` at `functions.scala` | Yu ISHIKAWA | 2015-08-17 | 1 | -8/+9

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8265 from yu-iskw/minor-translate-comment.
  (cherry picked from commit a0910315dae88b033e38a1de07f39ca21f6552ad)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-7808] [ML] add package doc for ml.feature | Xiangrui Meng | 2015-08-17 | 1 | -0/+89

  This PR adds a short description of the `ml.feature` package with a code example. The Java package doc will come in a separate PR.

  jkbradley

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #8260 from mengxr/SPARK-7808.
  (cherry picked from commit e290029a356222bddf4da1be0525a221a5a1630b)
  Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-10059] [YARN] Explicitly add JSP dependencies for tests. | Marcelo Vanzin | 2015-08-17 | 1 | -3/+19

  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #8251 from vanzin/SPARK-10059.
  (cherry picked from commit ee093c8b927e8d488aeb76115c7fb0de96af7720)
  Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>