* [SPARK-8962] Add Scalastyle rule to ban direct use of Class.forName; fix existing uses (Josh Rosen, 2015-07-14, 49 files changed, -84/+117)
  This pull request adds a Scalastyle regex rule which fails the style check if `Class.forName` is used directly. `Class.forName` always loads classes from the default / system classloader, but in a majority of cases we should be using Spark's own `Utils.classForName` instead, which tries to load classes from the current thread's context classloader and falls back to the classloader which loaded Spark when the context classloader is not defined. Author: Josh Rosen <joshrosen@databricks.com> Closes #7350 from JoshRosen/ban-Class.forName and squashes the following commits: e3e96f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName c0b7885 [Josh Rosen] Hopefully fix the last two cases d707ba7 [Josh Rosen] Fix uses of Class.forName that I missed in my first cleanup pass 046470d [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName 62882ee [Josh Rosen] Fix uses of Class.forName or add exclusion. d9abade [Josh Rosen] Add stylechecker rule to ban uses of Class.forName
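  A minimal sketch of the loading order described above (context classloader first, then the loader that loaded the current class), assuming a simplified standalone helper rather than Spark's actual `Utils.classForName`:

  ```scala
  object ClassLoading {
    def classForName(className: String): Class[_] = {
      // Prefer the current thread's context classloader; fall back to the
      // loader that loaded this class when no context loader is set.
      val loader = Option(Thread.currentThread().getContextClassLoader)
        .getOrElse(getClass.getClassLoader)
      Class.forName(className, /* initialize = */ true, loader)
    }
  }
  ```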
* [SPARK-4362] [MLLIB] Make prediction probability available in NaiveBayesModel (Sean Owen, 2015-07-14, 2 files changed, -18/+113)
  Add predictProbabilities to Naive Bayes, returning class probabilities. Continues https://github.com/apache/spark/pull/6761 Author: Sean Owen <sowen@cloudera.com> Closes #7376 from srowen/SPARK-4362 and squashes the following commits: 23d5a76 [Sean Owen] Fix model.labels -> model.theta 95d91fb [Sean Owen] Check that predicted probabilities sum to 1 b32d1c8 [Sean Owen] Add predictProbabilities to Naive Bayes, return class probabilities
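  The probabilities such a method returns can be modeled as normalized posteriors computed from log priors and log likelihoods. A hedged sketch; the array names and shapes are illustrative, not NaiveBayesModel's actual internals:

  ```scala
  def posteriorProbabilities(
      logPi: Array[Double],           // log P(class = k)
      logTheta: Array[Array[Double]], // log P(feature j | class = k)
      features: Array[Double]): Array[Double] = {
    // Unnormalized log posterior per class: log prior + log-likelihood dot product.
    val logPosterior = logPi.indices.map { k =>
      logPi(k) + features.indices.map(j => features(j) * logTheta(k)(j)).sum
    }
    // Normalize via log-sum-exp so the probabilities sum to 1 without underflow.
    val maxLog = logPosterior.max
    val unnormalized = logPosterior.map(lp => math.exp(lp - maxLog))
    val total = unnormalized.sum
    unnormalized.map(_ / total).toArray
  }
  ```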
* [SPARK-8800] [SQL] Fix inaccurate precision/scale of Decimal division operation (Liang-Chi Hsieh, 2015-07-14, 2 files changed, -4/+20)
  JIRA: https://issues.apache.org/jira/browse/SPARK-8800 Previously, we turned to Java BigDecimal's divide with a specified ROUNDING_MODE to avoid the non-terminating decimal expansion problem. However, as JihongMA reported, the division operation on some specific values produces inaccurate results. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #7212 from viirya/fix_decimal4 and squashes the following commits: 4205a0a [Liang-Chi Hsieh] Fix inaccurate precision/scale of Decimal division operation.
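  For context, a hedged illustration of the underlying mechanics with a hypothetical helper, not Spark's Decimal code: dividing at a fixed scale with an explicit rounding mode avoids the ArithmeticException a non-terminating quotient such as 1/3 would otherwise raise; picking the right result precision and scale is the part this fix adjusts.

  ```scala
  import java.math.{BigDecimal => JBigDecimal, RoundingMode}

  // Bounded, rounded quotient: never throws on non-terminating expansions.
  def divideAt(a: JBigDecimal, b: JBigDecimal, scale: Int): JBigDecimal =
    a.divide(b, scale, RoundingMode.HALF_UP)

  // divideAt(new JBigDecimal(1), new JBigDecimal(3), 6) yields 0.333333
  ```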
* [SPARK-4072] [CORE] Display Streaming blocks in Streaming UI (zsxwing, 2015-07-14, 13 files changed, -28/+684)
  Replaces #6634. This PR adds `SparkListenerBlockUpdated` to SparkListener so that it can monitor all block update infos that are sent to `BlockManagerMasterEndpoint`, and also adds new tables in the Storage tab to display the stream block infos. ![screen shot 2015-07-01 at 5 19 46 pm](https://cloud.githubusercontent.com/assets/1000778/8451562/c291a6ec-2016-11e5-890d-0afc174e1f8c.png) Author: zsxwing <zsxwing@gmail.com> Closes #6672 from zsxwing/SPARK-4072-2 and squashes the following commits: df2c1d8 [zsxwing] Use xml query to check the xml elements 54d54af [zsxwing] Add unit tests for StoragePage e29fb53 [zsxwing] Update as per TD's comments ccbee07 [zsxwing] Fix the code style 6dc42b4 [zsxwing] Fix the replication level of blocks 450fad1 [zsxwing] Merge branch 'master' into SPARK-4072-2 1e9ef52 [zsxwing] Don't categorize by Executor ID ca0ab69 [zsxwing] Fix the code style 3de2762 [zsxwing] Make object BlockUpdatedInfo private e95b594 [zsxwing] Add 'Aggregated Stream Block Metrics by Executor' table ba5d0d1 [zsxwing] Refactor the unit test to improve the readability 4bbe341 [zsxwing] Revert JsonProtocol and don't log SparkListenerBlockUpdated b464dd1 [zsxwing] Add onBlockUpdated to EventLoggingListener 5ba014c [zsxwing] Fix the code style 0b1e47b [zsxwing] Add a developer api BlockUpdatedInfo 04838a9 [zsxwing] Fix the code style 2baa161 [zsxwing] Add unit tests 80f6c6d [zsxwing] Address comments 797ee4b [zsxwing] Display Streaming blocks in Streaming UI
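  A hedged sketch of consuming the new event from user code; the field names follow the public listener API but may vary by Spark version:

  ```scala
  import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

  class BlockUpdateLogger extends SparkListener {
    override def onBlockUpdated(update: SparkListenerBlockUpdated): Unit = {
      val info = update.blockUpdatedInfo
      // Log each block update; a real consumer might aggregate per executor.
      println(s"Block ${info.blockId} updated on ${info.blockManagerId} " +
        s"at level ${info.storageLevel}")
    }
  }
  // Registered with: sc.addSparkListener(new BlockUpdateLogger())
  ```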
* [SPARK-8718] [GRAPHX] Improve EdgePartition2D for non-perfect-square number of partitions (Andrew Ray, 2015-07-14, 1 file changed, -11/+21)
  See https://github.com/aray/e2d/blob/master/EdgePartition2D.ipynb Author: Andrew Ray <ray.andrew@gmail.com> Closes #7104 from aray/edge-partition-2d-improvement and squashes the following commits: 3729f84 [Andrew Ray] correct bounds and remove unneeded comments 97f8464 [Andrew Ray] change less 5141ab4 [Andrew Ray] Merge branch 'master' into edge-partition-2d-improvement 925fd2c [Andrew Ray] use new interface for partitioning 001bfd0 [Andrew Ray] Refactor PartitionStrategy so that we can return a partition function for a given number of parts. To keep compatibility we define default methods that translate between the two implementation options. Made EdgePartition2D use the old strategy when we have a perfect square and implement the new interface. 5d42105 [Andrew Ray] % -> / 3560084 [Andrew Ray] Merge branch 'master' into edge-partition-2d-improvement f006364 [Andrew Ray] remove unneeded comments cfa2c5e [Andrew Ray] Modifications to EdgePartition2D so that it works for non-perfect squares.
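  The linked notebook works through the grid scheme. A hedged sketch of the idea (the mixing prime matches GraphX's constant; the rest is simplified): edges are placed in a ceil(sqrt(n)) x ceil(sqrt(n)) grid keyed by hashed source and destination vertex, and a final modulo folds unused trailing cells back into range when the partition count is not a perfect square.

  ```scala
  def edgePartition2D(src: Long, dst: Long, numParts: Int): Int = {
    val ceilSqrt = math.ceil(math.sqrt(numParts)).toInt
    val mixingPrime = 1125899906842597L // spreads skewed vertex ids
    val col = (math.abs(src * mixingPrime) % ceilSqrt).toInt
    val row = (math.abs(dst * mixingPrime) % ceilSqrt).toInt
    // Modulo keeps the result valid for non-perfect-square partition counts.
    (col * ceilSqrt + row) % numParts
  }
  ```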
* [SPARK-9031] Merge BlockObjectWriter and DiskBlockObjectWriter to remove abstract class (Josh Rosen, 2015-07-14, 16 files changed, -114/+90)
  BlockObjectWriter has only one concrete non-test class, DiskBlockObjectWriter. In order to simplify the code in preparation for other refactorings, I think that we should remove this base class and have only DiskBlockObjectWriter. While at one time we may have planned to have multiple BlockObjectWriter implementations, that doesn't seem to have happened, so the extra abstraction seems unnecessary. Author: Josh Rosen <joshrosen@databricks.com> Closes #7391 from JoshRosen/shuffle-write-interface-refactoring and squashes the following commits: c418e33 [Josh Rosen] Fix compilation 5047995 [Josh Rosen] Fix comments d5dc548 [Josh Rosen] Update references in comments 89dc797 [Josh Rosen] Rename test suite. 5755918 [Josh Rosen] Remove unnecessary val in case class 1607c91 [Josh Rosen] Merge BlockObjectWriter and DiskBlockObjectWriter
* [SPARK-8911] Fix local mode endless heartbeats (Andrew Or, 2015-07-14, 1 file changed, -7/+13)
  As of #7173 we expect executors to properly register with the driver before responding to their heartbeats. This behavior is not matched in local mode. This patch adds the missing event that needs to be posted. Author: Andrew Or <andrew@databricks.com> Closes #7382 from andrewor14/fix-local-heartbeat and squashes the following commits: 1258bdf [Andrew Or] Post ExecutorAdded event to local executor
* [SPARK-8933] [BUILD] Provide a --force flag to build/mvn that always uses downloaded maven (Brennon York, 2015-07-14, 1 file changed, -1/+10)
  Added a --force flag to manually download, if necessary, and use a built-in version of Maven best suited for Spark. Author: Brennon York <brennon.york@capitalone.com> Closes #7374 from brennonyork/SPARK-8933 and squashes the following commits: d673127 [Brennon York] added --force flag to manually download, if necessary, and use a built-in version of maven best for spark
* [SPARK-9027] [SQL] Generalize metastore predicate pushdown (Michael Armbrust, 2015-07-14, 2 files changed, -25/+107)
  Add support for pushing down metastore filters that are in different orders and add some unit tests. Author: Michael Armbrust <michael@databricks.com> Closes #7386 from marmbrus/metastoreFilters and squashes the following commits: 05a4524 [Michael Armbrust] [SPARK-9027][SQL] Generalize metastore predicate pushdown
* [SPARK-9029] [SQL] shortcut CaseKeyWhen if key is null (Wenchen Fan, 2015-07-14, 1 file changed, -24/+24)
  Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7389 from cloud-fan/case-when and squashes the following commits: ea4b6ba [Wenchen Fan] shortcut for case key when
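  The shortcut this change implements: in `CASE key WHEN ...`, a NULL key can never equal any branch key, so evaluation can jump straight to the ELSE branch. A hedged model using Option for SQL NULL, not Spark's Expression tree:

  ```scala
  def caseKeyWhen[K, V](key: Option[K],
                        branches: Seq[(K, V)],
                        otherwise: Option[V]): Option[V] =
    key match {
      case None => otherwise // NULL key: skip all WHEN branches
      case Some(k) =>
        branches.collectFirst { case (bk, v) if bk == k => v }.orElse(otherwise)
    }
  ```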
* [SPARK-6851] [SQL] function least/greatest follow-up (Daoyuan Wang, 2015-07-14, 3 files changed, -49/+62)
  This is a follow-up of the remaining comments from #6851. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #7387 from adrian-wang/udflgfollow and squashes the following commits: 6163e62 [Daoyuan Wang] add skipping null values e8c2e09 [Daoyuan Wang] use seq 8362966 [Daoyuan Wang] pr6851 follow up
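  One of the follow-up items is skipping null values. A hedged model of the resulting LEAST semantics with Option standing in for SQL NULL (not the Catalyst implementation): nulls are ignored, and the result is null only when every argument is null.

  ```scala
  def least(values: Seq[Option[Double]]): Option[Double] = {
    val present = values.flatten // drop the nulls
    if (present.isEmpty) None else Some(present.min)
  }
  // least(Seq(None, Some(3.0), Some(1.0))) == Some(1.0)
  // least(Seq(None, None)) == None
  ```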
* [SPARK-9010] [DOCUMENTATION] Improve the Spark Configuration document about `spark.kryoserializer.buffer` (zhaishidan, 2015-07-14, 1 file changed, -1/+1)
  The meaning of spark.kryoserializer.buffer should be "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed." The spark.kryoserializer.buffer.max.mb key is out of date in Spark 1.4. Author: zhaishidan <zhaishidan@haizhi.com> Closes #7393 from stanzhai/master and squashes the following commits: 69729ef [zhaishidan] fix document error about spark.kryoserializer.buffer.max.mb
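  A hedged configuration example using the renamed keys (size strings such as "64k" and "64m" are accepted in Spark 1.4+):

  ```scala
  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.kryoserializer.buffer", "64k")     // initial buffer, one per core
    .set("spark.kryoserializer.buffer.max", "64m") // ceiling the buffer may grow to
  ```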
* [SPARK-9001] Fixing errors in javadocs that lead to failed build/sbt doc (Joseph Gonzalez, 2015-07-14, 6 files changed, -12/+21)
  These are minor corrections in the documentation of several classes that are preventing:

  ```bash
  build/sbt publish-local
  ```

  I believe this might be an issue associated with running JDK8, as ankurdave does not appear to have this issue in JDK7. Author: Joseph Gonzalez <joseph.e.gonzalez@gmail.com> Closes #7354 from jegonzal/FixingJavadocErrors and squashes the following commits: 6664b7e [Joseph Gonzalez] making requested changes 2e16d89 [Joseph Gonzalez] Fixing errors in javadocs that prevents build/sbt publish-local from completing.
* [SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning (Cheolsoo Park, 2015-07-13, 9 files changed, -44/+137)
  This PR supersedes my old one, #6921. Since my patch has changed quite a bit, I am opening a new PR to make it easier to review. The changes include:

  * Implement a `toMetastoreFilter()` function in `HiveShim` that takes `Seq[Expression]` and converts them into a filter string for the Hive metastore.
  * This function matches all the `AttributeReference` + `BinaryComparisonOp` + `Integral/StringType` patterns in `Seq[Expression]` and folds them into a string (a simplified sketch follows this entry).
  * Change the `hiveQlPartitions` field in `MetastoreRelation` to a `getHiveQlPartitions()` function that takes a filter string parameter.
  * Call `getHiveQlPartitions()` in `HiveTableScan` with a filter string.

  But there are some cases in which predicate pushdown is disabled:

  Case | Predicate pushdown
  ------- | -----------------------------
  Hive integral and string types | Yes
  Hive varchar type | No
  Hive 0.13 and newer | Yes
  Hive 0.12 and older | No
  convertMetastoreParquet=false | Yes
  convertMetastoreParquet=true | No

  In the case of `convertMetastoreParquet=true`, predicates are not pushed down because this conversion happens in an `Analyzer` rule (`HiveMetastoreCatalog.ParquetConversions`). At this point, `HiveTableScan` hasn't run, so predicates are not available. But reading the source code, I think it is intentional to convert the entire Hive table with all the partitions into `ParquetRelation`, because then `ParquetRelation` can be cached and reused for any query against that table. Please correct me if I am wrong. cc marmbrus Author: Cheolsoo Park <cheolsoop@netflix.com> Closes #7216 from piaozhexiu/SPARK-6910-2 and squashes the following commits: aa1490f [Cheolsoo Park] Fix ordering of imports c212c4d [Cheolsoo Park] Incorporate review comments 5e93f9d [Cheolsoo Park] Predicate pushdown into Hive metastore
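  A hedged, greatly simplified sketch of folding comparison predicates into a metastore filter string, in the spirit of `toMetastoreFilter()`; `Compare` is a hypothetical stand-in for the AttributeReference + BinaryComparisonOp patterns the real HiveShim code matches on:

  ```scala
  case class Compare(column: String, op: String, value: String)

  def toMetastoreFilter(predicates: Seq[Compare]): String =
    predicates.map(p => s"${p.column} ${p.op} ${p.value}").mkString(" and ")

  // toMetastoreFilter(Seq(Compare("ds", ">=", "\"2015-07-01\""), Compare("hr", "=", "12")))
  //   yields: ds >= "2015-07-01" and hr = 12
  ```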
* [SPARK-8743] [STREAMING] Deregister Codahale metrics for streaming when StreamingContext is closed (Neelesh Srinivas Salian, 2015-07-13, 2 files changed, -6/+45)
  The issue link: https://issues.apache.org/jira/browse/SPARK-8743 Deregister Codahale metrics for streaming when the StreamingContext is closed. Design: add the method calls in the appropriate start() and stop() methods of the StreamingContext. Actions in the pull request (a sketch of the register/remove pairing follows this entry):

  1) Added the registerSource method call to the start method of the StreamingContext.
  2) Added the removeSource method to the stop method.
  3) Added comments for both 1 and 2 and a comment to show initialization of the StreamingSource.
  4) Added a test case to check for both registration and de-registration of metrics.

  Previous closed PR for reference: https://github.com/apache/spark/pull/7250 Author: Neelesh Srinivas Salian <nsalian@cloudera.com> Closes #7362 from nssalian/branch-SPARK-8743 and squashes the following commits: 7d998a3 [Neelesh Srinivas Salian] Removed the Thread.sleep() call 8b26397 [Neelesh Srinivas Salian] Moved the scalatest.{} import 0e8007a [Neelesh Srinivas Salian] moved import org.apache.spark{} to correct place daedaa5 [Neelesh Srinivas Salian] Corrected Ordering of imports 8873180 [Neelesh Srinivas Salian] Removed redundancy in imports 59227a4 [Neelesh Srinivas Salian] Changed the ordering of the imports to classify scala and spark imports d8cb577 [Neelesh Srinivas Salian] Added registerSource to start() and removeSource to stop(). Wrote a test to check the registration and de-registration
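  A hedged sketch of the pairing described above; `MetricsSystem` here is a stand-in trait, not the signature of Spark's org.apache.spark.metrics.MetricsSystem:

  ```scala
  trait MetricsSystem {
    def registerSource(source: AnyRef): Unit
    def removeSource(source: AnyRef): Unit
  }

  class StreamingContextSketch(metrics: MetricsSystem, streamingSource: AnyRef) {
    def start(): Unit = metrics.registerSource(streamingSource)
    // Symmetric cleanup: a stopped context leaves no stale gauges behind.
    def stop(): Unit = metrics.removeSource(streamingSource)
  }
  ```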
* [SPARK-8533] [STREAMING] Upgrade Flume to 1.6.0 (Hari Shreedharan, 2015-07-13, 1 file changed, -1/+1)
  Author: Hari Shreedharan <hshreedharan@apache.org> Closes #6939 from harishreedharan/upgrade-flume-1.6.0 and squashes the following commits: 94b80ae [Hari Shreedharan] [SPARK-8533][Streaming] Upgrade Flume to 1.6.0
* [SPARK-8636] [SQL] Fix equalNullSafe comparison (Vinod K C, 2015-07-13, 2 files changed, -9/+6)
  Author: Vinod K C <vinod.kc@huawei.com> Closes #7040 from vinodkc/fix_CaseKeyWhen_equalNullSafe and squashes the following commits: be5e641 [Vinod K C] Renamed equalNullSafe to threeValueEquals aac9f67 [Vinod K C] Updated test suite and genCode method f2d0b53 [Vinod K C] Fix equalNullSafe comparison
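  For reference, a hedged model of the null-safe equality semantics (SQL's `<=>`) that `equalNullSafe` implements, with Option standing in for NULL: two NULLs compare equal, a single NULL compares unequal, and non-null values compare normally.

  ```scala
  def equalNullSafe[A](left: Option[A], right: Option[A]): Boolean =
    (left, right) match {
      case (None, None)       => true   // NULL <=> NULL is true
      case (Some(l), Some(r)) => l == r // ordinary comparison
      case _                  => false  // one side NULL: unequal
    }
  ```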
* [SPARK-8991] [ML] Update SharedParamsCodeGen's Generated Documentation (Vinod K C, 2015-07-13, 2 files changed, -21/+19)
  Removed private[ml] from the generated documentation. Author: Vinod K C <vinod.kc@huawei.com> Closes #7367 from vinodkc/fix_sharedparmascodegen and squashes the following commits: 4fa3c8f [Vinod K C] Adding auto generated code 7e19025 [Vinod K C] Removed private[ml]
* [SPARK-8954] [BUILD] Remove unneeded deb repository from Dockerfile to fix build error in docker (yongtang, 2015-07-13, 1 file changed, -5/+5)
  1. Remove the unneeded deb repository from the Dockerfile to fix the build error in docker. 2. Remove the unneeded /var/lib/apt/lists/* after install to reduce the docker image size (by ~30MB). Author: yongtang <yongtang@users.noreply.github.com> Closes #7346 from yongtang/SPARK-8954 and squashes the following commits: 36024a1 [yongtang] [SPARK-8954] [Build] Remove unneeded /var/lib/apt/lists/* after install to reduce the docker image size (by ~30MB) 7084941 [yongtang] [SPARK-8954] [Build] Remove unneeded deb repository from Dockerfile to fix build error in docker.
* Revert "[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark"Davies Liu2015-07-133-455/+10
| | | | This reverts commit 9b62e9375f032548d386aec7468e3d0f7c6da7b2.
* [SPARK-8950] [WEBUI] Correct the calculation of SchedulerDelay in StagePage (Carson Wang, 2015-07-13, 1 file changed, -23/+22)
  In StagePage, the SchedulerDelay is calculated as totalExecutionTime - executorRunTime - executorOverhead - gettingResultTime. But the totalExecutionTime is calculated in a way that doesn't include the gettingResultTime. Author: Carson Wang <carson.wang@intel.com> Closes #7319 from carsonwang/SchedulerDelayTime and squashes the following commits: f66fb6e [Carson Wang] Update the code style 7d971ae [Carson Wang] Correct the calculation of SchedulerDelay
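  A hedged restatement of the intended arithmetic (the real StagePage reads these values from task metrics, and the exact overhead terms may differ):

  ```scala
  // Scheduler delay is what remains of a task's wall-clock duration after the
  // run time, executor overhead, and result-fetch time are accounted for.
  def schedulerDelay(totalExecutionTime: Long,
                     executorRunTime: Long,
                     executorOverhead: Long,
                     gettingResultTime: Long): Long =
    math.max(0L,
      totalExecutionTime - executorRunTime - executorOverhead - gettingResultTime)
  ```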
* [SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark (MechCoder, 2015-07-13, 3 files changed, -10/+455)
  This adds Pylint checks to PySpark. For now this lazily installs Pylint using easy_install to /dev/pylint (similar to the pep8 script). We still need to figure out which rules should be allowed. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7241 from MechCoder/pylint and squashes the following commits: 8496834 [MechCoder] Silence warnings and make pylint tests fail to check if it works in jenkins 57393a3 [MechCoder] undefined-variable a8e2547 [MechCoder] Minor changes 7753810 [MechCoder] remove trailing whitespace 75c5d2b [MechCoder] Remove blacklisted arguments and pointless statements check 6bde250 [MechCoder] Disable all checks for now 3464666 [MechCoder] Add pylint configuration file d28109f [MechCoder] [SPARK-8706] [PySpark] [Project infra] Add pylint checks to PySpark
* [SPARK-6797] [SPARKR] Add support for YARN cluster mode. (Sun Rui, 2015-07-13, 15 files changed, -54/+133)
  This PR enables SparkR to dynamically ship the SparkR binary package to the AM node in YARN cluster mode, so that it is no longer required that the SparkR package be installed on each worker node. This PR uses the JDK jar tool to package the SparkR package, because jar is thought to be available on both Linux/Windows platforms where the JDK has been installed. This PR does not address the R worker involved in the RDD API; that will be addressed in a separate JIRA issue. This PR does not address the SBT build; SparkR installation and packaging by SBT will be addressed in a separate JIRA issue. R/install-dev.bat is not tested. shivaram, could you help to test it? Author: Sun Rui <rui.sun@intel.com> Closes #6743 from sun-rui/SPARK-6797 and squashes the following commits: ca63c86 [Sun Rui] Adjust MimaExcludes after rebase. 7313374 [Sun Rui] Fix unit test errors. 72695fb [Sun Rui] Fix unit test failures. 193882f [Sun Rui] Fix Mima test error. fe25a33 [Sun Rui] Fix Mima test error. 35ecfa3 [Sun Rui] Fix comments. c38a005 [Sun Rui] Unzipped SparkR binary package is still required for standalone and Mesos modes. b05340c [Sun Rui] Fix scala style. 2ca5048 [Sun Rui] Fix comments. 1acefd1 [Sun Rui] Fix scala style. 0aa1e97 [Sun Rui] Fix scala style. 41d4f17 [Sun Rui] Add support for locating SparkR package for R workers required by RDD APIs. 49ff948 [Sun Rui] Invoke jar.exe with full path in install-dev.bat. 7b916c5 [Sun Rui] Use 'rem' consistently. 3bed438 [Sun Rui] Add a comment. 681afb0 [Sun Rui] Fix a bug that RRunner does not handle client deployment modes. cedfbe2 [Sun Rui] [SPARK-6797][SPARKR] Add support for YARN cluster mode.
* [SPARK-8596] Add module for rstudio link to spark (Vincent D. Warmerdam, 2015-07-13, 1 file changed, -1/+1)
  shivaram, added a module for the RStudio install. Author: Vincent D. Warmerdam <vincentwarmerdam@gmail.com> Closes #7366 from koaning/rstudio-install and squashes the following commits: e47c2da [Vincent D. Warmerdam] added rstudio module
* [SPARK-8944] [SQL] Support casting between IntervalType and StringType (Wenchen Fan, 2015-07-13, 4 files changed, -1/+120)
  Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7355 from cloud-fan/fromString and squashes the following commits: 3bbb9d6 [Wenchen Fan] fix code gen 7dab957 [Wenchen Fan] naming fix 0fbbe19 [Wenchen Fan] address comments ac1f3d1 [Wenchen Fan] Support casting between IntervalType and StringType
* [SPARK-8203] [SPARK-8204] [SQL] conditional function: least/greatest (Daoyuan Wang, 2015-07-13, 5 files changed, -5/+263)
  chenghao-intel zhichao-li qiansl127 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #6851 from adrian-wang/udflg and squashes the following commits: 0f1bff2 [Daoyuan Wang] address comments from davis 7a6bdbb [Daoyuan Wang] add '.' for hex() c1f6824 [Daoyuan Wang] add codegen, test for all types ec625b0 [Daoyuan Wang] conditional function: least/greatest
* [SPARK-9006] [PYSPARK] fix microsecond loss in Python 3 (Davies Liu, 2015-07-12, 1 file changed, -1/+2)
  It may lose a microsecond if the timestamp is handled as a float; it should be an `int` instead. Author: Davies Liu <davies@databricks.com> Closes #7363 from davies/fix_microsecond and squashes the following commits: 36f6007 [Davies Liu] fix microsecond loss in Python 3
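  A hedged illustration of the hazard with assumed values (the actual fix is in PySpark's timestamp conversion): scaling fractional epoch seconds held in floating point can land just below the true microsecond count, so truncation may drop a microsecond, whereas an integer microsecond count is exact.

  ```scala
  val seconds: Double = 1436486400.123456      // fractional epoch seconds
  val truncated = (seconds * 1e6).toLong       // may come out one microsecond low
  val exactMicros = 1436486400123456L          // a Long carries the value exactly
  ```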
* [SPARK-8880] Fix confusing Stage.attemptId member variable (Kay Ousterhout, 2015-07-12, 3 files changed, -12/+18)
  Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #7275 from kayousterhout/SPARK-8880 and squashes the following commits: 3e9ce7c [Kay Ousterhout] Added missing return type e150278 [Kay Ousterhout] [SPARK-8880] Fix confusing Stage.attemptId member variable
* [SPARK-8970] [SQL] remove unnecessary abstraction for ExtractValue (Wenchen Fan, 2015-07-10, 3 files changed, -32/+15)
  Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7339 from cloud-fan/minor and squashes the following commits: 84a2128 [Wenchen Fan] remove unapply 6a37c12 [Wenchen Fan] remove unnecessary abstraction for ExtractValue
* [SPARK-8994] [ML] tiny cleanups to Params, Pipeline (Joseph K. Bradley, 2015-07-10, 2 files changed, -3/+3)
  Made the default impl of Params.validateParams empty. CC mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #7349 from jkbradley/pipeline-small-cleanups and squashes the following commits: 4e0f013 [Joseph K. Bradley] small cleanups after SPARK-5956
* [SPARK-6487] [MLLIB] Add sequential pattern mining algorithm PrefixSpan to Spark MLlib (zhangjiajin, 2015-07-10, 3 files changed, -0/+390)
  Add parallel PrefixSpan algorithm and test file. Support non-temporal sequences. Author: zhangjiajin <zhangjiajin@huawei.com> Author: zhang jiajin <zhangjiajin@huawei.com> Closes #7258 from zhangjiajin/master and squashes the following commits: ca9c4c8 [zhangjiajin] Modified the code according to the review comments. 574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization. ba5df34 [zhangjiajin] Fix a Scala style error. 4c60fb3 [zhangjiajin] Fix some Scala style errors. 1dd33ad [zhangjiajin] Modified the code according to the review comments. 89bc368 [zhangjiajin] Fixed a Scala style error. a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala 951fd42 [zhang jiajin] Delete Prefixspan.scala 575995f [zhangjiajin] Modified the code according to the review comments. 91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
* [SPARK-8598] [MLLIB] Implementation of 1-sample, two-sided, Kolmogorov-Smirnov Test for RDDs (jose.cambronero, 2015-07-10, 5 files changed, -2/+387)
  This contribution is my original work and I license it to the project under its open source license. Author: jose.cambronero <jose.cambronero@cloudera.com> Closes #6994 from josepablocam/master and squashes the following commits: bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name 0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md 1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf) 1bb44bd [jose.cambronero] style and doc changes. Factored out ks test into 2 separate tests 2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly 7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info e760ebd [jose.cambronero] line length changes to fit style check 3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty 9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty 1226b30 [jose.cambronero] reindent multi-line lambdas, prior interpretation of style guide was wrong on my part 9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs 3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity 992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach. 6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal) 4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below 0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm 16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request b9cff3a [jose.cambronero] made small changes to pass style check ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite 4da189b [jose.cambronero] added user facing ks test functions c659ea1 [jose.cambronero] created KS test class 13dfe4d [jose.cambronero] created test result class for ks test
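  The statistic itself is simple to state. A hedged single-machine sketch (the PR's implementation distributes the computation over an RDD in one pass after sorting): the KS statistic is the largest gap between the empirical CDF and the theoretical CDF, checked on both sides of each ECDF step.

  ```scala
  def ksStatistic(data: Array[Double], cdf: Double => Double): Double = {
    val sorted = data.sorted
    val n = sorted.length.toDouble
    sorted.zipWithIndex.map { case (x, i) =>
      val t = cdf(x)
      // ECDF jumps from i/n to (i+1)/n at x; check the gap on both sides.
      math.max(t - i / n, (i + 1) / n - t)
    }.max
  }
  ```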
* [SPARK-7735] [PYSPARK] Raise Exception on non-zero exit from pipe commands (Scott Taylor, 2015-07-10, 2 files changed, -2/+26)
  This will allow problems with piped commands to be detected. It will also allow tasks to be retried where errors are rare (such as network problems in piped commands). Author: Scott Taylor <github@megatron.me.uk> Closes #6262 from megatron-me-uk/patch-2 and squashes the following commits: 04ae1d5 [Scott Taylor] Remove spurious empty line 98fa101 [Scott Taylor] fix blank line style error 574b564 [Scott Taylor] Merge pull request #2 from megatron-me-uk/patch-4 0c1e762 [Scott Taylor] Update rdd pipe method for checkCode ab9a2e1 [Scott Taylor] Update rdd pipe tests for checkCode eb4801c [Scott Taylor] fix fail_condition b0ac3a4 [Scott Taylor] Merge pull request #1 from megatron-me-uk/megatron-me-uk-patch-1 a307d13 [Scott Taylor] update rdd tests to test pipe modes 34fcdc3 [Scott Taylor] add optional argument 'mode' for rdd.pipe a0c0161 [Scott Taylor] fix generator issue 8a9ef9c [Scott Taylor] make check_return_code an iterator 0486ae3 [Scott Taylor] style fixes 8ed89a6 [Scott Taylor] Chain generators to prevent potential deadlock 4153b02 [Scott Taylor] fix list.sort returns None 491d3fc [Scott Taylor] Pass a function handle to assertRaises 3344a21 [Scott Taylor] wrap assertRaises with QuietTest 3ab8c7a [Scott Taylor] remove whitespace for style cc1a73d [Scott Taylor] fix style issues in pipe test 8db4073 [Scott Taylor] Add a test for rdd pipe functions 1b3dc4e [Scott Taylor] fix missing space around operator style 0974f98 [Scott Taylor] add space between words in multiline string 45f4977 [Scott Taylor] fix line too long style error 5745d85 [Scott Taylor] Remove space to fix style f552d49 [Scott Taylor] Catch non-zero exit from pipe commands
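  A hedged Scala analogue of the behaviour added here (the actual change lives in PySpark's rdd.pipe): drain the command's output, then raise if the process exited non-zero so the task can fail and be retried.

  ```scala
  import scala.sys.process._

  def checkedPipe(command: String): Seq[String] = {
    val output = scala.collection.mutable.ArrayBuffer.empty[String]
    // Capture stdout and stderr while the process runs.
    val exitCode = command ! ProcessLogger(output += _, output += _)
    if (exitCode != 0)
      sys.error(s"Pipe command '$command' exited with code $exitCode")
    output.toSeq
  }
  ```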
* [SPARK-8961] [SQL] Make BaseWriterContainer.outputWriterForRow accept InternalRow instead of Row (Cheng Lian, 2015-07-10, 1 file changed, -31/+42)
  This is a follow-up of [SPARK-8888] [1], which also aims to optimize writing dynamic partitions. Three more changes can be made here:

  1. Using `InternalRow` instead of `Row` in `BaseWriterContainer.outputWriterForRow`.
  2. Using `Cast` expressions to convert partition columns to strings, so that we can leverage code generation.
  3. Replacing the FP-style `zip` and `map` calls with a faster imperative `while` loop (see the sketch after this entry).

  [1]: https://issues.apache.org/jira/browse/SPARK-8888 Author: Cheng Lian <lian@databricks.com> Closes #7331 from liancheng/spark-8961 and squashes the following commits: b5ab9ae [Cheng Lian] Casts Java iterator to Scala iterator explicitly 719e63b [Cheng Lian] Makes BaseWriterContainer.outputWriterForRow accepts InternalRow instead of Row
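  A hedged illustration of change 3 with a hypothetical helper, not the PR's code: building a partition path with an imperative while loop instead of zip/map avoids per-row intermediate collections on a hot write path.

  ```scala
  def partitionPath(names: Array[String], values: Array[String]): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < names.length) { // no zipped tuples or mapped collections allocated
      if (i > 0) sb.append('/')
      sb.append(names(i)).append('=').append(values(i))
      i += 1
    }
    sb.toString
  }
  // partitionPath(Array("year", "month"), Array("2015", "07")) == "year=2015/month=07"
  ```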
* add inline comment for python tests (Davies Liu, 2015-07-10, 1 file changed, -0/+1)
* [SPARK-8990] [SQL] DataFrameReader.parquet() should respect user specified options (Cheng Lian, 2015-07-10, 2 files changed, -1/+22)
  Author: Cheng Lian <lian@databricks.com> Closes #7347 from liancheng/spark-8990 and squashes the following commits: 045698c [Cheng Lian] SPARK-8990 DataFrameReader.parquet() should respect user specified options
* [SPARK-7078] [SPARK-7079] Binary processing sort for Spark SQL (Josh Rosen, 2015-07-10, 28 files changed, -138/+2254)
  This patch adds a cache-friendly external sorter which operates on serialized bytes and uses this sorter to implement a new sort operator for Spark SQL and DataFrames.

  ### Overview of the new sorter

  The new sorter design is inspired by [Alphasort](http://research.microsoft.com/pubs/68249/alphasort.doc) and implements a key-prefix optimization in order to improve the cache friendliness of the sort. In naive sort implementations, the sorting algorithm operates on an array of record pointers. To compare two records for ordering, the sorter must dereference these pointers, which likely involves random memory access, then compare the objects themselves. ![image](https://cloud.githubusercontent.com/assets/50748/8611390/3b1402ae-2675-11e5-8308-1a10bf347e6e.png) In a key-prefix sort, the sort operates on an array which stores the record pointer alongside a prefix of the record's key. When comparing two records for ordering, the sorter first compares the stored key prefixes. If the ordering can be determined from the key prefixes (i.e. the prefixes are unequal), then the sort can avoid directly comparing the records, avoiding random memory accesses and full record comparisons (a simplified model of this comparison follows this entry). For example, if we're sorting a list of strings then we can store the first 8 bytes of the UTF-8 encoded string as the key-prefix and can perform unsigned byte-at-a-time comparisons to determine the ordering of strings based on their prefixes, only resorting to full comparisons for strings that share a common prefix. In cases where the sort key can fit entirely in the space allotted for the key prefix (e.g. the sorting key is an integer), we completely avoid direct record comparison.

  In this patch's implementation of key-prefix sorting, our sorter's internal array stores a 64-bit long and 64-bit pointer for each record being sorted. The key prefixes are generated by the user when inserting records into the sorter, which uses a user-defined comparison function for comparing them. The `PrefixComparators` object implements a set of comparators for many common types, including primitive numeric types and UTF-8 strings.

  The actual sorting is implemented by `UnsafeInMemorySorter`. Most consumers will not use this directly, but instead will use `UnsafeExternalSorter`, a class which implements a sort that can spill to disk in response to memory pressure. Internally, `UnsafeExternalSorter` creates `UnsafeInMemorySorters` to perform sorting and uses `UnsafeSortSpillReader/Writer` to spill and read back runs of sorted records and `UnsafeSortSpillMerger` to merge multiple sorted spills into a single sorted iterator. This external sorter integrates with Spark's existing ShuffleMemoryManager for controlling spilling. Many parts of this sorter's design are based on / copied from the more specialized external sort implementation that I designed for the new UnsafeShuffleManager write path; see #5868 for more details on that patch.

  ### Sorting rows in Spark SQL

  For now, `UnsafeExternalSorter` is only used by Spark SQL, which uses it to implement a new sort operator, `UnsafeExternalSort`. This sort operator uses a SQL-specific class called `UnsafeExternalRowSorter` that configures an `UnsafeExternalSorter` to use prefix generators and comparators that operate on rows encoded in the UnsafeRow format that was designed for Project Tungsten. I used some interesting unit-testing techniques to test this patch's SQL-specific components: `UnsafeExternalSortSuite` uses the SQL random data generators introduced in #7176 to test the UnsafeSort operator with all atomic types, both with and without nullability and in both ascending and descending sort orders. `PrefixComparatorsSuite` contains a cool use of ScalaCheck + ScalaTest's `GeneratorDrivenPropertyChecks` in order to test UTF8String prefix comparison.

  ### Misc. additional improvements made in this patch

  This patch made several miscellaneous improvements to related code in Spark SQL:

  - The logic for selecting physical sort operator implementations, which was partially duplicated in both `Exchange` and `SparkStrategies`, has now been consolidated into a `getSortOperator()` helper function in `SparkStrategies`.
  - The `SparkPlanTest` unit testing helper trait has been extended with new methods for comparing the output produced by two different physical plans. This makes it easy to write tests which assert that two physical operator implementations should produce the same output. I also added a method for disabling the implicit sorting of outputs prior to comparing them, a change which is necessary in order to be able to write proper SparkPlan tests for sort operators.

  ### Tasks deferred to followup patches

  While most of this patch's features are reasonably well-tested and complete, there are a number of tasks that are intentionally being deferred to followup patches:

  - Add tests which mock the ShuffleMemoryManager to check that memory pressure properly triggers spilling (there are examples of this type of test in #5868).
  - Add tests to ensure that spill files are properly cleaned up after errors. I'd like to do this in the context of a patch which introduces more general metrics for ensuring proper cleanup of tasks' temporary files; see https://issues.apache.org/jira/browse/SPARK-8966 for more details.
  - Metrics integration: there are some open questions regarding how to track / report spill metrics for non-shuffle operations, so I've deferred most of the IO / shuffle metrics integration for now.
  - Performance profiling.

  Author: Josh Rosen <joshrosen@databricks.com> Closes #6444 from JoshRosen/sql-external-sort and squashes the following commits: 6beb467 [Josh Rosen] Remove a bunch of overloaded methods to avoid default args. issue 2bbac9c [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort 35dad9f [Josh Rosen] Make sortAnswers = false the default in SparkPlanTest 5135200 [Josh Rosen] Fix spill reading for large rows; add test 2f48777 [Josh Rosen] Add test and fix bug for sorting empty arrays d1e28bc [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort cd05866 [Josh Rosen] Fix scalastyle 3947fc1 [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort d13ac55 [Josh Rosen] Hacky approach to copying of UnsafeRows for sort followed by limit. 845bea3 [Josh Rosen] Remove unnecessary zeroing of row conversion buffer c56ec18 [Josh Rosen] Clean up final row copying code. d31f180 [Josh Rosen] Re-enable NullType sorting test now that SPARK-8868 is fixed 844f4ca [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort 293f109 [Josh Rosen] Add missing license header. f99a612 [Josh Rosen] Fix bugs in string prefix comparison. 9d00afc [Josh Rosen] Clean up prefix comparators for integral types 88aff18 [Josh Rosen] NULL_PREFIX has to be negative infinity for floating point types 613e16f [Josh Rosen] Test with larger data. 1d7ffaa [Josh Rosen] Somewhat hacky fix for descending sorts 08701e7 [Josh Rosen] Fix prefix comparison of null primitives. b86e684 [Josh Rosen] Set global = true in UnsafeExternalSortSuite. 1c7bad8 [Josh Rosen] Make sorting of answers explicit in SparkPlanTest.checkAnswer(). b81a920 [Josh Rosen] Temporarily enable only the passing sort tests 5d6109d [Josh Rosen] Fix inconsistent handling / encoding of record lengths. 87b6ed9 [Josh Rosen] Fix critical issues in test which led to false negatives. 8d7fbe7 [Josh Rosen] Fixes to multiple spilling-related bugs. 82e21c1 [Josh Rosen] Force spilling in UnsafeExternalSortSuite. 88b72db [Josh Rosen] Test ascending and descending sort orders. f27be09 [Josh Rosen] Fix tests by binding attributes. 0a79d39 [Josh Rosen] Revert "Undo part of a SparkPlanTest change in #7162 that broke my test." 7c3c864 [Josh Rosen] Undo part of a SparkPlanTest change in #7162 that broke my test. 9969c14 [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort 5822e6f [Josh Rosen] Fix test compilation issue 939f824 [Josh Rosen] Remove code gen experiment. 0dfe919 [Josh Rosen] Implement prefix sort for strings (albeit inefficiently). 66a813e [Josh Rosen] Prefix comparators for float and double b310c88 [Josh Rosen] Integrate prefix comparators for Int and Long (others coming soon) 95058d9 [Josh Rosen] Add missing SortPrefixUtils file 4c37ba6 [Josh Rosen] Add tests for sorting on all primitive types. 6890863 [Josh Rosen] Fix memory leak on empty inputs. d246e29 [Josh Rosen] Fix consideration of column types when choosing sort implementation. 6b156fb [Josh Rosen] Some WIP work on prefix comparison. 7f875f9 [Josh Rosen] Commit failing test demonstrating bug in handling objects in spills 41b8881 [Josh Rosen] Get UnsafeInMemorySorterSuite to pass (WIP) 90c2b6a [Josh Rosen] Update test name 6d6a1e6 [Josh Rosen] Centralize logic for picking sort operator implementations 9869ec2 [Josh Rosen] Clean up Exchange code a bit 82bb0ec [Josh Rosen] Fix IntelliJ complaint due to negated if condition 1db845a [Josh Rosen] Many more changes to harmonize with shuffle sorter ebf9eea [Josh Rosen] Harmonization with shuffle's unsafe sorter 206bfa2 [Josh Rosen] Add some missing newlines at the ends of files 26c8931 [Josh Rosen] Back out some Hive changes that aren't needed anymore 62f0bb8 [Josh Rosen] Update to reflect SparkPlanTest changes 21d7d93 [Josh Rosen] Back out of BlockObjectWriter change 7eafecf [Josh Rosen] Port test to SparkPlanTest d468a88 [Josh Rosen] Update for InternalRow refactoring 269cf86 [Josh Rosen] Back out SMJ operator change; isolate changes to selection of sort op. 1b841ca [Josh Rosen] WIP towards copying b420a71 [Josh Rosen] Move most of the existing SMJ code into Java. dfdb93f [Josh Rosen] SparkFunSuite change 73cc761 [Josh Rosen] Fix whitespace 9cc98f5 [Josh Rosen] Move more code to Java; fix bugs in UnsafeRowConverter length type. c8792de [Josh Rosen] Remove some debug logging dda6752 [Josh Rosen] Commit some missing code from an old git stash. 58f36d0 [Josh Rosen] Merge in a sketch of a unit test for the new sorter (now failing). 2bd8c9a [Josh Rosen] Import my original tests and get them to pass. d5d3106 [Josh Rosen] WIP towards external sorter for Spark SQL.
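  A hedged, greatly simplified model of the key-prefix comparison described in the overview (the real sorter works on raw memory pages, not objects): sort (prefix, pointer) pairs and dereference records only when two prefixes tie. Unsigned prefix comparison matches the UTF-8 byte-prefix example.

  ```scala
  case class PrefixedPointer(prefix: Long, recordIndex: Int)

  def prefixSort(entries: Array[PrefixedPointer],
                 fullCompare: (Int, Int) => Int): Array[PrefixedPointer] =
    entries.sortWith { (a, b) =>
      if (a.prefix != b.prefix)
        java.lang.Long.compareUnsigned(a.prefix, b.prefix) < 0 // cheap, cache-friendly path
      else
        fullCompare(a.recordIndex, b.recordIndex) < 0 // rare full record comparison
    }
  ```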
* [SPARK-8923] [DOCUMENTATION, MLLIB] Add @since tags to mllib.fpm (rahulpalamuttam, 2015-07-10, 2 files changed, -0/+29)
  Author: rahulpalamuttam <rahulpalamut@gmail.com> Closes #7341 from rahulpalamuttam/TaggingMLlibfpm and squashes the following commits: bef2843 [rahulpalamuttam] fix @since tags in mllib.fpm cd86252 [rahulpalamuttam] Add @since tags to mllib.fpm
* [HOTFIX] fix flaky test in PySpark SQL (Davies Liu, 2015-07-10, 1 file changed, -2/+3)
  It may lose precision in microseconds when using a float for it. Author: Davies Liu <davies@databricks.com> Closes #7344 from davies/fix_date_test and squashes the following commits: 249ec61 [Davies Liu] fix flaky test
* [SPARK-8675] Executors created by LocalBackend won't get the same classpath as other executor backends (Min Zhou, 2015-07-10, 1 file changed, -2/+17)
  AFAIK, some Spark applications always use LocalBackend to do some local initiatives; Spark SQL is an example. Starting a LocalBackend won't add the user classpath into the executor:

  ```scala
  override def start() {
    localEndpoint = SparkEnv.get.rpcEnv.setupEndpoint(
      "LocalBackendEndpoint", new LocalEndpoint(SparkEnv.get.rpcEnv, scheduler, this, totalCores))
  }
  ```

  This will cause the local executor to fail in these scenarios: loading hadoop built-in native libraries, loading other user-defined native libraries, loading user jars, reading s3 config from a site.xml file, etc. Author: Min Zhou <coderplay@gmail.com> Closes #7091 from coderplay/master and squashes the following commits: 365838f [Min Zhou] Fixed java.net.MalformedURLException, add default scheme, support relative path d215b7f [Min Zhou] Follows spark standard scala style, make the auto testing happy 84ad2cd [Min Zhou] Use system specific path separator instead of ',' 01f5d1a [Min Zhou] Merge branch 'master' of https://github.com/apache/spark e528be7 [Min Zhou] Merge branch 'master' of https://github.com/apache/spark 45bf62c [Min Zhou] SPARK-8675 Executors created by LocalBackend won't get the same classpath as other executor backends
* [CORE] [MINOR] change the log level to info (Cheng Hao, 2015-07-10, 1 file changed, -1/+1)
  Too many logs are emitted even when the log level is set to warning. Author: Cheng Hao <hao.cheng@intel.com> Closes #7340 from chenghao-intel/log and squashes the following commits: 59658cf [Cheng Hao] change the log level to info
* [SPARK-8958] Dynamic allocation: change cached timeout to infinity (Andrew Or, 2015-07-10, 2 files changed, -3/+3)
  pwendell and I discussed this a little more offline and concluded that it would be good to keep it more conservative. Losing cached blocks may be very expensive and we should only allow it if the user knows what he/she is doing. FYI harishreedharan sryza. Author: Andrew Or <andrew@databricks.com> Closes #7329 from andrewor14/da-cached-timeout and squashes the following commits: cef0b4e [Andrew Or] Change timeout to infinity
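  A hedged configuration example (property name as in Spark's dynamic allocation documentation): after this change the cached-executor idle timeout defaults to infinity, and users who can tolerate losing cached blocks may lower it explicitly.

  ```scala
  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    // Opt in to reclaiming executors that hold cached blocks after 10 minutes.
    .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s")
  ```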
* [SPARK-7944] [SPARK-8013] Remove most of the Spark REPL fork for Scala 2.11 (Iulian Dragos, 2015-07-10, 11 files changed, -3181/+90)
  This PR removes most of the code in the Spark REPL for Scala 2.11 and leaves just a couple of overridden methods in `SparkILoop` in order to:

  - change the welcome message
  - restrict available commands (like `:power`)
  - initialize the Spark context

  The two codebases have diverged and it's extremely hard to backport fixes from the upstream REPL. This somewhat radical step is absolutely necessary in order to fix other REPL tickets (like SPARK-8013, the Hive Thrift server for 2.11). BTW, the Scala REPL has fixed the serialization-unfriendly wrappers thanks to ScrapCodes's work in [#4522](https://github.com/scala/scala/pull/4522). All tests pass and I tried the `spark-shell` on our Mesos cluster with some simple jobs (including with additional jars); everything looked good. As soon as Scala 2.11.7 is out we need to upgrade and get a shaded `jline` dependency, clearing the way for SPARK-8013. /cc pwendell Author: Iulian Dragos <jaguarul@gmail.com> Closes #6903 from dragos/issue/no-spark-repl-fork and squashes the following commits: c596c6f [Iulian Dragos] Merge branch 'master' into issue/no-spark-repl-fork 2b1a305 [Iulian Dragos] Removed spaces around multiple imports. 0ce67a6 [Iulian Dragos] Remove -verbose flag for java compiler (added by mistake in an earlier commit). 10edaf9 [Iulian Dragos] Keep the jline dependency only in the 2.10 build. 529293b [Iulian Dragos] Add back Spark REPL files to rat-excludes, since they are part of the 2.10 REPL. d85370d [Iulian Dragos] Remove jline dependency from the Spark REPL. b541930 [Iulian Dragos] Merge branch 'master' into issue/no-spark-repl-fork 2b15962 [Iulian Dragos] Change jline dependency and bump Scala version. b300183 [Iulian Dragos] Rename package and add license on top of the file, remove files from rat-excludes and removed `-Yrepl-sync` per reviewer's request. 9d46d85 [Iulian Dragos] Fix SPARK-7944. abcc7cb [Iulian Dragos] Remove the REPL forked code.
* [SPARK-7977] [BUILD] Disallowing println (Jonathan Alter, 2015-07-10, 182 files changed, -135/+478)
  Author: Jonathan Alter <jonalter@users.noreply.github.com> Closes #7093 from jonalter/SPARK-7977 and squashes the following commits: ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite 7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite 10724b6 [Jonathan Alter] Changing some printlns to logs in tests eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 0b1dcb4 [Jonathan Alter] More println cleanup aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 0c16fa3 [Jonathan Alter] Replacing some printlns with logs 45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 5c8e283 [Jonathan Alter] Allowing println in audit-release examples 5b50da1 [Jonathan Alter] Allowing printlns in example files ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 83ab635 [Jonathan Alter] Fixing new printlns 54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns b837c3a [Jonathan Alter] Disallowing println
* [DOCS] Added important updateStateByKey details (Michael Vogiatzis, 2015-07-09, 1 file changed, -0/+2)
  The update function runs for *all* existing keys, and returning "None" will remove the key-value pair. Author: Michael Vogiatzis <michaelvogiatzis@gmail.com> Closes #7229 from mvogiatzis/patch-1 and squashes the following commits: e7a2946 [Michael Vogiatzis] Updated updateStateByKey text 00283ed [Michael Vogiatzis] Removed space c2656f9 [Michael Vogiatzis] Moved description farther up 0a42551 [Michael Vogiatzis] Added important updateStateByKey details
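  A hedged example of the documented semantics (hypothetical running-count state): the update function is invoked for every existing key on each batch, and returning None removes that key's state.

  ```scala
  import org.apache.spark.streaming.dstream.DStream

  def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
    pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      val total = newValues.sum + state.getOrElse(0)
      if (newValues.isEmpty && total == 0) None // drop the key-value pair
      else Some(total)                          // keep or update the state
    }
  ```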
* [SPARK-8839] [SQL] ThriftServer2 will remove session and execution no matter it's finished or not (huangzhaowei, 2015-07-09, 1 file changed, -2/+5)
  In my test, the numbers of `sessions` and `executions` in ThriftServer2 do not match the connection number. For example, if there are 200 clients connecting to the server, it will have more than 200 `sessions` and `executions`. So when it reaches `retainedStatements`, it has to remove some objects which are not finished, which may cause the exception described at the [JIRA address](https://issues.apache.org/jira/browse/SPARK-8839). Author: huangzhaowei <carlmartinmax@gmail.com> Closes #7239 from SaintBacchus/SPARK-8839 and squashes the following commits: cf7ef40 [huangzhaowei] Remove a meaningless function call 3e9a5a6 [huangzhaowei] Add a filter before take 9d5ceb8 [huangzhaowei] [SPARK-8839][SQL] ThriftServer2 will remove session and execution no matter it's finished or not.
* [SPARK-8913] [ML] Simplify LogisticRegression suite to use Vector <-> Vector comparison (Holden Karau, 2015-07-09, 1 file changed, -96/+39)
  Cleanup tests from SPARK-8700. Author: Holden Karau <holden@pigscanfly.ca> Closes #7335 from holdenk/SPARK-8913-cleanup-tests-from-SPARK-8700-logistic-regression-r2-really-logistic-regression-this-time and squashes the following commits: e5e2c5f [Holden Karau] Simplify LogisticRegression suite to use Vector <-> Vector comparisons instead of comparing element by element
* [SPARK-8852] [FLUME] Trim dependencies in flume assembly. (Marcelo Vanzin, 2015-07-09, 2 files changed, -73/+100)
  Also, add support for the *-provided profiles. This avoids repackaging things that are already in the Spark assembly, or, in the case of the *-provided profiles, are provided by the distribution. The flume-ng-auth dependency was also excluded since it's not really used by Spark. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7247 from vanzin/SPARK-8852 and squashes the following commits: 298a7d5 [Marcelo Vanzin] Feedback. c962082 [Marcelo Vanzin] [SPARK-8852] [flume] Trim dependencies in flume assembly.
* [SPARK-8959] [SQL] [HOTFIX] Removes parquet-thrift and libthrift dependencies (Cheng Lian, 2015-07-09, 7 files changed, -3494/+8)
  These two dependencies were introduced in #7231 to help test Parquet compatibility with `parquet-thrift`. However, they somehow crash the Scala compiler in Maven builds. This PR fixes the issue by: 1. removing these two dependencies, and 2. checking in an actual testing Parquet file generated by `parquet-thrift` as a test resource, instead of generating it programmatically. This is just a quick fix to bring back Maven builds. We still need to figure out the root cause, as binary Parquet files are harder to maintain. Author: Cheng Lian <lian@databricks.com> Closes #7330 from liancheng/spark-8959 and squashes the following commits: cf69512 [Cheng Lian] Brings back Maven builds
* [SPARK-8538] [SPARK-8539] [ML] Linear Regression Training and Testing Results (Feynman Liang, 2015-07-09, 2 files changed, -6/+192)
  Adds results (e.g. objective value at each iteration, residuals) on training and user-specified test sets for LinearRegressionModel. Notes to reviewers:

  * Are the `*TrainingResults` and `Results` classes too specialized for `LinearRegressionModel`? Where would be an appropriate level of abstraction?
  * Please check that the `transient` annotations are correct; the datasets should not be copied and kept during serialization.
  * Any thoughts on `RDD`s versus `DataFrame`s? If using `DataFrame`s, suggested schemas for each intermediate step? Also, how to create a "local DataFrame" without a `sqlContext`?

  Author: Feynman Liang <fliang@databricks.com> Closes #7099 from feynmanliang/SPARK-8538 and squashes the following commits: d219fa4 [Feynman Liang] Update docs 4a42680 [Feynman Liang] Change Summary to hold values, move transient annotations down to metrics and predictions DF 6300031 [Feynman Liang] Code review changes 0a5e762 [Feynman Liang] Fix build error e71102d [Feynman Liang] Merge branch 'master' into SPARK-8538 3367489 [Feynman Liang] Merge branch 'master' into SPARK-8538 70f267c [Feynman Liang] Make TrainingSummary transient and remove Serializable from *Summary and RegressionMetrics 1d9ea42 [Feynman Liang] Fix failing Java test a65dfda [Feynman Liang] Make TrainingSummary and metrics serializable, prediction dataframe transient 0a605d8 [Feynman Liang] Replace Params from LinearRegression*Summary with private constructor vals c2fe835 [Feynman Liang] Optimize imports 02d8a70 [Feynman Liang] Add Params to LinearModel*Summary, refactor tests and add test for evaluate() 8f999f4 [Feynman Liang] Refactor from jkbradley code review 072e948 [Feynman Liang] Style 509ae36 [Feynman Liang] Use DFs and localize serialization to LinearRegressionModel 9509c79 [Feynman Liang] Fix imports b2bbaa3 [Feynman Liang] Refactored LinearRegressionResults API to be more private ffceaec [Feynman Liang] Merge branch 'master' into SPARK-8538 1cedb2b [Feynman Liang] Add test for decreasing objective trace dab0aff [Feynman Liang] Add LinearRegressionTrainingResults tests, make test suite code copy+pasteable 97b0a81 [Feynman Liang] Add LinearRegressionModel.evaluate() to get results on test sets dc51bce [Feynman Liang] Style guide fixes 521f397 [Feynman Liang] Use RDD[(Double, Double)] instead of DF 2ff5710 [Feynman Liang] Add training results and model summary to ML LinearRegression