aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-9867] [SQL] Move utilities for binary data into ByteArrayTakeshi YAMAMURO2015-10-013-51/+52
| | | | | | | | The utilities such as Substring#substringBinarySQL and BinaryPrefixComparator#computePrefix for binary data are put together in ByteArray for easy-to-read. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #8122 from maropu/CleanUpForBinaryType.
* [SPARK-10400] [SQL] Renames SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPECCheng Lian2015-10-016-148/+231
| | | | | | | | | | | | We introduced SQL option `spark.sql.parquet.followParquetFormatSpec` while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec to write Parquet files. This option defaults to `false` and is marked as a non-public option (`isPublic = false`) because we haven't finished refactored Parquet write path. The problem is, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to `spark.sql.parquet.writeLegacyFormat`, and invert its default value (the two option names have opposite meanings). Although this option is private in 1.5, we'll make it public in 1.6 after refactoring Parquet write path. So that users can decide whether to write Parquet files in standard format or legacy format. Author: Cheng Lian <lian@databricks.com> Closes #8566 from liancheng/spark-10400/deprecate-follow-parquet-format-spec.
* [SPARK-10671] [SQL] Throws an analysis exception if we cannot find Hive UDFsWenchen Fan2015-10-012-23/+104
| | | | | | | | Takes over https://github.com/apache/spark/pull/8800 Author: Wenchen Fan <cloud0fan@163.com> Closes #8941 from cloud-fan/hive-udf.
* [SPARK-10865] [SPARK-10866] [SQL] Fix bug of ceil/floor, which should ↵Cheng Hao2015-10-013-11/+31
| | | | | | | | | | | | returns long instead of the Double type Floor & Ceiling function should returns Long type, rather than Double. Verified with MySQL & Hive. Author: Cheng Hao <hao.cheng@intel.com> Closes #8933 from chenghao-intel/ceiling.
* [SPARK-10058] [CORE] [TESTS] Fix the flaky tests in HeartbeatReceiverSuitezsxwing2015-10-012-16/+60
| | | | | | | | | | | | Fixed the test failure here: https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/ This failure is because `HeartbeatReceiverSuite. heartbeatReceiver` may receive `SparkListenerExecutorAdded("driver")` sent from [LocalBackend](https://github.com/apache/spark/blob/8fb3a65cbb714120d612e58ef9d12b0521a83260/core/src/main/scala/org/apache/spark/scheduler/local/LocalBackend.scala#L121). There are other race conditions in `HeartbeatReceiverSuite` because `HeartbeatReceiver.onExecutorAdded` and `HeartbeatReceiver.onExecutorRemoved` are asynchronous. This PR also fixed them. Author: zsxwing <zsxwing@gmail.com> Closes #8946 from zsxwing/SPARK-10058.
* [SPARK-10807] [SPARKR] Added as.data.frame as a synonym for collectOscar D. Lara Yejas2015-09-304-1/+39
| | | | | | | | | | Created method as.data.frame as a synonym for collect(). Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu> Author: olarayej <oscar.lara.yejas@us.ibm.com> Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com> Closes #8908 from olarayej/SPARK-10807.
* [SPARK-9617] [SQL] Implement json_tupleNathan Howell2015-09-304-4/+316
| | | | | | | | This is an implementation of Hive's `json_tuple` function using Jackson Streaming. Author: Nathan Howell <nhowell@godaddy.com> Closes #7946 from NathanHowell/SPARK-9617.
* [SPARK-10770] [SQL] SparkPlan.executeCollect/executeTake should return ↵Reynold Xin2015-09-309-39/+39
| | | | | | | | InternalRow rather than external Row. Author: Reynold Xin <rxin@databricks.com> Closes #8900 from rxin/SPARK-10770-1.
* [SPARK-10851] [SPARKR] Exception not failing R applications (in yarn cluster ↵Sun Rui2015-09-301-3/+7
| | | | | | | | | | | | mode) The YARN backend doesn't like when user code calls System.exit, since it cannot know the exit status and thus cannot set an appropriate final status for the application. This PR remove the usage of system.exit to exit the RRunner. Instead, when the R process running an SparkR script returns an exit code other than 0, throws SparkUserAppException which will be caught by ApplicationMaster and ApplicationMaster knows it failed. For other failures, throws SparkException. Author: Sun Rui <rui.sun@intel.com> Closes #8938 from sun-rui/SPARK-10851.
* [SPARK-9741] [SQL] Approximate Count Distinct using the new UDAF interface.Herman van Hovell2015-09-303-0/+554
| | | | | | | | | | | | | | This PR implements a HyperLogLog based Approximate Count Distinct function using the new UDAF interface. The implementation is inspired by the ClearSpring HyperLogLog implementation and should produce the same results. There is still some documentation and testing left to do. cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #8362 from hvanhovell/SPARK-9741.
* [SPARK-10736] [ML] Use 1 for all ratings if $(ratingCol) = ""Yanbo Liang2015-09-291-2/+2
| | | | | | | | For some implicit dataset, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This should happen when users set ```ratingCol``` to an empty string. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8937 from yanboliang/spark-10736.
* [SPARK-10811] [SQL] Eliminates unnecessary byte array copyingCheng Lian2015-09-293-10/+28
| | | | | | | | | | | | | | When reading Parquet string and binary-backed decimal values, Parquet `Binary.getBytes` always returns a copied byte array, which is unnecessary. Since the underlying implementation of `Binary` values there is guaranteed to be `ByteArraySliceBackedBinary`, and Parquet itself never reuses underlying byte arrays, we can use `Binary.toByteBuffer.array()` to steal the underlying byte arrays without copying them. This brings performance benefits when scanning Parquet string and binary-backed decimal columns. Note that, this trick doesn't cover binary-backed decimals with precision greater than 18. My micro-benchmark result is that, this brings a ~15% performance boost for scanning TPC-DS `store_sales` table (scale factor 15). Another minor optimization done in this PR is that, now we directly construct a Java `BigDecimal` in `Decimal.toJavaBigDecimal` without constructing a Scala `BigDecimal` first. This brings another ~5% performance gain. Author: Cheng Lian <lian@databricks.com> Closes #8907 from liancheng/spark-10811/eliminate-array-copying.
* [SPARK-10782] [PYTHON] Update dropDuplicates documentationasokadiggs2015-09-291-0/+2
| | | | | | | | Documentation for dropDuplicates() and drop_duplicates() is one and the same. Resolved the error in the example for drop_duplicates using the same approach used for groupby and groupBy, by indicating that dropDuplicates and drop_duplicates are aliases. Author: asokadiggs <asoka.diggs@intel.com> Closes #8930 from asokadiggs/jira-10782.
* [SPARK-6919] [PYSPARK] Add asDict method to StatCounterErik Shilts2015-09-292-0/+42
| | | | | | | | | | | | Add method to easily convert a StatCounter instance into a Python dict https://issues.apache.org/jira/browse/SPARK-6919 Note: This is my original work and the existing Spark license applies. Author: Erik Shilts <erik.shilts@opower.com> Closes #5516 from eshilts/statcounter-asdict.
* [SPARK-10415] [PYSPARK] [MLLIB] [DOCS] Enhance Navigation Sidebar in PySpark APInoelsmith2015-09-294-2/+197
| | | | | | | | | | | | | | | | | | These are CSS/JavaScript changes changes to make navigation in the PySpark API a bit simpler by adding the following to the sidebar: * Classes * Functions * Tags to highlight experimental features ![screen shot 2015-09-02 at 08 50 12](https://cloud.githubusercontent.com/assets/11915197/9634781/301f853a-518b-11e5-8d5c-fda202f6202f.png) Online example here: https://dl.dropboxusercontent.com/u/20821334/pyspark-api-nav-enhance/pyspark.mllib.html (The contribution is my original work and that I license the work to the project under the project's open source license) Author: noelsmith <mail@noelsmith.com> Closes #8571 from noel-smith/pyspark-api-nav-enhance.
* [SPARK-10871] include number of executor failures in error msgRyan Williams2015-09-291-1/+1
| | | | | | Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #8939 from ryan-williams/errmsg.
* [SPARK-10825] [CORE] [TESTS] Fix race conditions in ↵zsxwing2015-09-291-113/+192
| | | | | | | | | | | | | | | | | | | | StandaloneDynamicAllocationSuite Fix the following issues in StandaloneDynamicAllocationSuite: 1. It should not assume master and workers start in order 2. It should not assume master and workers get ready at once 3. It should not assume the application is already registered with master after creating SparkContext 4. It should not access Master.app and idToApp which are not thread safe The changes includes: * Use `eventually` to wait until master and workers are ready to fix 1 and 2 * Use `eventually` to wait until the application is registered with master to fix 3 * Use `askWithRetry[MasterStateResponse](RequestMasterState)` to get the application info to fix 4 Author: zsxwing <zsxwing@gmail.com> Closes #8914 from zsxwing/fix-StandaloneDynamicAllocationSuite.
* [SPARK-10670] [ML] [Doc] add api reference for ml docYuhao Yang2015-09-281-64/+195
| | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-10670 In the Markdown docs for the spark.ml Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "Word2Vec" section in https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md This JIRA is just for spark.ml, not spark.mllib Author: Yuhao Yang <hhbyyh@gmail.com> Closes #8901 from hhbyyh/docAPI.
* [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSESean Owen2015-09-2840-679/+1153
| | | | | | | | | | | | | | In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps means that permissively-licensed dependencies has a different interpretation than we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment, Spark's LICENSE has a pointer to the project's license in the other project's source tree. The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or include their text in "licenses" subdirectory and point to that. Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way. The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing. Author: Sean Owen <sowen@cloudera.com> Closes #8919 from srowen/SPARK-10833.
* [SPARK-10859] [SQL] fix stats of StringType in columnar cacheDavies Liu2015-09-282-2/+9
| | | | | | | | | | The UTF8String may come from UnsafeRow, then underline buffer of it is not copied, so we should clone it in order to hold it in Stats. cc yhuai Author: Davies Liu <davies@databricks.com> Closes #8929 from davies/pushdown_string.
* [SPARK-10395] [SQL] Simplifies CatalystReadSupportCheng Lian2015-09-281-47/+45
| | | | | | | | | | Please refer to [SPARK-10395] [1] for details. [1]: https://issues.apache.org/jira/browse/SPARK-10395 Author: Cheng Lian <lian@databricks.com> Closes #8553 from liancheng/spark-10395/simplify-parquet-read-support.
* [SPARK-10790] [YARN] Fix initial executor number not set issue and ↵jerryshao2015-09-284-40/+27
| | | | | | | | | | | | consolidate the codes This bug is introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092), `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead will meet the problem as mentioned in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790). Also consolidate and simplify some similar code snippets to keep the consistent semantics. Author: jerryshao <sshao@hortonworks.com> Closes #8910 from jerryshao/SPARK-10790.
* [SPARK-10812] [YARN] Spark hadoop util support switching to yarnHolden Karau2015-09-284-16/+34
| | | | | | | | | | | | | | | | | | | | | | | | While this is likely not a huge issue for real production systems, for test systems which may setup a Spark Context and tear it down and stand up a Spark Context with a different master (e.g. some local mode & some yarn mode) tests this cane be an issue. Discovered during work on spark-testing-base on Spark 1.4.1, but seems like the logic that triggers it is present in master (see SparkHadoopUtil object). A valid work around for users encountering this issue is to fork a different JVM, however this can be heavy weight. ``` [info] SampleMiniClusterTest: [info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED *** [info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil [info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163) [info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257) [info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561) [info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) [info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57) [info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) [info] at org.apache.spark.SparkContext.<init>(SparkContext.scala:497) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186) [info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103) ``` Author: Holden Karau <holden@pigscanfly.ca> Closes #8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn.
* Fix two mistakes in programming-guide pageDavid Martin2015-09-281-2/+2
| | | | | | | | | seperate -> separate sees -> see Author: David Martin <dmartinpro@users.noreply.github.com> Closes #8928 from dmartinpro/patch-1.
* add doc for spark.streaming.stopGracefullyOnShutdownBin Wang2015-09-271-0/+8
| | | | | | Author: Bin Wang <wbin00@gmail.com> Closes #8898 from wb14123/doc.
* [SPARK-10720] [SQL] [JAVA] Add a java wrapper to create a dataframe from a ↵Holden Karau2015-09-272-17/+56
| | | | | | | | | | local list of java beans Similar to SPARK-10630 it would be nice if Java users didn't have to parallelize there data explicitly (as Scala users already can skip). Issue came up in http://stackoverflow.com/questions/32613413/apache-spark-machine-learning-cant-get-estimator-example-to-work Author: Holden Karau <holden@pigscanfly.ca> Closes #8879 from holdenk/SPARK-10720-add-a-java-wrapper-to-create-a-dataframe-from-a-local-list-of-java-beans.
* [SPARK-10741] [SQL] Hive Query Having/OrderBy against Parquet table is not ↵Wenchen Fan2015-09-2715-86/+103
| | | | | | | | | | | working https://issues.apache.org/jira/browse/SPARK-10741 I choose the second approach: do not change output exprIds when convert MetastoreRelation to LogicalRelation Author: Wenchen Fan <cloud0fan@163.com> Closes #8889 from cloud-fan/hot-bug.
* [SPARK-10778] [MLLIB] Implement toString for AssociationRules.Ruley-shimizu2015-09-271-0/+5
| | | | | | | | I implemented toString for AssociationRules.Rule, format like `[x, y] => {z}: 1.0` Author: y-shimizu <y.shimizu0429@gmail.com> Closes #8904 from y-shimizu/master.
* [SPARK-10845] [SQL] Makes spark.sql.hive.version a SQLConfEntryCheng Lian2015-09-263-1/+32
| | | | | | | | | | When refactoring SQL options from plain strings to the strongly typed `SQLConfEntry`, `spark.sql.hive.version` wasn't migrated, and doesn't show up in the result of `SET -v`, as `SET -v` only shows public `SQLConfEntry` instances. This affects compatibility with Simba ODBC driver. This PR migrates this SQL option as a `SQLConfEntry` to fix this issue. Author: Cheng Lian <lian@databricks.com> Closes #8925 from liancheng/spark-10845/hive-version-conf.
* [SPARK-10760] [SPARKR] SparkR glm: the documentation in examples - family ↵Narine Kokhlikyan2015-09-251-1/+2
| | | | | | | | | | | | | | | | | | | | | argument is missing Hi everyone, Since the family argument is required for the glm function, the execution of: model <- glm(Sepal_Length ~ Sepal_Width, df) is failing. I've fixed the documentation by adding the family argument and also added the summay(model) which will show the coefficients for the model. Thanks, Narine Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Closes #8870 from NarineK/sparkrml.
* [SPARK-9681] [ML] Support R feature interactions in RFormulaEric Liang2015-09-2510-60/+362
| | | | | | | | | | | | This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`). To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read attribute the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms. mengxr Author: Eric Liang <ekl@databricks.com> Closes #8830 from ericl/interaction-2.
* [SPARK-9852] Let reduce tasks fetch multiple map output partitionsMatei Zaharia2015-09-249-124/+306
| | | | | | | | | | | | | This makes two changes: - Allow reduce tasks to fetch multiple map output partitions -- this is a pretty small change to HashShuffleFetcher - Move shuffle locality computation out of DAGScheduler and into ShuffledRDD / MapOutputTracker; this was needed because the code in DAGScheduler wouldn't work for RDDs that fetch multiple map output partitions from each reduce task I also added an AdaptiveSchedulingSuite that creates RDDs depending on multiple map output partitions. Author: Matei Zaharia <matei@databricks.com> Closes #8844 from mateiz/spark-9852.
* [SPARK-10761] Refactor DiskBlockObjectWriter to not require BlockIdJosh Rosen2015-09-248-30/+27
| | | | | | | | The DiskBlockObjectWriter constructor took a BlockId parameter but never used it. As part of some general cleanup in these interfaces, this patch refactors its constructor to eliminate this parameter. Author: Josh Rosen <joshrosen@databricks.com> Closes #8871 from JoshRosen/disk-block-object-writer-blockid-cleanup.
* [SPARK-10705] [SQL] Avoid using external rows in DataFrame.toJSONLiang-Chi Hsieh2015-09-243-85/+3
| | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-10705 As described in the JIRA ticket, `DataFrame.toJSON` uses `DataFrame.mapPartitions`, which converts internal rows to external rows. We should use `queryExecution.toRdd.mapPartitions` that directly uses internal rows for better performance. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8865 from viirya/df-tojson-internalrow.
* [SPARK-10765] [SQL] use new aggregate interface for hive UDAFWenchen Fan2015-09-246-86/+129
| | | | | | Author: Wenchen Fan <cloud0fan@163.com> Closes #8874 from cloud-fan/hive-agg.
* Revert "[SPARK-6028][Core]A new RPC implemetation based on the network module"Xiangrui Meng2015-09-2431-1708/+68
| | | | This reverts commit 084e4e126211d74a79e8dbd2d0e604dd3c650822.
* [SPARK-10763] [ML] [JAVA] [TEST] Update Java MLLIB/ML tests to use ↵Holden Karau2015-09-2310-39/+42
| | | | | | | | | | simplified dataframe construction As introduced in https://issues.apache.org/jira/browse/SPARK-10630 we now have an easier way to create dataframes from local Java lists. Lets update the tests to use those. Author: Holden Karau <holden@pigscanfly.ca> Closes #8886 from holdenk/SPARK-10763-update-java-mllib-ml-tests-to-use-simplified-dataframe-construction.
* [SPARK-10692] [STREAMING] Expose failureReasons in BatchInfo for streaming ↵zsxwing2015-09-234-16/+115
| | | | | | | | | | | UI to clear failed batches Slightly modified version of #8818, all credit goes to zsxwing Author: zsxwing <zsxwing@gmail.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8892 from tdas/SPARK-10692.
* [SPARK-10474] [SQL] Aggregation fails to allocate memory for pointer array ↵Andrew Or2015-09-233-52/+16
| | | | | | | | | | | | | | (round 2) This patch reverts most of the changes in a previous fix #8827. The real cause of the issue is that in `TungstenAggregate`'s prepare method we only reserve 1 page, but later when we switch to sort-based aggregation we try to acquire 1 page AND a pointer array. The longer-term fix should be to reserve also the pointer array, but for now ***we will simply not track the pointer array***. (Note that elsewhere we already don't track the pointer array, e.g. [here](https://github.com/apache/spark/blob/a18208047f06a4244703c17023bb20cbe1f59d73/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java#L88)) Note: This patch reuses the unit test added in #8827 so it doesn't show up in the diff. Author: Andrew Or <andrew@databricks.com> Closes #8888 from andrewor14/dont-track-pointer-array.
* [SPARK-6028][Core]A new RPC implemetation based on the network modulezsxwing2015-09-2331-68/+1708
| | | | | | | | Design doc: https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing Author: zsxwing <zsxwing@gmail.com> Closes #6457 from zsxwing/new-rpc.
* [SPARK-10731] [SQL] Delegate to Scala's DataFrame.take implementation in ↵Reynold Xin2015-09-234-12/+25
| | | | | | | | | | | | Python DataFrame. Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take). This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion. Author: Reynold Xin <rxin@databricks.com> Closes #8876 from rxin/SPARK-10731.
* [SPARK-10699] [ML] Support checkpointInterval can be disabledYanbo Liang2015-09-236-10/+9
| | | | | | | | | Currently use can set ```checkpointInterval``` to specify how often should the cache be check-pointed. But we also need the function that users can disable it. This PR supports that users can disable checkpoint if user setting ```checkpointInterval = -1```. We also add documents for GBT ```cacheNodeIds``` to make users can understand more clearly about checkpoint. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8820 from yanboliang/spark-10699.
* [SPARK-10686] [ML] Add quantilesCol to AFTSurvivalRegressionYanbo Liang2015-09-232-34/+91
| | | | | | | | By default ```quantilesCol``` should be empty. If ```quantileProbabilities``` is set, we should append quantiles as a new column (of type Vector). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8836 from yanboliang/spark-10686.
* [SPARK-9715] [ML] Store numFeatures in all ML PredictionModel typessethah2015-09-2325-50/+130
| | | | | | | | All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility. Author: sethah <seth.hendrickson16@gmail.com> Closes #8675 from sethah/SPARK-9715.
* [SPARK-10403] Allow UnsafeRowSerializer to work with tungsten-sort ↵Josh Rosen2015-09-232-18/+27
| | | | | | | | | | ShuffleManager This patch attempts to fix an issue where Spark SQL's UnsafeRowSerializer was incompatible with the `tungsten-sort` ShuffleManager. Author: Josh Rosen <joshrosen@databricks.com> Closes #8873 from JoshRosen/SPARK-10403.
* [SPARK-10721] Log warning when file deletion failstedyu2015-09-2315-27/+80
| | | | | | Author: tedyu <yuzhihong@gmail.com> Closes #8843 from tedyu/master.
* [SPARK-10769] [STREAMING] [TESTS] Fix ↵zsxwing2015-09-231-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | o.a.s.streaming.CheckpointSuite.maintains rate controller Fixed the following failure in https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1787/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_maintains_rate_controller/ ``` sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 660 times over 10.000044392000001 seconds. Last failure message: 9223372036854775807 did not equal 200. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply$mcV$sp(CheckpointSuite.scala:413) at org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396) at org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ``` In this test, it calls `advanceTimeWithRealDelay(ssc, 2)` to run two batch jobs. However, one race condition is these two jobs can finish before the receiver is registered. Then `UpdateRateLimit` won't be sent to the receiver and `getDefaultBlockGeneratorRateLimit` cannot be updated. Here are the logs related to this issue: ``` 15/09/22 19:28:26.154 pool-1-thread-1-ScalaTest-running-CheckpointSuite INFO CheckpointSuite: Manual clock before advancing = 2500 15/09/22 19:28:26.869 JobScheduler INFO JobScheduler: Finished job streaming job 3000 ms.0 from job set of time 3000 ms 15/09/22 19:28:26.869 JobScheduler INFO JobScheduler: Total delay: 1442975303.869 s for time 3000 ms (execution: 0.711 s) 15/09/22 19:28:26.873 JobScheduler INFO JobScheduler: Finished job streaming job 3500 ms.0 from job set of time 3500 ms 15/09/22 19:28:26.873 JobScheduler INFO JobScheduler: Total delay: 1442975303.373 s for time 3500 ms (execution: 0.004 s) 15/09/22 19:28:26.879 sparkDriver-akka.actor.default-dispatcher-3 INFO ReceiverTracker: Registered receiver for stream 0 from localhost:57749 15/09/22 19:28:27.154 pool-1-thread-1-ScalaTest-running-CheckpointSuite INFO CheckpointSuite: Manual clock after advancing = 3500 ``` `advanceTimeWithRealDelay(ssc, 2)` triggered job 3000ms and 3500ms but the receiver was registered after job 3000ms and 3500ms finished. So we should make sure the receiver online before running `advanceTimeWithRealDelay(ssc, 2)`. Author: zsxwing <zsxwing@gmail.com> Closes #8877 from zsxwing/SPARK-10769.
* [SPARK-10224] [STREAMING] Fix the issue that blockIntervalTimer won't call ↵zsxwing2015-09-233-9/+100
| | | | | | | | | | | | | | | | | updateCurrentBuffer when stopping `blockIntervalTimer.stop(interruptTimer = false)` doesn't guarantee calling `updateCurrentBuffer`. So it's possible that `blockIntervalTimer` will exit when `updateCurrentBuffer` is not empty. Then the data in `currentBuffer` will be lost. To reproduce it, you can add `Thread.sleep(200)` in this line (https://github.com/apache/spark/blob/69c9c177160e32a2fbc9b36ecc52156077fca6fc/streaming/src/main/scala/org/apache/spark/streaming/util/RecurringTimer.scala#L100) and run `StreamingContexSuite`. I cannot write a unit test to reproduce it because I cannot find an approach to force `RecurringTimer` suspend at this line for a few milliseconds. There was a failure in Jenkins here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41455/console This PR updates RecurringTimer to make sure `stop(interruptTimer = false)` will call `callback` at least once after the `stop` method is called. Author: zsxwing <zsxwing@gmail.com> Closes #8417 from zsxwing/SPARK-10224.
* [SPARK-10652] [SPARK-10742] [STREAMING] Set meaningful job descriptions for ↵Tathagata Das2015-09-229-29/+179
| | | | | | | | | | | | | | | | | | | | | | all streaming jobs Here is the screenshot after adding the job descriptions to threads that run receivers and the scheduler thread running the batch jobs. ## All jobs page * Added job descriptions with links to relevant batch details page ![image](https://cloud.githubusercontent.com/assets/663212/9924165/cda4a372-5cb1-11e5-91ca-d43a32c699e9.png) ## All stages page * Added stage descriptions with links to relevant batch details page ![image](https://cloud.githubusercontent.com/assets/663212/9923814/2cce266a-5cae-11e5-8a3f-dad84d06c50e.png) ## Streaming batch details page * Added the +details link ![image](https://cloud.githubusercontent.com/assets/663212/9921977/24014a32-5c98-11e5-958e-457b6c38065b.png) Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8791 from tdas/SPARK-10652.
* [SPARK-10663] Removed unnecessary invocation of DataFrame.toDF method.Matt Hagen2015-09-221-1/+1
| | | | | | | | | | | | The Scala example under the "Example: Pipeline" heading in this document initializes the "test" variable to a DataFrame. Because test is already a DF, there is not need to call test.toDF as the example does in a subsequent line: model.transform(test.toDF). So, I removed the extraneous toDF invocation. Author: Matt Hagen <anonz3000@gmail.com> Closes #8875 from hagenhaus/SPARK-10663.