Commit message | Author | Date | Files | Lines (-/+)
* Fix indentation for the previous patch. (Reynold Xin, 2016-01-07; 1 file, -10/+8)
* [SPARK-12317][SQL] Support units (m, k, g) in SQLConf (Kevin Yu, 2016-01-07; 2 files, -1/+60)
  This PR continues the previously closed PR 10314. With this change, SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE accepts memory-string conventions as input, so the user can now specify, for example, 10g for SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE in SQLConf. marmbrus srowen: can you help review these code changes? Thanks. Author: Kevin Yu <qyu@us.ibm.com> Closes #10629 from kevinyu98/spark-12317.
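  A minimal sketch of what this enables, assuming the key behind SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE is spark.sql.adaptive.shuffle.targetPostShuffleInputSize (the key string is an assumption, not stated in this log):

  ```scala
  import org.apache.spark.sql.SQLContext

  // Sketch only: the config key string below is assumed. Before this change the
  // value had to be a plain byte count; with it, a memory string such as "10g" works too.
  def setTargetPostShuffleSize(sqlContext: SQLContext): Unit = {
    sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "10g")
  }
  ```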
* [SPARK-12591][STREAMING] Register OpenHashMapBasedStateMap for Kryo (Shixiong Zhu, 2016-01-07; 5 files, -41/+174)
  The default serializer in Kryo is FieldSerializer and it ignores transient fields and never calls `writeObject` or `readObject`. So we should register OpenHashMapBasedStateMap using `DefaultSerializer` to make it work with Kryo. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10609 from zsxwing/SPARK-12591.
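  A minimal sketch of the Kryo mechanism being relied on here; the class below is a made-up stand-in, not Spark's actual OpenHashMapBasedStateMap:

  ```scala
  import java.io.{ObjectInputStream, ObjectOutputStream}
  import com.esotericsoftware.kryo.DefaultSerializer
  import com.esotericsoftware.kryo.serializers.JavaSerializer

  // Annotating a class with Kryo's DefaultSerializer makes Kryo delegate to the
  // given serializer. JavaSerializer honours @transient fields and the
  // writeObject/readObject hooks, unlike Kryo's default FieldSerializer.
  @DefaultSerializer(classOf[JavaSerializer])
  class CustomStateMap extends Serializable {
    @transient private var cache: Map[String, Int] = Map.empty

    private def writeObject(out: ObjectOutputStream): Unit = {
      out.defaultWriteObject() // custom hook is now invoked even under Kryo
    }

    private def readObject(in: ObjectInputStream): Unit = {
      in.defaultReadObject()
      cache = Map.empty // rebuild transient state after deserialization
    }
  }
  ```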
* [SPARK-12507][STREAMING][DOCUMENT] Expose closeFileAfterWrite and allowBatching configurations for Streaming (Shixiong Zhu, 2016-01-07; 2 files, -7/+23)
  /cc tdas brkyvz Author: Shixiong Zhu <shixiong@databricks.com> Closes #10453 from zsxwing/streaming-conf.
* [SPARK-12604][CORE] Addendum - use casting vs mapValues for countBy{Key,Value} (Sean Owen, 2016-01-07; 2 files, -2/+2)
  Per rxin, let's use casting for countByKey and countByValue as well. Let's see if this passes. Author: Sean Owen <sowen@cloudera.com> Closes #10641 from srowen/SPARK-12604.2.
* [SPARK-12510][STREAMING] Refactor ActorReceiver to support Java (Shixiong Zhu, 2016-01-07; 6 files, -20/+202)
  This PR includes the following changes:
  1. Rename `ActorReceiver` to `ActorReceiverSupervisor`
  2. Remove `ActorHelper`
  3. Add a new `ActorReceiver` for Scala and `JavaActorReceiver` for Java
  4. Add a `JavaActorWordCount` example
  Author: Shixiong Zhu <shixiong@databricks.com> Closes #10457 from zsxwing/java-actor-stream.
* [SPARK-12580][SQL] Remove string concatenations from usage and extended in @ExpressionDescription (Kazuaki Ishizaki, 2016-01-07; 2 files, -25/+25)
  Use multi-line string literals for ExpressionDescription, with ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit``. The policy is described at https://github.com/apache/spark/pull/10488: let's use multi-line string literals, and if we have to have a line with more than 100 characters, use ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit`` to bypass the line length check. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #10524 from kiszk/SPARK-12580.
* [SPARK-12598][CORE] bug in setMinPartitions (Darek Blasiak, 2016-01-07; 1 file, -3/+2)
  There is a bug in the calculation of ```maxSplitSize```. The ```totalLen``` should be divided by ```minPartitions``` and not by ```files.size```. Author: Darek Blasiak <darek.blasiak@640labs.com> Closes #10546 from datafarmer/setminpartitionsbug.
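  A minimal sketch of the corrected computation described above; the method and variable names are assumptions for illustration, not the actual patched code:

  ```scala
  // Sketch only: the target split size should be derived from the requested
  // minimum number of partitions, not from the number of input files.
  def maxSplitSize(totalLen: Long, minPartitions: Int): Long = {
    val divisor = math.max(minPartitions, 1) // guard against division by zero
    math.ceil(totalLen * 1.0 / divisor).toLong
  }
  ```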
* [STREAMING][MINOR] More contextual information in logs + minor code improvements (Jacek Laskowski, 2016-01-07; 14 files, -74/+69)
  Please review and merge at your convenience. Thanks! Author: Jacek Laskowski <jacek@japila.pl> Closes #10595 from jaceklaskowski/streaming-minor-fixes.
* [MINOR] Fix for BUILD FAILURE for Scala 2.11 (Jacek Laskowski, 2016-01-07; 1 file, -18/+1)
  It was introduced in 917d3fc069fb9ea1c1487119c9c12b373f4f9b77. /cc cloud-fan rxin Author: Jacek Laskowski <jacek@japila.pl> Closes #10636 from jaceklaskowski/fix-for-build-failure-2.11.
* [SPARK-12662][SQL] Fix DataFrame.randomSplit to avoid creating overlapping splits (Sameer Agarwal, 2016-01-07; 2 files, -1/+28)
  https://issues.apache.org/jira/browse/SPARK-12662 cc yhuai Author: Sameer Agarwal <sameer@databricks.com> Closes #10626 from sameeragarwal/randomsplit.
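  An illustrative usage sketch (not taken from the patch): after this fix, the splits returned for the same DataFrame should be disjoint.

  ```scala
  import org.apache.spark.sql.DataFrame

  // Sketch only: split an existing DataFrame into non-overlapping train/test sets.
  def splitTrainTest(df: DataFrame): (DataFrame, DataFrame) = {
    val Array(train, test) = df.randomSplit(Array(0.7, 0.3), seed = 42L)
    (train, test)
  }
  ```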
* [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None (zero323, 2016-01-07; 2 files, -1/+13)
  If the initial model passed to GMM is not empty, it causes a net.razorvine.pickle.PickleException. It can be fixed by converting initialModel.weights to a list. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #10644 from zero323/SPARK-12006.
* [STREAMING][DOCS][EXAMPLES] Minor fixes (Jacek Laskowski, 2016-01-07; 2 files, -10/+8)
  Author: Jacek Laskowski <jacek@japila.pl> Closes #10603 from jaceklaskowski/streaming-actor-custom-receiver.
* [SPARK-12542][SQL] support except/intersect in HiveQl (Davies Liu, 2016-01-06; 5 files, -5/+65)
  Parse SQL queries with except/intersect in the FROM clause for HiveQL. Author: Davies Liu <davies@databricks.com> Closes #10622 from davies/intersect.
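  A sketch of the query shapes this change lets the HiveQL path parse; table and column names are placeholders:

  ```scala
  import org.apache.spark.sql.hive.HiveContext

  // Sketch only: set-operation queries that should now go through the HiveQL parser.
  def setOperationQueries(hiveContext: HiveContext): Unit = {
    hiveContext.sql("SELECT id FROM t1 INTERSECT SELECT id FROM t2").show()
    hiveContext.sql("SELECT id FROM t1 EXCEPT SELECT id FROM t2").show()
  }
  ```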
* [SPARK-12295] [SQL] external spilling for window functions (Davies Liu, 2016-01-06; 6 files, -94/+276)
  This PR manages the memory used by window functions (buffered rows) and also enables external spilling. After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1G of memory. Author: Davies Liu <davies@databricks.com> Closes #10605 from davies/unsafe_window.
* [DOC] fix 'spark.memory.offHeap.enabled' default value to false (zzcclp, 2016-01-06; 1 file, -1/+1)
  Change the documented default value of 'spark.memory.offHeap.enabled' to false. Author: zzcclp <xm_zzc@sina.com> Closes #10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.
* Revert "[SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None" (Yin Huai, 2016-01-06; 2 files, -13/+1)
  This reverts commit fcd013cf70e7890aa25a8fe3cb6c8b36bf0e1f04. Author: Yin Huai <yhuai@databricks.com> Closes #10632 from yhuai/pythonStyle.
* [SPARK-12678][CORE] MapPartitionsRDD clearDependencies (Guillaume Poulin, 2016-01-06; 1 file, -1/+6)
  MapPartitionsRDD was keeping a reference to `prev` after a call to `clearDependencies`, which could lead to a memory leak. Author: Guillaume Poulin <poulin.guillaume@gmail.com> Closes #10623 from gpoulin/map_partition_deps.
* [SPARK-12673][UI] Add missing uri prepending for job description (jerryshao, 2016-01-06; 1 file, -3/+3)
  Otherwise the URL fails to proxy to the right one in YARN mode. Here is the screenshot: ![screen shot 2016-01-06 at 5 28 26 pm](https://cloud.githubusercontent.com/assets/850797/12139632/bbe78ecc-b49c-11e5-8932-94e8b3622a09.png) Author: jerryshao <sshao@hortonworks.com> Closes #10618 from jerryshao/SPARK-12673.
* [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0 (Josh Rosen, 2016-01-06; 12 files, -589/+48)
  This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code. Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard-to-diagnose bugs. For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads. Author: Josh Rosen <joshrosen@databricks.com> Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
* [SPARK-12663][MLLIB] More informative error message in MLUtils.loadLibSVMFile (Robert Dodier, 2016-01-06; 1 file, -1/+2)
  This PR contains 1 commit which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663). For the record, I got a positive response from 2 people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html) Author: Robert Dodier <robert_dodier@users.sourceforge.net> Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.
* [SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks. (Nong Li, 2016-01-06; 2 files, -0/+278)
  We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this, and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do that. Author: Nong Li <nong@databricks.com> Author: Nong <nongli@gmail.com> Closes #10589 from nongli/spark-12640.
* [SPARK-12604][CORE] Java count(ApproxDistinct)ByKey methods return Scala Long not Java (Sean Owen, 2016-01-06; 5 files, -42/+49)
  Change the Java countByKey and countApproxDistinctByKey return types to use Java Long, not Scala; update similar methods for consistency on java.lang.Long.valueOf, with no API change. Author: Sean Owen <sowen@cloudera.com> Closes #10554 from srowen/SPARK-12604.
* [SPARK-12539][SQL] support writing bucketed table (Wenchen Fan, 2016-01-06; 18 files, -117/+626)
  This PR adds bucket write support to Spark SQL. Users can specify bucketing columns, numBuckets and sorting columns, with or without partition columns. For example:
  ```
  df.write.partitionBy("year").bucketBy(8, "country").sortBy("amount").saveAsTable("sales")
  ```
  When bucketing is used, we calculate a bucket id for each record and group the records by bucket id. For each group, we create a file with the bucket id in its name and write the data into it. For each bucket file, if sorting columns are specified, the data is sorted before writing. Note that there may be multiple files for one bucket, as the data is distributed. Currently we store the bucket metadata in the Hive metastore in a non-Hive-compatible way; we use a different bucketing hash function than Hive, so we can't be compatible anyway. Limitations:
  * Can't write bucketed data without a Hive metastore.
  * Can't insert bucketed data into existing Hive tables.
  Author: Wenchen Fan <wenchen@databricks.com> Closes #10498 from cloud-fan/bucket-write.
* [SPARK-12681] [SQL] split IdentifiersParser.g into two files (Davies Liu, 2016-01-06; 3 files, -516/+566)
  To avoid having a huge generated Java source (over 64K LOC) that can't be compiled. cc hvanhovell Author: Davies Liu <davies@databricks.com> Closes #10624 from davies/split_ident.
* Revert "[SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root path to gain the streaming batch url." (Shixiong Zhu, 2016-01-06; 1 file, -3/+2)
  This reverts commit 19e4e9febf9bb4fd69f6d7bc13a54844e4e096f1. Will merge #10618 instead.
* [SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root path to gain the streaming batch url. (huangzhaowei, 2016-01-06; 1 file, -2/+3)
  Author: huangzhaowei <carlmartinmax@gmail.com> Closes #10617 from SaintBacchus/SPARK-12672.
* [SPARK-12617][PYSPARK] Move Py4jCallbackConnectionCleaner to Streaming (Shixiong Zhu, 2016-01-06; 2 files, -61/+63)
  Move Py4jCallbackConnectionCleaner to Streaming because the callback server starts only in StreamingContext. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10621 from zsxwing/SPARK-12617-2.
* [SPARK-12368][ML][DOC] Better doc for the binary classification evaluator's metricName (BenFradet, 2016-01-06; 2 files, -4/+3)
  For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC". Also, the documentation says: "The default metric used to choose the best ParamMap can be overridden by the setMetric method in each of these evaluators." However, the method is called setMetricName. This PR aims to fix both issues. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10328 from BenFradet/SPARK-12368.
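  A small usage sketch of the evaluator in question (column names are placeholders):

  ```scala
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

  // Sketch only: "areaUnderROC" is the default metric, "areaUnderPR" is also
  // supported, and the switch is setMetricName (there is no setMetric method).
  val evaluator = new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("rawPrediction")
    .setMetricName("areaUnderPR")
  ```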
* [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None (zero323, 2016-01-06; 2 files, -1/+13)
  If the initial model passed to GMM is not empty, it causes a `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to a `list`. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9986 from zero323/SPARK-12006.
* [SPARK-12573][SPARK-12574][SQL] Move SQL Parser from Hive to Catalyst (Herman van Hovell, 2016-01-06; 29 files, -2806/+2107)
  This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made: the ANTLR parser & supporting classes have been moved to the Catalyst project and are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project; I have added acknowledgements wherever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean up the ```ASTNode``` class and to improve the error handling. The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project:
  - ```CatalystQl```: implements Query and Expression parsing functionality.
  - ```SparkQl```: a subclass of ```CatalystQl``` that provides SQL/Core-only functionality such as Explain and Describe.
  - ```HiveQl```: a subclass of ```SparkQl``` that adds Hive-only functionality to the parser, such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive.
  cc rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10583 from hvanhovell/SPARK-12575.
* [SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed (Yanbo Liang, 2016-01-06; 2 files, -10/+17)
  PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed``` like we do on the Scala side. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9807 from yanboliang/spark-11815.
* [SPARK-11945][ML][PYSPARK] Add computeCost to KMeansModel for PySpark spark.ml (Yanbo Liang, 2016-01-06; 1 file, -0/+10)
  Add ```computeCost``` to ```KMeansModel``` as an evaluator for PySpark spark.ml. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9931 from yanboliang/SPARK-11945.
* [SPARK-11531][ML] SparseVector error Msg (Joshi, 2016-01-06; 1 file, -1/+3)
  PySpark SparseVector should have a "Found duplicate indices" error message. Author: Joshi <rekhajoshm@gmail.com> Author: Rekha Joshi <rekhajoshm@gmail.com> Closes #9525 from rekhajoshm/SPARK-11531.
* [SPARK-7675][ML][PYSPARK] sparkml params type conversion (Holden Karau, 2016-01-06; 4 files, -103/+148)
  From JIRA: Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before the Python wrappers call Java's Params setter method. A possible fix would be to add a method "_checkType" to PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available. This fix instead checks the types at set time, since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float; other conversions (like scipy matrix to array) are left for the future. Author: Holden Karau <holden@us.ibm.com> Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
* [SPARK-11878][SQL] Eliminate distribute by in case group by is present with exactly the same grouping expressions (Yash Datta, 2016-01-06; 2 files, -0/+45)
  For queries like `select <> from table group by a distribute by a`, we can eliminate the distribute by, since the group by will do a hash partitioning anyway. This is also applicable when the user uses the DataFrame API. Author: Yash Datta <Yash.Datta@guavus.com> Closes #9858 from saucam/eliminatedistribute.
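  A sketch of the redundant pattern (table and column names are placeholders); the DISTRIBUTE BY below adds nothing because GROUP BY a already hash-partitions on a:

  ```scala
  import org.apache.spark.sql.hive.HiveContext

  // Sketch only: after this change the planner can drop the exchange introduced by
  // DISTRIBUTE BY when it matches the GROUP BY expressions exactly.
  def redundantDistributeBy(hiveContext: HiveContext): Unit = {
    hiveContext.sql("SELECT a, count(*) AS cnt FROM events GROUP BY a DISTRIBUTE BY a").explain()
  }
  ```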
* [SPARK-12665][CORE][GRAPHX] Remove Vector, VectorSuite and GraphKryoRegistrator which are deprecated and no longer used (Kousuke Saruta, 2016-01-06; 4 files, -253/+7)
  The whole of Vector.scala, VectorSuite.scala and GraphKryoRegistrator.scala is no longer used, so it's time to remove them in Spark 2.0. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10613 from sarutak/SPARK-12665.
* [SPARK-12340][SQL] fix Int overflow in the SparkPlan.executeTake, RDD.take and AsyncRDDActions.takeAsync (QiangCai, 2016-01-06; 4 files, -14/+26)
  I have closed pull request https://github.com/apache/spark/pull/10487 and created this pull request to resolve the problem. Spark JIRA: https://issues.apache.org/jira/browse/SPARK-12340 Author: QiangCai <david.caiq@gmail.com> Closes #10562 from QiangCai/bugfix.
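  A hedged sketch of the general pattern such a fix follows (variable names assumed, not the actual patch): do the partition-count scaling in Long arithmetic so large values cannot wrap around Int.MaxValue.

  ```scala
  // Sketch only: scale the number of partitions to scan using Long arithmetic,
  // then clamp to the total before converting back to Int.
  def nextPartsToTry(partsScanned: Int, totalParts: Int): Int = {
    val candidate = partsScanned.toLong * 4L
    math.min(candidate, totalParts.toLong).toInt
  }
  ```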
* [SPARK-12578][SQL] Distinct should not be silently ignored when used in an aggregate function with OVER clause (Liang-Chi Hsieh, 2016-01-06; 2 files, -1/+22)
  JIRA: https://issues.apache.org/jira/browse/SPARK-12578 A slight update to the Hive parser: we should keep the distinct keyword when it is used in an aggregate function with an OVER clause, so that CheckAnalysis will detect it and throw an exception later. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10557 from viirya/keep-distinct-hivesql.
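  An illustrative query shape (table and column names are placeholders): before this change the DISTINCT was silently dropped; afterwards analysis should reject it explicitly.

  ```scala
  import org.apache.spark.sql.hive.HiveContext

  // Sketch only: the DISTINCT inside a windowed aggregate is now kept by the parser,
  // so CheckAnalysis can raise an error instead of silently ignoring it.
  def distinctInWindow(hiveContext: HiveContext): Unit = {
    hiveContext.sql("SELECT COUNT(DISTINCT a) OVER (PARTITION BY b) FROM t")
  }
  ```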
* [SPARK-12393][SPARKR] Add read.text and write.text for SparkR (Yanbo Liang, 2016-01-06; 5 files, -1/+82)
  Add ```read.text``` and ```write.text``` for SparkR. cc sun-rui felixcheung shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10348 from yanboliang/spark-12393.
* [SPARK-3873][TESTS] Import ordering fixes. (Marcelo Vanzin, 2016-01-05; 281 files, -575/+517)
  Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10582 from vanzin/SPARK-3873-tests.
* [SPARK-3873][CORE] Import ordering fixes. (Marcelo Vanzin, 2016-01-05; 158 files, -250/+246)
  Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10578 from vanzin/SPARK-3873-core.
* [SPARK-12659] fix NPE in UnsafeExternalSorter (used by cartesian product) (Davies Liu, 2016-01-05; 4 files, -11/+45)
  Cartesian product uses UnsafeExternalSorter without a comparator to do spilling, and it will NPE if spilling happens. This bug was also hit by #10605. cc JoshRosen Author: Davies Liu <davies@databricks.com> Closes #10606 from davies/fix_spilling.
* [SPARK-12504][SQL] Masking credentials in the sql plan explain output for JDBC data sources. (sureshthalamati, 2016-01-05; 2 files, -0/+27)
  This fix masks JDBC credentials in the explain output. URL patterns for specifying credentials seem to vary between databases, so a new method was added to the dialect to mask the credentials according to the database-specific URL pattern. While adding tests I noticed that the explain output includes the array variable for partitions ([Lorg.apache.spark.Partition;3ff74546,); the code was modified to include the first and last partition information instead. Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #10452 from sureshthalamati/mask_jdbc_credentials_spark-12504.
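  An illustrative sketch of the scenario (the connection URL, table name and credentials below are made up): with this fix, credentials embedded in the JDBC URL should no longer show up in plain text in the physical plan.

  ```scala
  import java.util.Properties
  import org.apache.spark.sql.SQLContext

  // Sketch only: read from a JDBC source and print the plan; the user/password in
  // the URL should appear masked in the explain output after this change.
  def explainJdbcRead(sqlContext: SQLContext): Unit = {
    val url = "jdbc:postgresql://dbhost:5432/sales?user=admin&password=secret"
    val df = sqlContext.read.jdbc(url, "orders", new Properties())
    df.explain()
  }
  ```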
* [SPARK-3873][SQL] Import ordering fixes. (Marcelo Vanzin, 2016-01-05; 164 files, -318/+301)
  Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10573 from vanzin/SPARK-3873-sql.
* [SPARK-12041][ML][PYSPARK] Add columnSimilarities to IndexedRowMatrix (Kai Jiang, 2016-01-05; 1 file, -0/+14)
  Add `columnSimilarities` to IndexedRowMatrix for PySpark spark.mllib.linalg. Author: Kai Jiang <jiangkai@gmail.com> Closes #10158 from vectorijk/spark-12041.
* [SPARK-12453][STREAMING] Remove explicit dependency on aws-java-sdk (BrianLondon, 2016-01-05; 3 files, -6/+1)
  Successfully ran the Kinesis demo on a live, AWS-hosted Kinesis stream against the master and 1.6 branches. For reasons I don't entirely understand, it required a manual merge to 1.5, which I did as shown here: https://github.com/BrianLondon/spark/commit/075c22e89bc99d5e99be21f40e0d72154a1e23a2 The demo ran successfully on the 1.5 branch as well. According to `mvn dependency:tree` it is still pulling a fairly old version of the aws-java-sdk (1.9.37), but this appears to have fixed the Kinesis regression in 1.5.2. Author: BrianLondon <brian@seatgeek.com> Closes #10492 from BrianLondon/remove-only.
* [SPARK-12450][MLLIB] Un-persist broadcasted variables in KMeans (RJ Nowling, 2016-01-05; 1 file, -0/+8)
  SPARK-12450: un-persist broadcasted variables in KMeans. Author: RJ Nowling <rnowling@gmail.com> Closes #10415 from rnowling/spark-12450.
* [SPARK-12570][ML][DOC] DecisionTreeRegressor: provide variance of prediction: user guide update (Yanbo Liang, 2016-01-05; 1 file, -1/+10)
  Update the user guide doc for ```DecisionTreeRegressor``` providing variance of prediction. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #10594 from yanboliang/spark-12570.
* [SPARK-12511] [PYSPARK] [STREAMING] Make sure PythonDStream.registerSerializer is called only once (Shixiong Zhu, 2016-01-05; 3 files, -10/+38)
  There is an issue where Py4J's PythonProxyHandler.finalize blocks forever (https://github.com/bartdag/py4j/pull/184). Py4J creates a PythonProxyHandler in Java for "transformer_serializer" when "registerSerializer" is called. If we call "registerSerializer" twice, the second PythonProxyHandler will override the first one; the first one will then be GCed and trigger "PythonProxyHandler.finalize". To avoid that, we should not call "registerSerializer" more than once, so that the "PythonProxyHandler" on the Java side won't be GCed. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10514 from zsxwing/SPARK-12511.