| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
parameters
The query `SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1` throws an exception. Some Hive generic UDFs/UDAFs require the input object inspector to be a `ConstantObjectInspector`, but we cannot get one until expression optimization (constant folding) has run.
This PR is a workaround for that. (Ideally, the `output` of a LogicalPlan should be identical before and after optimization.)
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3109 from chenghao-intel/optimized and squashes the following commits:
487ff79 [Cheng Hao] rebase to the latest master & update the unittest
(cherry picked from commit 84d79ee9ec47465269f7b0a7971176da93c96f3f)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
RDDSampler tries to use numpy to speed up poisson(), but the pure Python implementation of poisson() only calls random() about (1 + fraction) * N times, so numpy brings little performance gain.
numpy is not a dependency of pyspark, so relying on it can cause problems, for example when numpy is installed on the master but not on the slaves, as reported in SPARK-927.
It also complicates the code a lot, so we should remove numpy from RDDSampler.
I also ran a benchmark to verify that:
```
>>> from pyspark.mllib.random import RandomRDDs
>>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
>>> rdd.count() # cache it
>>> rdd.sample(True, 0.9).count() # measure this line
```
The results:
| withReplacement | random | numpy.random |
| --------------- | ------ | ------------ |
| True            | 1.5 s  | 1.4 s        |
| False           | 0.6 s  | 0.8 s        |
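The call-count argument can be illustrated with a minimal sketch, assuming a Knuth-style multiplication method for the pure Python poisson(). The actual RDDSampler code may differ, and `CountingRandom` and `poisson` are illustrative names rather than pyspark APIs. Each draw consumes one uniform number plus one more per unit of the returned count, so N draws with mean `fraction` cost about (1 + fraction) * N calls to random().

```python
import math
import random


class CountingRandom:
    """Wraps random.Random and counts calls to random() (illustrative helper)."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self.calls = 0

    def random(self):
        self.calls += 1
        return self._rng.random()


def poisson(mean, rng):
    """Draw one Poisson(mean) sample by multiplying uniforms until the
    running product drops to exp(-mean) or below."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


if __name__ == "__main__":
    rng = CountingRandom(seed=1)
    n = 100000
    total = sum(poisson(0.9, rng) for _ in range(n))
    # Calls to random() equal n plus the sum of all drawn counts.
    print(rng.calls, n + total)
```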
closes #2313
Note: this patch includes some commits that are not yet mirrored to GitHub; it will be OK once the mirror catches up.
Author: Davies Liu <davies@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #3351 from davies/numpy and squashes the following commits:
5c438d7 [Davies Liu] fix comment
c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
98eb31b [Xiangrui Meng] make poisson sampling slightly faster
ee17d78 [Davies Liu] remove = for float
13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
f583023 [Davies Liu] fix tests
51649f5 [Davies Liu] remove numpy in RDDSampler
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
(cherry picked from commit d39f2e9c683a4ab78b29eb3c5668325bf8568e8c)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Sample code in the description of SchemaRDD.where is not correct
Author: Jacky Li <jacky.likun@gmail.com>
Closes #3344 from jackylk/patch-6 and squashes the following commits:
62cd126 [Jacky Li] [SQL] fix function description mistake
(cherry picked from commit ad5f1f3ca240473261162c06ffc5aa70d15a5991)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Hive supports `EXPLAIN` on CTAS statements. Spark SQL supported this previously, but it seems to have been reverted by the code refactoring in HiveQL.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3357 from chenghao-intel/explain and squashes the following commits:
7aace63 [Cheng Hao] Support the CTAS in EXPLAIN command
(cherry picked from commit 6aa0fc9f4d95f09383cbcb5f79166c60697e6683)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Executing sum distinct for empty table throws `java.lang.UnsupportedOperationException: empty.reduceLeft`.
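The failure is the usual one of reducing an empty collection without an initial value. The Python sketch below is an analogy, not the Spark SQL code: `sum_distinct` is a hypothetical helper, and it assumes the usual SQL convention that a sum over no rows is `NULL` (represented here by `None`).

```python
from functools import reduce
import operator


def sum_distinct(values):
    """Sum the distinct non-NULL values; return None (SQL NULL) for empty input."""
    distinct = {v for v in values if v is not None}
    if not distinct:
        # Guard the empty case instead of letting reduce() raise.
        return None
    return reduce(operator.add, distinct)


if __name__ == "__main__":
    print(sum_distinct([1, 1, 2, None]))
    print(sum_distinct([]))
```

Without the guard, `reduce(operator.add, [])` raises `TypeError`, which is the Python counterpart of `empty.reduceLeft`.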
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #3184 from ueshin/issues/SPARK-4318 and squashes the following commits:
8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318
66fdb0a [Takuya UESHIN] Re-refine aggregate functions.
6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate.
d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate.
1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions.
917e533 [Takuya UESHIN] Use aggregate instead of groupBy().
1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation.
a5a57d2 [Takuya UESHIN] Fix empty Average.
22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct.
65b7dd2 [Takuya UESHIN] Fix empty sum distinct.
(cherry picked from commit 2c2e7a44db2ebe44121226f3eac924a0668b991a)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The relational operator '<=>' does not work in Spark SQL, although it works in Spark HiveQL.
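For reference, a small sketch of the semantics usually attached to `<=>` (null-safe equality), with `None` standing in for SQL `NULL`. The function names are illustrative and this is not the parser change made by the patch.

```python
def sql_eq(a, b):
    """Ordinary `=`: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b):
    """`<=>`: never returns NULL; two NULLs compare equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


if __name__ == "__main__":
    for pair in [(1, 1), (1, 2), (1, None), (None, None)]:
        print(pair, sql_eq(*pair), null_safe_eq(*pair))
```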
Author: ravipesala <ravindra.pesala@huawei.com>
Closes #3387 from ravipesala/<=> and squashes the following commits:
7198e90 [ravipesala] Supporting relational operator '<=>' in Spark SQL
(cherry picked from commit 98e9419784a9ad5096cfd563fa9a433786a90bd4)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
```
class RandomForestModel
| A model trained by RandomForest
|
| numTrees(self)
| Get number of trees in forest.
|
| predict(self, x)
| Predict values for a single data point or an RDD of points using the model trained.
|
| toDebugString(self)
| Full model
|
| totalNumNodes(self)
| Get total number of nodes, summed over all trees in the forest.
|
class RandomForest
| trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None):
| Method to train a decision tree model for binary or multiclass classification.
|
| :param data: Training dataset: RDD of LabeledPoint.
| Labels should take values {0, 1, ..., numClasses-1}.
| :param numClassesForClassification: number of classes for classification.
| :param categoricalFeaturesInfo: Map storing arity of categorical features.
| E.g., an entry (n -> k) indicates that feature n is categorical
| with k categories indexed from 0: {0, 1, ..., k-1}.
| :param numTrees: Number of trees in the random forest.
| :param featureSubsetStrategy: Number of features to consider for splits at each node.
| Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
| If "auto" is set, this parameter is set based on numTrees:
| if numTrees == 1, set to "all";
| if numTrees > 1 (forest) set to "sqrt".
| :param impurity: Criterion used for information gain calculation.
| Supported values: "gini" (recommended) or "entropy".
| :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
| 1 internal node + 2 leaf nodes. (default: 4)
| :param maxBins: maximum number of bins used for splitting features (default: 32)
| :param seed: Random seed for bootstrapping and choosing feature subsets.
| :return: RandomForestModel that can be used for prediction
|
| trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None):
| Method to train a decision tree model for regression.
|
| :param data: Training dataset: RDD of LabeledPoint.
| Labels are real numbers.
| :param categoricalFeaturesInfo: Map storing arity of categorical features.
| E.g., an entry (n -> k) indicates that feature n is categorical
| with k categories indexed from 0: {0, 1, ..., k-1}.
| :param numTrees: Number of trees in the random forest.
| :param featureSubsetStrategy: Number of features to consider for splits at each node.
| Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
| If "auto" is set, this parameter is set based on numTrees:
| if numTrees == 1, set to "all";
| if numTrees > 1 (forest) set to "onethird".
| :param impurity: Criterion used for information gain calculation.
| Supported values: "variance".
| :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
| 1 internal node + 2 leaf nodes. (default: 4)
| :param maxBins: maximum number of bins used for splitting features (default: 32)
| :param seed: Random seed for bootstrapping and choosing feature subsets.
| :return: RandomForestModel that can be used for prediction
|
```
Author: Davies Liu <davies@databricks.com>
Closes #3320 from davies/forest and squashes the following commits:
8003dfc [Davies Liu] reorder
53cf510 [Davies Liu] fix docs
4ca593d [Davies Liu] fix docs
e0df852 [Davies Liu] fix docs
0431746 [Davies Liu] rebased
2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest
885abee [Davies Liu] address comments
dae7fc0 [Davies Liu] address comments
89a000f [Davies Liu] fix docs
565d476 [Davies Liu] add python api for random forest
(cherry picked from commit 1c53a5db993193122bfa79574d2540149fe2cc08)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here's a simple fix for SchemaRDD to JSON.
Author: Dan McClary <dan.mcclary@gmail.com>
Closes #3213 from dwmclary/SPARK-4228 and squashes the following commits:
d714e1d [Dan McClary] fixed PEP 8 error
cac2879 [Dan McClary] move pyspark comment and doctest to correct location
f9471d3 [Dan McClary] added pyspark doc and doctest
6598cee [Dan McClary] adding complex type queries
1a5fd30 [Dan McClary] removing SPARK-4228 from SQLQuerySuite
4a651f0 [Dan McClary] cleaned PEP and Scala style failures. Moved tests to JsonSuite
47ceff6 [Dan McClary] cleaned up scala style issues
2ee1e70 [Dan McClary] moved rowToJSON to JsonRDD
4387dd5 [Dan McClary] Added UserDefinedType, cleaned up case formatting
8f7bfb6 [Dan McClary] Map type added to SchemaRDD.toJSON
1b11980 [Dan McClary] Map and UserDefinedTypes partially done
11d2016 [Dan McClary] formatting and unicode deserialization default fixed
6af72d1 [Dan McClary] deleted extaneous comment
4d11c0c [Dan McClary] JsonFactory rewrite of toJSON for SchemaRDD
149dafd [Dan McClary] wrapped scala toJSON in sql.py
5e5eb1b [Dan McClary] switched to Jackson for JSON processing
6c94a54 [Dan McClary] added toJSON to pyspark SchemaRDD
aaeba58 [Dan McClary] added toJSON to pyspark SchemaRDD
1d171aa [Dan McClary] upated missing brace on if statement
319e3ba [Dan McClary] updated to upstream master with merged SPARK-4228
424f130 [Dan McClary] tests pass, ready for pull and PR
626a5b1 [Dan McClary] added toJSON to SchemaRDD
f7d166a [Dan McClary] added toJSON method
5d34e37 [Dan McClary] merge resolved
d6d19e9 [Dan McClary] pr example
(cherry picked from commit b8e6886fb8ff8f667fb7e600cd727d8649cad1d1)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR enables the Web UI storage tab to show the in-memory table name instead of the mysterious query plan string as the name of the in-memory columnar RDD.
Note that after #2501, a single columnar RDD can be shared by multiple in-memory tables, as long as their query results are the same. In this case, only the first cached table name is shown. For example:
```sql
CACHE TABLE first AS SELECT * FROM src;
CACHE TABLE second AS SELECT * FROM src;
```
The Web UI only shows "In-memory table first".
Author: Cheng Lian <lian@databricks.com>
Closes #3383 from liancheng/columnar-rdd-name and squashes the following commits:
071907f [Cheng Lian] Fixes tests
12ddfa6 [Cheng Lian] Names in-memory columnar RDD with corresponding table name
(cherry picked from commit abf29187f0342b607fcefe269391d4db58d2a957)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
(backport for branch-1.2)
backport for branch-1.2 as per #3356
Author: zsxwing <zsxwing@gmail.com>
Closes #3376 from zsxwing/SPARK-4481-branch-1.2 and squashes the following commits:
53b94e8 [zsxwing] Fix the wrong description of updateFunc
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are some inconsistencies in the gradient boosting APIs. The target is a general boosting meta-algorithm, but the implementation is attached to trees. This was partially due to the delay of SPARK-1856. But for the 1.2 release, we should make the APIs consistent.
1. WeightedEnsembleModel -> private[tree] TreeEnsembleModel and renamed members accordingly.
1. GradientBoosting -> GradientBoostedTrees
1. Add RandomForestModel and GradientBoostedTreesModel and hide CombiningStrategy
1. Slightly refactored TreeEnsembleModel (Vote takes weights into consideration.)
1. Remove `trainClassifier` and `trainRegressor` from `GradientBoostedTrees` because they are the same as `train`
1. Rename class `train` method to `run` because it hides the static methods with the same name in Java. Deprecated `DecisionTree.train` class method.
1. Simplify BoostingStrategy and make sure the input strategy is not modified. Users should put algo and numClasses in treeStrategy. We create ensembleStrategy inside boosting.
1. Fix a bug in GradientBoostedTreesSuite with AbsoluteError
1. doc updates
manishamde jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #3374 from mengxr/SPARK-4486 and squashes the following commits:
7097251 [Xiangrui Meng] address joseph's comments
98dea09 [Xiangrui Meng] address manish's comments
4aae3b7 [Xiangrui Meng] add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy
ea4c467 [Xiangrui Meng] fix unit tests
751da4e [Xiangrui Meng] rename class method train -> run
19030a5 [Xiangrui Meng] update boosting public APIs
(cherry picked from commit 15cacc81240eed8834b4730c5c6dc3238f003465)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
MetadataCleaner schedules its task with a wrong parameter for the delay time.
Author: Leolh <leosandylh@gmail.com>
Closes #3306 from Leolh/master and squashes the following commits:
4a21f4e [Leolh] Update MetadataCleaner.scala
(cherry picked from commit e216ffaead983274428052caa992b20760b2c5e0)
Signed-off-by: Andrew Or <andrew@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
**Summary.** Currently, we may spill many small files in `ExternalAppendOnlyMap` and `ExternalSorter`. The underlying root cause of this is summarized in [SPARK-4452](https://issues.apache.org/jira/browse/SPARK-4452). This PR does not address this root cause, but simply provides the guarantee that we never spill the in-memory data structure if its size is less than a configurable threshold of 5MB. This config is not documented because we don't want users to set it themselves, and it is not hard-coded because we need to change it in tests.
**Symptom.** Each spill is orders of magnitude smaller than 1MB, and there are many spills. In environments where the ulimit is set, this frequently causes the "too many open files" exceptions observed in [SPARK-3633](https://issues.apache.org/jira/browse/SPARK-3633).
```
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292769 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4760 B to disk (292770 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4520 B to disk (292771 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4560 B to disk (292772 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292773 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4784 B to disk (292774 spills so far)
```
**Reproduction.** I ran the following on a small 4-node cluster with 512MB executors. Note that the back-to-back shuffle here is necessary for reasons described in [SPARK-4452](https://issues.apache.org/jira/browse/SPARK-4452). The second shuffle is a `reduceByKey` because it performs a map-side combine.
```
sc.parallelize(1 to 100000000, 100)
  .map { i => (i, i) }
  .groupByKey()
  .reduceByKey(_ ++ _)
  .count()
```
Before the change, I noticed that each thread may spill up to 1000 times, and the size of each spill is on the order of 10KB. After the change, each thread spills only up to 20 times in the worst case, and the size of each spill is on the order of 1MB.
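The guarantee described in the summary can be sketched as a single guard in front of the spill decision. This is a simplified illustration, not the `ExternalSorter` code: `should_spill`, `current_bytes` and `granted_bytes` are hypothetical names, and the 5MB constant is the threshold quoted above.

```python
SPILL_THRESHOLD_BYTES = 5 * 1024 * 1024  # threshold quoted in the summary


def should_spill(current_bytes, granted_bytes, threshold=SPILL_THRESHOLD_BYTES):
    """Decide whether the in-memory structure should be spilled.

    current_bytes: estimated size of the in-memory structure.
    granted_bytes: memory the task was allowed to use (hypothetical input).
    """
    if current_bytes < threshold:
        # Never spill tiny batches, even if no more memory can be granted.
        return False
    return current_bytes >= granted_bytes


if __name__ == "__main__":
    print(should_spill(4792, 0))                  # a ~4.7KB batch as in the log above
    print(should_spill(6 * 1024 * 1024, 5 * 1024 * 1024))
```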
Author: Andrew Or <andrew@databricks.com>
Closes #3353 from andrewor14/avoid-small-spills and squashes the following commits:
49f380f [Andrew Or] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into avoid-small-spills
27d6966 [Andrew Or] Merge branch 'master' of github.com:apache/spark into avoid-small-spills
f4736e3 [Andrew Or] Fix tests
a919776 [Andrew Or] Avoid many small spills
(cherry picked from commit 0eb4a7fb0fa1fa56677488cbd74eb39e65317621)
Signed-off-by: Andrew Or <andrew@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
message
The check for maxResultSize > 0 is missing, which results in failures. Also, the error message needs to be improved so that developers know there is a new parameter to configure.
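A minimal sketch of the missing condition, assuming that a non-positive maxResultSize means "no limit" (which is what the need for the `> 0` check implies). The function name is illustrative and is not the `TaskSetManager` method.

```python
def exceeds_max_result_size(total_result_size, max_result_size):
    """Return True only when a positive limit is configured and exceeded."""
    return max_result_size > 0 and total_result_size > max_result_size


if __name__ == "__main__":
    print(exceeds_max_result_size(2048, 1024))  # limit set and exceeded
    print(exceeds_max_result_size(2048, 0))     # no limit configured
```

Without the `max_result_size > 0` part, a limit of 0 would reject every non-empty result.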
Author: Nishkam Ravi <nravi@cloudera.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>
Closes #3360 from nishkamravi2/master_nravi and squashes the following commits:
5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
(cherry picked from commit 73fedf5a6e662b640dfe29936753721988bff6ea)
Signed-off-by: Andrew Or <andrew@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This rebases PR 3368.
This commit fixes totalRegisteredExecutors update [SPARK-4478], so that we can correctly keep track of number of registered executors.
Author: Akshat Aranya <aaranya@quantcast.com>
Closes #3373 from coolfrood/topic/SPARK-4478 and squashes the following commits:
8a4d1e4 [Akshat Aranya] Added comment
150ae93 [Akshat Aranya] [SPARK-4478] Keep totalRegisteredExecutors up-to-date
(cherry picked from commit 9ccc53c72c5bcffcc121291710754e1e2d659341)
Signed-off-by: Andrew Or <andrew@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:
4421964 [Joseph E. Gonzalez] updating documentation for graphx
(cherry picked from commit 377b06820934cab6d67f3a9182528c7f417a7d98)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit fixes a memory leak in JobProgressListener that I introduced in SPARK-2321 and adds a testing framework to ensure that it’s very difficult to inadvertently introduce new memory leaks.
This solution might be overkill, but the main idea is to partition JobProgressListener's state into three buckets: collections that should be empty once Spark is idle, collections that must obey some hard size limit, and collections that have a soft size limit (they can grow arbitrarily large when Spark is active but must shrink to fit within some bound after Spark becomes idle).
Based on this, we can write fairly generic tests that run workloads that submit more than `spark.ui.retainedStages` stages and `spark.ui.retainedJobs` jobs then check that these various collections' sizes obey their contracts.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3372 from JoshRosen/SPARK-4495 and squashes the following commits:
c73fab5 [Josh Rosen] "data structures" -> collections
be72e81 [Josh Rosen] [SPARK-4495] Fix memory leaks in JobProgressListener
(cherry picked from commit 04d462f648aba7b18fc293b7189b86af70e421bc)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
in the same way as TransformedDStream
In class TransformedDStream:
```scala
require(parents.length > 0, "List of DStreams to transform is empty")
require(parents.map(_.ssc).distinct.size == 1, "Some of the DStreams have different contexts")
require(parents.map(_.slideDuration).distinct.size == 1,
  "Some of the DStreams have different slide durations")
```
In class UnionDStream:
```scala
if (parents.length == 0) {
  throw new IllegalArgumentException("Empty array of parents")
}
if (parents.map(_.ssc).distinct.size > 1) {
  throw new IllegalArgumentException("Array of parents have different StreamingContexts")
}
if (parents.map(_.slideDuration).distinct.size > 1) {
  throw new IllegalArgumentException("Array of parents have different slide times")
}
```
The function is the same, but the implementation is not. I think they should be the same.
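The shared validation can be written once and reused by both classes. The Python sketch below only mirrors the three checks quoted above. `Stream` and `validate_parents` are hypothetical stand-ins, not Spark Streaming APIs.

```python
from collections import namedtuple

# Hypothetical stand-in for a DStream: only the two fields the checks need.
Stream = namedtuple("Stream", ["ssc", "slide_duration"])


def validate_parents(parents):
    """Raise ValueError unless parents is non-empty and all parents share
    one context and one slide duration."""
    if len(parents) == 0:
        raise ValueError("List of DStreams is empty")
    if len({p.ssc for p in parents}) != 1:
        raise ValueError("Some of the DStreams have different contexts")
    if len({p.slide_duration for p in parents}) != 1:
        raise ValueError("Some of the DStreams have different slide durations")


if __name__ == "__main__":
    validate_parents([Stream("ctx", 10), Stream("ctx", 10)])
    print("parents are consistent")
```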
Author: Yadong Qi <qiyadong2010@gmail.com>
Closes #3152 from watermen/bug-fix1 and squashes the following commits:
ed66db6 [Yadong Qi] Change transform to union
b6b3b8b [Yadong Qi] The same function should have the same realization.
(cherry picked from commit c3002c4a61c4fc5b966aa384c41c3cba33de0aa6)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit should be merged for 1.2 release.
cc tdas
Author: Ken Takagiwa <ugw.gi.world@gmail.com>
Closes #3311 from giwa/patch-3 and squashes the following commits:
ab474a8 [Ken Takagiwa] [DOC][PySpark][Streaming] Fix docstring for sphinx
(cherry picked from commit 9b7bbcef8863ecd69e7511825ef9c93d8632dac2)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If there are some big broadcasts (or other objects) in the Python worker, the free memory available for sorting becomes too small, so the sorter keeps spilling small files to disk and finally fails with too many open files.
This PR delays spilling until the used memory goes over the limit and has increased since the last spill. This increases the size of the spill files and improves stability and performance in these cases. (We also do this in ExternalAggregator.)
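The delayed-spill rule can be sketched as a small policy object. This is an illustration of the idea only: `SpillPolicy` and its fields are hypothetical names, and the real sorter may compute its limit differently.

```python
class SpillPolicy:
    """Spill only when memory is over the limit AND has grown since the last spill."""

    def __init__(self, memory_limit_mb):
        self.memory_limit_mb = memory_limit_mb
        self.used_after_last_spill_mb = 0

    def should_spill(self, used_mb):
        over_limit = used_mb > self.memory_limit_mb
        grown = used_mb > self.used_after_last_spill_mb
        return over_limit and grown

    def record_spill(self, used_after_mb):
        # Memory still in use after spilling (e.g. held by a big broadcast).
        self.used_after_last_spill_mb = used_after_mb


if __name__ == "__main__":
    policy = SpillPolicy(memory_limit_mb=512)
    policy.record_spill(600)         # a broadcast keeps usage at 600MB
    print(policy.should_spill(600))  # no growth since the last spill
    print(policy.should_spill(650))  # usage has grown again
```

With a plain "over the limit" test, the 600MB baseline in the example would trigger a spill on every check.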
Author: Davies Liu <davies@databricks.com>
Closes #3252 from davies/sort and squashes the following commits:
711fb6c [Davies Liu] improve sort spilling
(cherry picked from commit 73c8ea84a668f443eb18ce15ba97023da041d808)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I tried to build for Scala 2.11 using sbt with the following command:
```
$ sbt/sbt -Dscala-2.11 assembly
```
but it ends with the following error messages:
```
[error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.0: not found
[error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: org.scalamacros#quasiquotes_2.11;2.0.1: not found
```
The reason is:
If the system property `-Dscala-2.11` is set without a value, `SparkBuild.scala` adds the `scala-2.11` profile, but `sbt-pom-reader` still activates the `scala-2.10` profile instead of `scala-2.11`, because the activator `PropertyProfileActivator` that `sbt-pom-reader` uses internally checks whether the property value is empty.
With this change the property is set to a non-empty value, so there is no need to add profiles in `SparkBuild.scala`; `sbt-pom-reader` handles them as expected.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #3342 from ueshin/issues/SPARK-4429 and squashes the following commits:
14d86e8 [Takuya UESHIN] Add a comment.
4eef52b [Takuya UESHIN] Remove unneeded condition.
ce98d0f [Takuya UESHIN] Set non-empty value to system property "scala-2.11" if the property exists instead of adding profile.
(cherry picked from commit f9adda9afb63bfdb722be95304f991a3b38a54b3)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Somehow the Maven shade plugin gets stuck in an infinite loop while creating the effective POM.
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>
Closes #2959 from ScrapCodes/SPARK-3962/scope-provided and squashes the following commits:
994d1d3 [Prashant Sharma] Fixed failing flume tests
270b4fb [Prashant Sharma] Removed most of the unused code.
bb3bbfd [Prashant Sharma] SPARK-3962 Marked scope as provided for external.
(cherry picked from commit 1c938413ba5579034675f1b4ea3b8fd0e47dd8d6)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
default
The write ahead log of ReceivedBlockTracker gets enabled as soon as the checkpoint directory is set. This should not happen: the WAL should be enabled only if it is enabled in the Spark configuration.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #3358 from tdas/SPARK-4482 and squashes the following commits:
b740136 [Tathagata Das] Fixed bug in ReceivedBlockTracker
(cherry picked from commit 22fc4e751c0a2f0ff39e42aa0a8fb9459d7412ec)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When running Spark locally, if the number of threads is specified as 0 (e.g., `spark-submit --master local[0] ...`), the job gets stuck and does not run at all.
I think it's better to validate the parameter.
Fix for [SPARK-4470](https://issues.apache.org/jira/browse/SPARK-4470).
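A rough sketch of such a validation, written in Python for illustration. `local_thread_count` and its error messages are hypothetical and do not reproduce the Scala code in the patch. It only handles the `local[N]` and `local[*]` forms.

```python
import re

LOCAL_N = re.compile(r"local\[([0-9]+|\*)\]")


def local_thread_count(master, available_cores):
    """Parse local[N] or local[*] and reject a non-positive thread count."""
    match = LOCAL_N.fullmatch(master)
    if match is None:
        raise ValueError("Unrecognized local master URL: %s" % master)
    token = match.group(1)
    threads = available_cores if token == "*" else int(token)
    if threads <= 0:
        raise ValueError("Asked to run locally with %d threads" % threads)
    return threads


if __name__ == "__main__":
    print(local_thread_count("local[4]", available_cores=8))
    print(local_thread_count("local[*]", available_cores=8))
```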
Author: Kenichi Maehashi <webmaster@kenichimaehashi.com>
Closes #3337 from kmaehashi/spark-4470 and squashes the following commits:
3ad76f3 [Kenichi Maehashi] fix code style
7716734 [Kenichi Maehashi] SPARK-4470: Validate number of threads in local mode
(cherry picked from commit eacc788346ccae232bd530dd880f801475a49734)
Signed-off-by: Andrew Or <andrew@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The elementsRead variable should be reset to 0 after each spill.
Author: Tianshuo Deng <tdeng@twitter.com>
Closes #3302 from tsdeng/fix_external_sorter_record_count and squashes the following commits:
7b56ca0 [Tianshuo Deng] fix method signature
782c7de [Tianshuo Deng] make elementsRead private, fix comment
bb7ff28 [Tianshuo Deng] update elemetsRead through addElementsRead method
74ca246 [Tianshuo Deng] fix elements read count
(cherry picked from commit d75579d09912cfb1eeac0589d625ea0452701fa0)
Signed-off-by: Andrew Or <andrew@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
pwendell
Please take a look
Author: tedyu <yuzhihong@gmail.com>
Closes #3286 from tedyu/master and squashes the following commits:
e61e610 [tedyu] SPARK-4455 Exclude dependency on hbase-annotations module
7e3a57a [tedyu] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark
2f28b08 [tedyu] Exclude dependency on hbase-annotations module
(cherry picked from commit 5f5ac2dafaf849d2375c81d699d82874ac462b49)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
|
|
|
|
|
|
|
|
|
|
| |
InStream is not closed after data is read from Tachyon, which leaves the blocks in Tachyon locked after they are accessed.
Author: Mingfei <mingfei.shi@intel.com>
Closes #3290 from shimingfei/lockFix and squashes the following commits:
fffe345 [Mingfei] close InStream after the block is accessed
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently the Tachyon client is not closed when TachyonBlockManager is shut down, which means some resources in Tachyon are not reclaimed.
Author: Mingfei <mingfei.shi@intel.com>
Closes #3299 from shimingfei/closeClient and squashes the following commits:
0913fbd [Mingfei] close Tachyon client when TachyonBlockManager is shutdown
(cherry picked from commit 67e9876b3e457b151c123fdb5ac2d8e8371e6acf)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
with literals on the left hand side
For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.
(This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.)
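The rule is that moving the literal from the left to the right of a comparison must mirror the operator. A small sketch with illustrative types follows. `Literal`, `Column` and `to_column_first` are hypothetical and are not the Parquet filter classes.

```python
from collections import namedtuple

Literal = namedtuple("Literal", ["value"])
Column = namedtuple("Column", ["name"])

# Operator to use once the two operands have been swapped.
FLIPPED = {"<": ">", "<=": ">=", ">": "<", ">=": "<=", "=": "="}


def to_column_first(left, op, right):
    """Rewrite `literal op column` as `column flipped_op literal`."""
    if isinstance(left, Literal) and isinstance(right, Column):
        return right, FLIPPED[op], left
    return left, op, right


if __name__ == "__main__":
    # 10 < someVar is the same predicate as someVar > 10
    print(to_column_first(Literal(10), "<", Column("someVar")))
```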
Author: Cheng Lian <lian@databricks.com>
Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:
0130897 [Cheng Lian] Fixes Parquet comparison filter generation
(cherry picked from commit 423baea953996a66dde671ff6db2fb1f32fbe8cb)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
```
pyspark.RDD.randomSplit(self, weights, seed=None)
Randomly splits this RDD with the provided weights.
:param weights: weights for splits, will be normalized if they don't sum to 1
:param seed: random seed
:return: split RDDs in a list
>>> rdd = sc.parallelize(range(10), 1)
>>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
>>> rdd1.collect()
[3, 6]
>>> rdd2.collect()
[0, 5, 7]
>>> rdd3.collect()
[1, 2, 4, 8, 9]
```
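One way to read the weight normalization: the weights are turned into consecutive sub-ranges of [0, 1), and each element goes to the split whose range contains its random draw. The sketch below works on plain lists and is only an approximation of the idea. `split_bounds` and `random_split` are illustrative names, and the results will not match the RDD output shown above.

```python
import random
from itertools import accumulate


def split_bounds(weights):
    """Normalize weights into consecutive (low, high) ranges covering [0, 1)."""
    running = list(accumulate(weights))
    total = float(running[-1])
    cumulative = [0.0] + [r / total for r in running]
    return list(zip(cumulative[:-1], cumulative[1:]))


def random_split(items, weights, seed=None):
    """Assign each item to exactly one split according to the weights."""
    bounds = split_bounds(weights)
    rng = random.Random(seed)
    splits = [[] for _ in bounds]
    for item in items:
        x = rng.random()
        for i, (low, high) in enumerate(bounds):
            if low <= x < high:
                splits[i].append(item)
                break
    return splits


if __name__ == "__main__":
    print(split_bounds([0.4, 0.6, 1.0]))
    print(random_split(range(10), [0.4, 0.6, 1.0], seed=11))
```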
Author: Davies Liu <davies@databricks.com>
Closes #3193 from davies/randomSplit and squashes the following commits:
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
(cherry picked from commit 7f22fa81ebd5e501fcb0e1da5506d1d4fb9250cf)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Spark hangs with the following code:
~~~
sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
~~~
This is because ZippedWithIndexRDD triggers a job in getPartitions and it causes a deadlock in DAGScheduler.getPreferredLocs (synced). The fix is to compute `startIndices` during construction.
This should be applied to branch-1.0, branch-1.1, and branch-1.2.
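The idea of computing `startIndices` up front can be sketched on plain Python lists. This is a simplified model and not the Scala code in `ZippedWithIndexRDD`; `start_indices` and `zip_with_index` are hypothetical names, and partitions are represented as in-memory lists.

```python
from itertools import accumulate


def start_indices(partition_sizes):
    """Index of the first element of each partition.

    Only the sizes of the first n-1 partitions are needed.
    """
    if not partition_sizes:
        return []
    return [0] + list(accumulate(partition_sizes[:-1]))


def zip_with_index(partitions):
    """Pair every element with its global index, partition by partition."""
    # The counting step happens once, here, before any partition is processed.
    starts = start_indices([len(p) for p in partitions])
    return [
        [(item, start + offset) for offset, item in enumerate(part)]
        for part, start in zip(partitions, starts)
    ]
```

In Spark the counting step is a job in its own right, which is why the point at which it runs matters for the scheduler.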
pwendell
Author: Xiangrui Meng <meng@databricks.com>
Closes #3291 from mengxr/SPARK-4433 and squashes the following commits:
c284d9f [Xiangrui Meng] fix a racing condition in zipWithIndex
(cherry picked from commit bb46046154a438df4db30a0e1fd557bd3399ee7b)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch will bring support for broadcasting objects larger than 2G.
pickle, zlib, FrameSerializer and Array[Byte] cannot handle objects larger than 2G, so this patch introduces LargeObjectSerializer to serialize broadcast objects: the object is serialized and compressed into small chunks. It also changes the type of Broadcast[Array[Byte]] to Broadcast[Array[Array[Byte]]].
Testing broadcast of objects larger than 2G is slow and memory hungry, so this was tested manually; a test could be added to SparkPerf.
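The chunking idea can be illustrated with a small Python sketch. It is not the `LargeObjectSerializer` from this patch: `dump_chunks` and `load_chunks` are hypothetical helpers, and for simplicity the sketch pickles the whole object into one byte string first, which a real implementation for objects over 2G would have to avoid.

```python
import pickle
import zlib


def dump_chunks(obj, chunk_size=1 << 20):
    """Serialize obj and return a list of independently compressed chunks."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return [
        zlib.compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def load_chunks(chunks):
    """Decompress the chunks in order and rebuild the original object."""
    return pickle.loads(b"".join(zlib.decompress(c) for c in chunks))
```

Because every chunk stays well below the size limit, the result maps naturally onto an array of byte arrays rather than a single byte array.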
Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>
Closes #2659 from davies/huge and squashes the following commits:
7b57a14 [Davies Liu] add more tests for broadcast
28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
a2f6a02 [Davies Liu] bug fix
4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
5875c73 [Davies Liu] address comments
10a349b [Davies Liu] address comments
0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
6182c8f [Davies Liu] Merge branch 'master' into huge
d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
2514848 [Davies Liu] address comments
fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
1c2d928 [Davies Liu] fix scala style
091b107 [Davies Liu] broadcast objects larger than 2G
(cherry picked from commit 4a377aff2d36b64a65b54192a987aba44b8f78e0)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
```
class LogisticRegressionWithLBFGS
| train(cls, data, iterations=100, initialWeights=None, corrections=10, tolerance=0.0001, regParam=0.01, intercept=False)
| Train a logistic regression model on the given data.
|
| :param data: The training data, an RDD of LabeledPoint.
| :param iterations: The number of iterations (default: 100).
| :param initialWeights: The initial weights (default: None).
| :param regParam: The regularizer parameter (default: 0.01).
| :param regType: The type of regularizer used for training
| our model.
| :Allowed values:
| - "l1" for using L1 regularization
| - "l2" for using L2 regularization
| - None for no regularization
| (default: "l2")
| :param intercept: Boolean parameter which indicates the use
| or not of the augmented representation for
| training data (i.e. whether bias features
| are activated or not).
| :param corrections: The number of corrections used in the LBFGS update (default: 10).
| :param tolerance: The convergence tolerance of iterations for L-BFGS (default: 1e-4).
|
| >>> data = [
| ... LabeledPoint(0.0, [0.0, 1.0]),
| ... LabeledPoint(1.0, [1.0, 0.0]),
| ... ]
| >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data))
| >>> lrm.predict([1.0, 0.0])
| 1
| >>> lrm.predict([0.0, 1.0])
| 0
| >>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
| [1, 0]
```
Author: Davies Liu <davies@databricks.com>
Closes #3307 from davies/lbfgs and squashes the following commits:
34bd986 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into lbfgs
5a945a6 [Davies Liu] address comments
941061b [Davies Liu] Merge branch 'master' of github.com:apache/spark into lbfgs
03e5543 [Davies Liu] add it to docs
ed2f9a8 [Davies Liu] add regType
76cd1b6 [Davies Liu] reorder arguments
4429a74 [Davies Liu] Update classification.py
9252783 [Davies Liu] python api for LogisticRegressionWithLBFGS
(cherry picked from commit d2e29516f2064f93f3a9070c91fc7460706e0b0a)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit removes the behavior where when a user clicks
"Show additional metrics" on the stage page, all of the additional
metrics are automatically selected; now, collapsing and expanding
the additional metrics has no effect on which options are selected.
Instead, there's a "(De)select All" box at the top; checking this box
checks all additional metrics (and similarly, unchecking it unchecks
all additional metrics).
This commit is intended to be backported to 1.2, so that the additional
metrics behavior is not confusing to users.
Now when a user clicks the "Show additional metrics" menu, this is what
it looks like:
![image](https://cloud.githubusercontent.com/assets/1108612/5094347/1541ead6-6f15-11e4-8e8c-25a65ddbdfb2.png)
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes #3331 from kayousterhout/SPARK-4463 and squashes the following commits:
9e17cea [Kay Ousterhout] Added italics
b731230 [Kay Ousterhout] [SPARK-4463] Add (de)select all button for add'l metrics.
(cherry picked from commit 010bc86e40a0e54b6850b75abd6105e70eb1af10)
Signed-off-by: Andrew Or <andrew@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The progress bar will look like this:
![1___spark_job__85_250_finished__4_are_running___java_](https://cloud.githubusercontent.com/assets/40902/4854813/a02f44ac-6099-11e4-9060-7c73a73151d6.png)
In the right corner, the numbers are: finished tasks, running tasks, total tasks.
After the stage has finished, it will disappear.
The progress bar is shown only if the logging level is WARN or higher (progress in the title is still shown); it can be turned off with spark.driver.showConsoleProgress.
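A fixed-width console bar of this kind can be sketched as follows. The exact layout (the `[Stage N:` header, the `(finished + running) / total` counter on the right, and the `=`/`>` characters) is an assumption made for illustration, and `render_bar` is a hypothetical helper rather than Spark code.

```python
def render_bar(stage_id, finished, running, total, width=60):
    """Render one stage as a fixed-width text progress bar."""
    header = "[Stage %d:" % stage_id
    tailer = "(%d + %d) / %d]" % (finished, running, total)
    room = width - len(header) - len(tailer)
    if room <= 0 or total <= 0:
        # Not enough space (or nothing to measure): show the counters only.
        return header + tailer
    done = room * finished // total
    bar = "".join(
        "=" if i < done else (">" if i == done else " ")
        for i in range(room)
    )
    return header + bar + tailer
```

Printing the string followed by a carriage return instead of a newline lets the next refresh overwrite the same console line.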
Author: Davies Liu <davies@databricks.com>
Closes #3029 from davies/progress and squashes the following commits:
95336d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
fc49ac8 [Davies Liu] address commentse
2e90f75 [Davies Liu] show multiple stages in same time
0081bcc [Davies Liu] address comments
38c42f1 [Davies Liu] fix tests
ab87958 [Davies Liu] disable progress bar during tests
30ac852 [Davies Liu] re-implement progress bar
b3f34e5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
6fd30ff [Davies Liu] show progress bar if no task finished in 500ms
e4e7344 [Davies Liu] refactor
e1f524d [Davies Liu] revert unnecessary change
a60477c [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
5cae3f2 [Davies Liu] fix style
ea49fe0 [Davies Liu] address comments
bc53d99 [Davies Liu] refactor
e6bb189 [Davies Liu] fix logging in sparkshell
7e7d4e7 [Davies Liu] address commments
5df26bb [Davies Liu] fix style
9e42208 [Davies Liu] show progress bar in console and title
(cherry picked from commit e34f38ff1a0dfbb0ffa4bd11071e03b1a58de998)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If SparkSubmit dies first, the bootstrapper will be blocked by the shutdown hook. Calling sys.exit() in a shutdown hook causes a deadlock.
cc andrewor14
Author: Davies Liu <davies@databricks.com>
Closes #3289 from davies/fix_bootstraper and squashes the following commits:
ea5cdd1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_bootstraper
e04b690 [Davies Liu] remove sys.exit in hook
4d11366 [Davies Liu] remove shutdown hook if subprocess die fist
(cherry picked from commit 80f31778820586a93d73fa15279a204611cc3c60)
Signed-off-by: Andrew Or <andrew@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR adds a regression test for SPARK-4434.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #3326 from sarutak/add-triple-slash-testcase and squashes the following commits:
82bc9cc [Kousuke Saruta] Fixed wrong grammar in comment
9149027 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into add-triple-slash-testcase
c1c80ca [Kousuke Saruta] Fixed style
4f30210 [Kousuke Saruta] Modified comments
9e09da2 [Kousuke Saruta] Fixed URI validation for jar file
d4b99ef [Kousuke Saruta] [SPARK-4075] [Deploy] Jar url validation is not enough for Jar file
ac79906 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into add-triple-slash-testcase
6d4f47e [Kousuke Saruta] Added a test case as a regression check for SPARK-4434
(cherry picked from commit bfebfd8b28eeb7e75292333f7885aa0830fcb5fe)
Signed-off-by: Andrew Or <andrew@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
directory and the file
Author: Michael Armbrust <michael@databricks.com>
Closes #3272 from marmbrus/keyInPartitionedTable and squashes the following commits:
447f08c [Michael Armbrust] Support partitioned parquet tables that have the key in both the directory and the file
(cherry picked from commit 90d72ec8502f7ec11d2fe42f08c884ad2159266f)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In PySpark, ALS can take an RDD of (user, product, rating) tuples as input. However, model.predict outputs an RDD of Rating. So on the input side, users can use r[0], r[1], r[2], while on the output side, users have to use r.user, r.product, r.rating. We should allow lookup by index in Rating by making Rating a namedtuple.
davies
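A minimal sketch of the namedtuple approach, assuming only the three fields named above; the class body in PySpark may carry extra methods that are not shown here.

```python
from collections import namedtuple


class Rating(namedtuple("Rating", ["user", "product", "rating"])):
    """A (user, product, rating) record usable by attribute or by index."""
    __slots__ = ()  # keep instances as lightweight as plain tuples


r = Rating(1, 2, 5.0)
by_name = (r.user, r.product, r.rating)
by_index = (r[0], r[1], r[2])
user, product, rating = r  # tuple unpacking works as well
```

Since a namedtuple is still a tuple, code that consumed plain `(user, product, rating)` tuples keeps working on the model output.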
Author: Xiangrui Meng <meng@databricks.com>
Closes #3261 from mengxr/SPARK-4396 and squashes the following commits:
543aef0 [Xiangrui Meng] use named tuple to implement ALS
0b61bae [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4396
d3bd7d4 [Xiangrui Meng] allow lookup by index in Python's Rating
(cherry picked from commit b54c6ab3c54e65238d6766832ea1f3fcd694f2fd)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR adds setThreshold() and clearThreshold() to LogisticRegressionModel and SVMModel, and also supports an RDD of vectors in LogisticRegressionModel.predict(), SVMModel.predict() and NaiveBayes.predict().
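The effect of setting and clearing a threshold can be sketched with a toy linear model. `ThresholdedLinearModel` is a hypothetical class written for this illustration, a plain list of vectors stands in for an RDD, and thresholding the raw margin is a simplification (a logistic model would threshold a probability instead).

```python
class ThresholdedLinearModel(object):
    """Toy linear classifier with an optional decision threshold."""

    def __init__(self, weights, intercept=0.0, threshold=0.0):
        self.weights = list(weights)
        self.intercept = intercept
        self.threshold = threshold

    def setThreshold(self, value):
        """Scores above this value are predicted as class 1."""
        self.threshold = value

    def clearThreshold(self):
        """Make predict() return raw scores instead of 0/1 labels."""
        self.threshold = None

    def _margin(self, x):
        return sum(w * v for w, v in zip(self.weights, x)) + self.intercept

    def predict(self, x):
        # Accept a single feature vector or a list of feature vectors.
        if x and isinstance(x[0], (list, tuple)):
            return [self.predict(v) for v in x]
        margin = self._margin(x)
        if self.threshold is None:
            return margin
        return 1 if margin > self.threshold else 0
```

Clearing the threshold is useful when the raw scores are needed, for example to rank predictions or to pick a threshold afterwards.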
Author: Davies Liu <davies@databricks.com>
Closes #3305 from davies/setThreshold and squashes the following commits:
d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into setThreshold
e4acd76 [Davies Liu] address comments
2231a5f [Davies Liu] bugfix
7bd9009 [Davies Liu] address comments
0b0a8a7 [Davies Liu] address comments
c1e5573 [Davies Liu] improve classification
(cherry picked from commit 8fbf72b7903b5bbec8d949151aa4693b4af26ff5)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Felix Maximilian Möller <felixmaximilian.moeller@immobilienscout24.de>
Closes #3343 from felixmaximilian/fix-documentation and squashes the following commits:
43dcdfb [Felix Maximilian Möller] Removed the information about the switch implicitPrefs. The parameter implicitPrefs cannot be set in this context because it is inherent true when calling the trainImplicit method.
7d172ba [Felix Maximilian Möller] added missing parameter alpha in doc string.
(cherry picked from commit cedc3b5aa43a16e2da62f12a36317f00aa1002cc)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The Maven release plug-in does not support publishing two separate sets of artifacts for a single release. Because of the way that Scala 2.11 support in Spark works, we have to write some customized code to do this. The good news is that the Maven release API is just a thin wrapper around making git commits and pushing artifacts to the HTTP API of Apache's Sonatype server, and this might overall make our deployment easier to understand.
This was already used for the 1.2 snapshot, so I think it is working well. One other nice thing is this could be pretty easily extended to publish nightly snapshots.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #3332 from pwendell/releases and squashes the following commits:
2fedaed [Patrick Wendell] Automate the opening and closing of Sonatype repos
e2a24bb [Patrick Wendell] Fixing issue where we overrode non-spark version numbers
9df3a50 [Patrick Wendell] Adding TODO
1cc1749 [Patrick Wendell] Don't build the thriftserver for 2.11
933201a [Patrick Wendell] Make tagging of release commit eager
d0388a6 [Patrick Wendell] Support Scala 2.11 build
4f4dc62 [Patrick Wendell] Change to 2.11 should not be included when committing new patch
bf742e1 [Patrick Wendell] Minor fixes
ffa1df2 [Patrick Wendell] Adding a Scala 2.11 package to test it
9ac4381 [Patrick Wendell] Addressing TODO
b3105ff [Patrick Wendell] Removing commented out code
d906803 [Patrick Wendell] Small fix
3f4d985 [Patrick Wendell] More work
fcd54c2 [Patrick Wendell] Consolidating use of keys
df2af30 [Patrick Wendell] Changes to release stuff
(cherry picked from commit c6e0c2ab1c29c184a9302d23ad75e4ccd8060242)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.
While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generated the record filter, and then called `findExpression` to traverse the generated filter and find all pushed-down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. That approach required the `CatalystFilter` class hierarchy to bind the Catalyst predicates to their generated Parquet filters, which complicated the code base considerably.
The basic idea of this PR is that we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.
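The "a predicate is pushed down exactly when a filter can be generated for it" idea can be sketched in Python, with `None` playing the role of Scala's `None`. The tuple representation of predicates and the `create_filter` and `split_predicates` helpers are hypothetical; they stand in for the Catalyst and Parquet types, which are not reproduced here.

```python
SUPPORTED = {"=", "<", "<=", ">", ">="}


def create_filter(pred):
    """Return a filter description, or None for unsupported predicates.

    Predicates are tuples such as ("<", "age", 30) or ("like", "name", "a%").
    """
    op, column, literal = pred
    if op in SUPPORTED:
        return {"op": op, "column": column, "value": literal}
    return None  # unsupported predicate type: nothing is pushed down


def split_predicates(predicates):
    """Split predicates into generated filters and predicates left for the engine."""
    pushed, remaining = [], []
    for pred in predicates:
        generated = create_filter(pred)
        if generated is None:
            remaining.append(pred)
        else:
            pushed.append(generated)
    return pushed, remaining
```

No second traversal of the generated filter is needed: the return value of `create_filter` already says which predicates were handled.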
Author: Cheng Lian <lian@databricks.com>
Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:
d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic
(cherry picked from commit 36b0956a3eadc7343ed0d25c79a6ce0496eaaccd)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3308 from chenghao-intel/unwrap_constant_oi and squashes the following commits:
156b500 [Cheng Hao] rebase the master
c5b20ab [Cheng Hao] unwrap for the ConstantObjectInspector
(cherry picked from commit ef7c464effa1510b24bd8e665e4df6c4839b0c87)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `totalSize` of an external table is always zero, which influences the join strategy: a broadcast join is always chosen for external tables.
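Why a zero size estimate always selects a broadcast join can be seen in a small sketch of a size-based decision rule. The `choose_join` helper and the 10 MB threshold value are assumptions for illustration, not the planner's actual code.

```python
# Assumed threshold for this sketch: 10 MB.
BROADCAST_THRESHOLD = 10 * 1024 * 1024


def choose_join(estimated_size_in_bytes, threshold=BROADCAST_THRESHOLD):
    """Pick a join strategy from the estimated size of one side of the join."""
    if estimated_size_in_bytes <= threshold:
        return "broadcast"
    return "shuffle"


# An estimate of 0 bytes is always below the threshold, however large
# the table really is.
strategy_for_missing_stats = choose_join(0)
strategy_for_large_table = choose_join(50 * 1024 ** 3)
```

With a correct size estimate, a large external table falls on the other side of the threshold and is no longer broadcast.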
Author: w00228970 <wangfei1@huawei.com>
Closes #3304 from scwf/statistics and squashes the following commits:
568f321 [w00228970] fix statistics for external table
(cherry picked from commit 42389b1780311d90499b4ce2315ceabf5b6ab384)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
for complex types
This PR is exactly the same as #3178 except it reverts the `FileStatus.isDir` to `FileStatus.isDirectory` change, since it doesn't compile with Hadoop 1.
Author: Cheng Lian <lian@databricks.com>
Closes #3298 from liancheng/date-for-thriftserver and squashes the following commits:
866037e [Cheng Lian] Revers isDirectory to isDir (it breaks Hadoop 1 profile)
6f71d0b [Cheng Lian] Makes toHiveString static
26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
a92882a [Cheng Lian] Updates HiveShim for 0.13.1
73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
(cherry picked from commit 6b7f2f753d16ff038881772f1958e3f4fd5597a7)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3217 from chenghao-intel/mutablerow and squashes the following commits:
e8a10bd [Cheng Hao] revert the change of Row object
4681aea [Cheng Hao] Add toMutableRow method in object Row
a751838 [Cheng Hao] Construct the MutableRow from an existed row
(cherry picked from commit 69e858cc7748b6babadd0cbe20e65f3982161cbf)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
`Cast` from `NaN` or `Infinity` of `Double` or `Float` to `TimestampType` throws `NumberFormatException`.
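A guarded conversion of this kind can be sketched in Python. Returning `None` (the analogue of SQL NULL) for non-finite inputs is an assumption about the intended behavior, and `double_to_timestamp_micros` is a hypothetical helper, not the Catalyst `Cast` code.

```python
import math


def double_to_timestamp_micros(seconds):
    """Convert seconds since the epoch to whole microseconds.

    Returns None for NaN and +/-Infinity instead of raising an error.
    """
    if math.isnan(seconds) or math.isinf(seconds):
        return None
    return int(round(seconds * 1000000))
```

Checking for non-finite values before the conversion avoids the exception entirely, because the failing numeric conversion is never attempted.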
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #3283 from ueshin/issues/SPARK-4425 and squashes the following commits:
14def0c [Takuya UESHIN] Fix Cast to be able to handle NaN or Infinity to TimestampType.
(cherry picked from commit 566c791931645bfaaaf57ee5a15b9ffad534f81e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
DecimalType.
This is a follow-up to [SPARK-4390](https://issues.apache.org/jira/browse/SPARK-4390) (#3256).
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #3278 from ueshin/issues/SPARK-4420 and squashes the following commits:
7fea558 [Takuya UESHIN] Add some tests.
cb2301a [Takuya UESHIN] Fix tests.
133bad5 [Takuya UESHIN] Change nullability of Cast from DoubleType/FloatType to DecimalType.
(cherry picked from commit 3a81a1c9e0963173534d96850f3c0b7a16350838)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is inspired by the [Parquet record filter generation code](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala#L387-L400).
Author: Cheng Lian <lian@databricks.com>
Closes #3318 from liancheng/aggresive-conj-pushdown and squashes the following commits:
78b69d2 [Cheng Lian] Makes conjunction pushdown more aggressive
(cherry picked from commit 5ce7dae859dc273b0fc532c9456b5960b1eca399)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|