Commit message | Author | Age | Files | Lines
...
* [SPARK-4429][BUILD] Build for Scala 2.11 using sbt fails. | Takuya UESHIN | 2014-11-19 | 1 | -7/+6

I tried to build for Scala 2.11 using sbt with the following command:

```
$ sbt/sbt -Dscala-2.11 assembly
```

but it ends with the following error messages:

```
[error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.0: not found
[error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: org.scalamacros#quasiquotes_2.11;2.0.1: not found
```

The reason is: if the system property `-Dscala-2.11` is set without a value, `SparkBuild.scala` adds the `scala-2.11` profile, but `sbt-pom-reader` also activates the `scala-2.10` profile instead of the `scala-2.11` profile, because the activator `PropertyProfileActivator` used internally by `sbt-pom-reader` checks whether the property value is empty.

With this change the property is set to a non-empty value, so there is no need to add profiles in `SparkBuild.scala`, because `sbt-pom-reader` then handles the property as expected.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3342 from ueshin/issues/SPARK-4429 and squashes the following commits:

14d86e8 [Takuya UESHIN] Add a comment.
4eef52b [Takuya UESHIN] Remove unneeded condition.
ce98d0f [Takuya UESHIN] Set non-empty value to system property "scala-2.11" if the property exists instead of adding profile.
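
The activation problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `activates_profile` and `normalize_props` are hypothetical names, the placeholder value `"true"` is an assumption, and none of this is the actual `sbt-pom-reader` or `SparkBuild.scala` code.

```python
# Hypothetical names throughout; not sbt-pom-reader's or SparkBuild's code.
PROFILE_PROPERTY = "scala-2.11"


def activates_profile(props, name=PROFILE_PROPERTY):
    """Mimic an activator that only fires for a non-empty property value."""
    return bool(props.get(name))


def normalize_props(props, name=PROFILE_PROPERTY):
    """Return a copy in which an existing-but-empty property gets a value."""
    fixed = dict(props)
    if name in fixed and not fixed[name]:
        fixed[name] = "true"  # assumed placeholder; any non-empty value works
    return fixed


raw = {"scala-2.11": ""}  # what `-Dscala-2.11` without a value looks like here
print(activates_profile(raw))                   # the empty value is ignored
print(activates_profile(normalize_props(raw)))  # a non-empty value activates
```

The point of the sketch is that fixing the value once, up front, lets a single activation mechanism decide, instead of adding the profile by hand in a second place.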
* [DOC][PySpark][Streaming] Fix docstring for sphinx | Ken Takagiwa | 2014-11-19 | 1 | -2/+2

This commit should be merged for the 1.2 release. cc tdas

Author: Ken Takagiwa <ugw.gi.world@gmail.com>

Closes #3311 from giwa/patch-3 and squashes the following commits:

ab474a8 [Ken Takagiwa] [DOC][PySpark][Streaming] Fix docstring for sphinx
* SPARK-3962 Marked scope as provided for external projects. | Prashant Sharma | 2014-11-19 | 14 | -48/+264

Somehow the Maven shade plugin gets stuck in an infinite loop while creating the effective POM.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

Closes #2959 from ScrapCodes/SPARK-3962/scope-provided and squashes the following commits:

994d1d3 [Prashant Sharma] Fixed failing flume tests
270b4fb [Prashant Sharma] Removed most of the unused code.
bb3bbfd [Prashant Sharma] SPARK-3962 Marked scope as provided for external.
* [HOT FIX] MiMa tests are broken | Andrew Or | 2014-11-19 | 1 | -0/+6

This is blocking #3353 and other patches.

Author: Andrew Or <andrew@databricks.com>

Closes #3371 from andrewor14/mima-hot-fix and squashes the following commits:

842d059 [Andrew Or] Move excludes to the right section
c4d4f4e [Andrew Or] MIMA hot fix
* [SPARK-4481][Streaming][Doc] Fix the wrong description of updateFunc | zsxwing | 2014-11-19 | 1 | -9/+7

Removed the sentence "If `this` function returns None, then corresponding state key-value pair will be eliminated." from the description of `updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)]`.

Author: zsxwing <zsxwing@gmail.com>

Closes #3356 from zsxwing/SPARK-4481 and squashes the following commits:

76a9891 [zsxwing] Add a note that keys may be added or removed
0ebc42a [zsxwing] Fix the wrong description of updateFunc
* [SPARK-4482][Streaming] Disable ReceivedBlockTracker's write ahead log by default | Tathagata Das | 2014-11-19 | 2 | -26/+61

The write ahead log of ReceivedBlockTracker gets enabled as soon as the checkpoint directory is set. This should not happen, as the WAL should be enabled only if the WAL is enabled in the Spark configuration.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #3358 from tdas/SPARK-4482 and squashes the following commits:

b740136 [Tathagata Das] Fixed bug in ReceivedBlockTracker
* [SPARK-4470] Validate number of threads in local mode | Kenichi Maehashi | 2014-11-19 | 1 | -0/+3

When running Spark locally, if the number of threads is specified as 0 (e.g., `spark-submit --master local[0] ...`), the job gets stuck and does not run at all. I think it's better to validate the parameter.

Fix for [SPARK-4470](https://issues.apache.org/jira/browse/SPARK-4470).

Author: Kenichi Maehashi <webmaster@kenichimaehashi.com>

Closes #3337 from kmaehashi/spark-4470 and squashes the following commits:

3ad76f3 [Kenichi Maehashi] fix code style
7716734 [Kenichi Maehashi] SPARK-4470: Validate number of threads in local mode
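
A minimal sketch of this kind of validation, written in Python for brevity. `local_thread_count`, its `available_cores` parameter, and the error message are hypothetical and are not taken from Spark's source; the sketch only shows the idea of rejecting `local[0]` up front instead of letting the job hang.

```python
import re

# Hypothetical helper, not Spark's actual code: matches "local[N]" or "local[*]".
_LOCAL_N = re.compile(r"^local\[([0-9]+|\*)\]$")


def local_thread_count(master, available_cores=4):
    """Return the thread count for a local master URL, rejecting counts <= 0."""
    if master == "local":
        return 1
    match = _LOCAL_N.match(master)
    if match is None:
        raise ValueError("not a local master URL: %r" % master)
    spec = match.group(1)
    threads = available_cores if spec == "*" else int(spec)
    if threads <= 0:
        # Fail fast with a clear message rather than starting zero workers.
        raise ValueError("Asked to run locally with %d threads" % threads)
    return threads


print(local_thread_count("local[2]"))
```

Calling `local_thread_count("local[0]")` raises `ValueError` in this sketch, which is the behaviour the commit argues for.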
* [SPARK-4467] fix elements read count for ExternalSorter | Tianshuo Deng | 2014-11-19 | 3 | -14/+12

The `elementsRead` variable should be reset to 0 after each spill.

Author: Tianshuo Deng <tdeng@twitter.com>

Closes #3302 from tsdeng/fix_external_sorter_record_count and squashes the following commits:

7b56ca0 [Tianshuo Deng] fix method signature
782c7de [Tianshuo Deng] make elementsRead private, fix comment
bb7ff28 [Tianshuo Deng] update elemetsRead through addElementsRead method
74ca246 [Tianshuo Deng] fix elements read count
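
To illustrate the counter reset, here is a small Python sketch. `SpillingBuffer` is a hypothetical stand-in for a spillable collection; only the name `addElementsRead` comes from the commit list above, and the spill policy (spill once a fixed number of items is buffered) is an assumption made to keep the example short.

```python
class SpillingBuffer:
    """Toy spillable buffer; 'spilling' appends a sorted run to a list."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.buffer = []
        self.spills = []
        self._elements_read = 0  # elements read since the last spill

    def add_elements_read(self):
        # Single place where the counter is updated.
        self._elements_read += 1

    def insert(self, item):
        self.buffer.append(item)
        self.add_elements_read()
        if len(self.buffer) >= self.max_in_memory:
            self.spill()

    def spill(self):
        self.spills.append(sorted(self.buffer))
        self.buffer = []
        # The fix: the counter describes the in-memory data, so it restarts
        # from zero once that data has been spilled.
        self._elements_read = 0


buf = SpillingBuffer(max_in_memory=3)
for value in [5, 3, 4, 1]:
    buf.insert(value)
print(buf.spills, buf.buffer, buf._elements_read)
```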
* SPARK-4455 Exclude dependency on hbase-annotations module | tedyu | 2014-11-19 | 1 | -0/+22

pwendell Please take a look

Author: tedyu <yuzhihong@gmail.com>

Closes #3286 from tedyu/master and squashes the following commits:

e61e610 [tedyu] SPARK-4455 Exclude dependency on hbase-annotations module
7e3a57a [tedyu] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark
2f28b08 [tedyu] Exclude dependency on hbase-annotations module
* MAINTENANCE: Automated closing of pull requests. | Patrick Wendell | 2014-11-19 | 0 | -0/+0

This commit exists to close the following pull requests on GitHub:

Closes #2777 (close requested by 'ankurdave')
Closes #2947 (close requested by 'nchammas')
Closes #3141 (close requested by 'tdas')
Closes #2989 (close requested by 'pwendell')
* [SPARK-4432] Close InStream after the block is accessed | Mingfei | 2014-11-18 | 1 | -0/+2

InStream is not closed after data is read from Tachyon, which leaves the blocks in Tachyon locked after they are accessed.

Author: Mingfei <mingfei.shi@intel.com>

Closes #3290 from shimingfei/lockFix and squashes the following commits:

fffe345 [Mingfei] close InStream after the block is accessed
* [SPARK-4441] Close Tachyon client when TachyonBlockManager is shutdown | Mingfei | 2014-11-18 | 1 | -0/+1

Currently the Tachyon client is not closed when TachyonBlockManager is shut down, which means some resources in Tachyon are not reclaimed.

Author: Mingfei <mingfei.shi@intel.com>

Closes #3299 from shimingfei/closeClient and squashes the following commits:

0913fbd [Mingfei] close Tachyon client when TachyonBlockManager is shutdown
* Bumping version to 1.3.0-SNAPSHOT. | Marcelo Vanzin | 2014-11-18 | 34 | -39/+63

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3277 from vanzin/version-1.3 and squashes the following commits:

7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
5f404ff [Marcelo Vanzin] Add another exclusion.
19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
* [SPARK-4468][SQL] Fixes Parquet filter creation for inequality predicates with literals on the left hand side | Cheng Lian | 2014-11-18 | 2 | -4/+16

For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.

(This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.)

Author: Cheng Lian <lian@databricks.com>

Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:

0130897 [Cheng Lian] Fixes Parquet comparison filter generation
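
The operator flip can be shown with a small Python sketch. `Column`, `FLIPPED`, and `to_column_filter` are hypothetical names used only for illustration; the real change works on Catalyst expressions and Parquet `Operators`, which are not modelled here.

```python
from collections import namedtuple

# Hypothetical stand-in for a column reference; literals are plain values.
Column = namedtuple("Column", ["name"])

# Moving the literal from the left to the right side mirrors the operator.
FLIPPED = {"<": ">", "<=": ">=", ">": "<", ">=": "<="}


def to_column_filter(left, op, right):
    """Return (column_name, op, literal) with the column on the left side.

    Returns None when the predicate is not a column/literal comparison.
    """
    if isinstance(left, Column) and not isinstance(right, Column):
        return (left.name, op, right)
    if isinstance(right, Column) and not isinstance(left, Column):
        # `10 < someVar` means `someVar > 10`, so the operator is flipped.
        return (right.name, FLIPPED[op], left)
    return None


print(to_column_filter(10, "<", Column("someVar")))
```

The bug described above corresponds to reusing `op` unchanged in the second branch.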
* [SPARK-4327] [PySpark] Python API for RDD.randomSplit() | Davies Liu | 2014-11-18 | 2 | -3/+41

```
pyspark.RDD.randomSplit(self, weights, seed=None)
    Randomly splits this RDD with the provided weights.

    :param weights: weights for splits, will be normalized if they don't sum to 1
    :param seed: random seed
    :return: split RDDs in a list

    >>> rdd = sc.parallelize(range(10), 1)
    >>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
    >>> rdd1.collect()
    [3, 6]
    >>> rdd2.collect()
    [0, 5, 7]
    >>> rdd3.collect()
    [1, 2, 4, 8, 9]
```

Author: Davies Liu <davies@databricks.com>

Closes #3193 from davies/randomSplit and squashes the following commits:

78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
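
How weights that "don't sum to 1" can be normalized is sketched below in plain Python, without Spark. `split_boundaries` and `random_split` are hypothetical helpers, and the single-pass assignment is a simplification chosen for the example; it is not a copy of the PySpark implementation and will not reproduce the sample output in the docstring above.

```python
import bisect
import random


def split_boundaries(weights):
    """Normalize weights into consecutive [lo, hi) ranges covering [0, 1)."""
    total = float(sum(weights))  # float() also covers integer weights
    bounds = [0.0]
    for w in weights:
        bounds.append(bounds[-1] + w / total)
    return list(zip(bounds[:-1], bounds[1:]))


def random_split(items, weights, seed=None):
    """Assign each item to exactly one split according to the weights."""
    rng = random.Random(seed)
    ranges = split_boundaries(weights)
    uppers = [hi for _, hi in ranges[:-1]]  # the last split takes the rest
    splits = [[] for _ in ranges]
    for item in items:
        splits[bisect.bisect_right(uppers, rng.random())].append(item)
    return splits


print(split_boundaries([0.4, 0.6, 1.0]))
print(random_split(range(10), [0.4, 0.6, 1.0], seed=11))
```

With weights `[0.4, 0.6, 1.0]` the normalized shares are 20%, 30% and 50%, which is why the third split in the docstring example is the largest.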
* [SPARK-4433] fix a race condition in zipWithIndex | Xiangrui Meng | 2014-11-18 | 2 | -14/+22

Spark hangs with the following code:

~~~
sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
~~~

This is because ZippedWithIndexRDD triggers a job in getPartitions, which causes a deadlock in DAGScheduler.getPreferredLocs (synced). The fix is to compute `startIndices` during construction.

This should be applied to branch-1.0, branch-1.1, and branch-1.2.

pwendell

Author: Xiangrui Meng <meng@databricks.com>

Closes #3291 from mengxr/SPARK-4433 and squashes the following commits:

c284d9f [Xiangrui Meng] fix a racing condition in zipWithIndex
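
The idea of computing the start indices once, at construction time, can be sketched without Spark. `ZippedWithIndex` and `start_indices` below are hypothetical, in-memory stand-ins: partitions are plain lists, and "running a job" is reduced to taking their lengths.

```python
from itertools import accumulate


def start_indices(partition_sizes):
    """Index of the first element of each partition (exclusive prefix sum)."""
    if not partition_sizes:
        return []
    return [0] + list(accumulate(partition_sizes))[:-1]


class ZippedWithIndex:
    def __init__(self, partitions):
        self.partitions = partitions
        # Computed eagerly here, so later lookups never trigger extra work
        # from inside a method that may be called while a lock is held.
        self.start_indices = start_indices([len(p) for p in partitions])

    def compute(self, i):
        start = self.start_indices[i]
        return [(x, start + k) for k, x in enumerate(self.partitions[i])]


z = ZippedWithIndex([["a", "b"], [], ["c"]])
print(z.start_indices, z.compute(2))
```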
* [SPARK-3721] [PySpark] broadcast objects larger than 2G | Davies Liu | 2014-11-18 | 9 | -27/+257

This patch adds support for broadcasting objects larger than 2G.

pickle, zlib, FrameSerializer and Array[Byte] cannot handle objects larger than 2G, so this patch introduces LargeObjectSerializer to serialize broadcast objects: the object is serialized and compressed into small chunks. It also changes the type of `Broadcast[Array[Byte]]` into `Broadcast[Array[Array[Byte]]]`.

Testing support for broadcast objects larger than 2G is slow and memory hungry, so this was tested manually; a test could be added to SparkPerf.

Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2659 from davies/huge and squashes the following commits:

7b57a14 [Davies Liu] add more tests for broadcast
28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
a2f6a02 [Davies Liu] bug fix
4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
5875c73 [Davies Liu] address comments
10a349b [Davies Liu] address comments
0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
6182c8f [Davies Liu] Merge branch 'master' into huge
d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
2514848 [Davies Liu] address comments
fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
1c2d928 [Davies Liu] fix scala style
091b107 [Davies Liu] broadcast objects larger than 2G
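
The chunked representation can be illustrated with the standard library. `dump_chunks` and `load_chunks` are hypothetical helpers, not the `LargeObjectSerializer` from the patch, and the sketch deliberately ignores the real difficulty (it still pickles the whole object in one call, so it does not lift any 2G limit); it only shows how one large payload becomes a list of small byte arrays and back.

```python
import pickle
import zlib


def dump_chunks(obj, chunk_size=1 << 20):
    """Serialize and compress obj, then cut the bytes into fixed-size chunks."""
    data = zlib.compress(pickle.dumps(obj, protocol=2))
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def load_chunks(chunks):
    """Inverse of dump_chunks: join, decompress, and deserialize."""
    return pickle.loads(zlib.decompress(b"".join(chunks)))


chunks = dump_chunks(list(range(1000)), chunk_size=64)
print(len(chunks), load_chunks(chunks) == list(range(1000)))
```

A list of byte chunks maps naturally onto `Array[Array[Byte]]` on the JVM side, where a single array cannot exceed 2G.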
* [SPARK-4306] [MLlib] Python API for LogisticRegressionWithLBFGS | Davies Liu | 2014-11-18 | 2 | -4/+88

```
class LogisticRegressionWithLBFGS
 |  train(cls, data, iterations=100, initialWeights=None, corrections=10, tolerance=0.0001, regParam=0.01, intercept=False)
 |      Train a logistic regression model on the given data.
 |
 |      :param data:           The training data, an RDD of LabeledPoint.
 |      :param iterations:     The number of iterations (default: 100).
 |      :param initialWeights: The initial weights (default: None).
 |      :param regParam:       The regularizer parameter (default: 0.01).
 |      :param regType:        The type of regularizer used for training
 |                             our model.
 |                             :Allowed values:
 |                               - "l1" for using L1 regularization
 |                               - "l2" for using L2 regularization
 |                               - None for no regularization
 |                               (default: "l2")
 |      :param intercept:      Boolean parameter which indicates the use
 |                             or not of the augmented representation for
 |                             training data (i.e. whether bias features
 |                             are activated or not).
 |      :param corrections:    The number of corrections used in the LBFGS update (default: 10).
 |      :param tolerance:      The convergence tolerance of iterations for L-BFGS (default: 1e-4).
 |
 |      >>> data = [
 |      ...     LabeledPoint(0.0, [0.0, 1.0]),
 |      ...     LabeledPoint(1.0, [1.0, 0.0]),
 |      ... ]
 |      >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data))
 |      >>> lrm.predict([1.0, 0.0])
 |      1
 |      >>> lrm.predict([0.0, 1.0])
 |      0
 |      >>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
 |      [1, 0]
```

Author: Davies Liu <davies@databricks.com>

Closes #3307 from davies/lbfgs and squashes the following commits:

34bd986 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into lbfgs
5a945a6 [Davies Liu] address comments
941061b [Davies Liu] Merge branch 'master' of github.com:apache/spark into lbfgs
03e5543 [Davies Liu] add it to docs
ed2f9a8 [Davies Liu] add regType
76cd1b6 [Davies Liu] reorder arguments
4429a74 [Davies Liu] Update classification.py
9252783 [Davies Liu] python api for LogisticRegressionWithLBFGS
* [SPARK-4463] Add (de)select all button for add'l metrics. | Kay Ousterhout | 2014-11-18 | 2 | -7/+14

This commit removes the behavior where when a user clicks "Show additional metrics" on the stage page, all of the additional metrics are automatically selected; now, collapsing and expanding the additional metrics has no effect on which options are selected. Instead, there's a "(De)select All" box at the top; checking this box checks all additional metrics (and similarly, unchecking it unchecks all additional metrics). This commit is intended to be backported to 1.2, so that the additional metrics behavior is not confusing to users.

Now when a user clicks the "Show additional metrics" menu, this is what it looks like:

![image](https://cloud.githubusercontent.com/assets/1108612/5094347/1541ead6-6f15-11e4-8e8c-25a65ddbdfb2.png)

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #3331 from kayousterhout/SPARK-4463 and squashes the following commits:

9e17cea [Kay Ousterhout] Added italics
b731230 [Kay Ousterhout] [SPARK-4463] Add (de)select all button for add'l metrics.
* [SPARK-4017] show progress bar in console | Davies Liu | 2014-11-18 | 8 | -1/+141

The progress bar will look like this:

![1___spark_job__85_250_finished__4_are_running___java_](https://cloud.githubusercontent.com/assets/40902/4854813/a02f44ac-6099-11e4-9060-7c73a73151d6.png)

In the right corner, the numbers are: finished tasks, running tasks, total tasks.

After the stage has finished, it will disappear.

The progress bar is only shown if the logging level is WARN or higher (but progress in the title is still shown); it can be turned off with `spark.driver.showConsoleProgress`.

Author: Davies Liu <davies@databricks.com>

Closes #3029 from davies/progress and squashes the following commits:

95336d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
fc49ac8 [Davies Liu] address commentse
2e90f75 [Davies Liu] show multiple stages in same time
0081bcc [Davies Liu] address comments
38c42f1 [Davies Liu] fix tests
ab87958 [Davies Liu] disable progress bar during tests
30ac852 [Davies Liu] re-implement progress bar
b3f34e5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
6fd30ff [Davies Liu] show progress bar if no task finished in 500ms
e4e7344 [Davies Liu] refactor
e1f524d [Davies Liu] revert unnecessary change
a60477c [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
5cae3f2 [Davies Liu] fix style
ea49fe0 [Davies Liu] address comments
bc53d99 [Davies Liu] refactor
e6bb189 [Davies Liu] fix logging in sparkshell
7e7d4e7 [Davies Liu] address commments
5df26bb [Davies Liu] fix style
9e42208 [Davies Liu] show progress bar in console and title
* [SPARK-4404] remove sys.exit() in shutdown hook | Davies Liu | 2014-11-18 | 1 | -1/+1

If SparkSubmit dies first, the bootstrapper will be blocked by the shutdown hook. Calling sys.exit() in a shutdown hook causes some kind of deadlock.

cc andrewor14

Author: Davies Liu <davies@databricks.com>

Closes #3289 from davies/fix_bootstraper and squashes the following commits:

ea5cdd1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_bootstraper
e04b690 [Davies Liu] remove sys.exit in hook
4d11366 [Davies Liu] remove shutdown hook if subprocess die fist
* [SPARK-4075][SPARK-4434] Fix the URI validation logic for Application Jar name. | Kousuke Saruta | 2014-11-18 | 2 | -3/+28

This PR adds a regression test for SPARK-4434.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3326 from sarutak/add-triple-slash-testcase and squashes the following commits:

82bc9cc [Kousuke Saruta] Fixed wrong grammar in comment
9149027 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into add-triple-slash-testcase
c1c80ca [Kousuke Saruta] Fixed style
4f30210 [Kousuke Saruta] Modified comments
9e09da2 [Kousuke Saruta] Fixed URI validation for jar file
d4b99ef [Kousuke Saruta] [SPARK-4075] [Deploy] Jar url validation is not enough for Jar file
ac79906 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into add-triple-slash-testcase
6d4f47e [Kousuke Saruta] Added a test case as a regression check for SPARK-4434
* [SQL] Support partitioned parquet tables that have the key in both the directory and the file | Michael Armbrust | 2014-11-18 | 2 | -68/+108

Author: Michael Armbrust <michael@databricks.com>

Closes #3272 from marmbrus/keyInPartitionedTable and squashes the following commits:

447f08c [Michael Armbrust] Support partitioned parquet tables that have the key in both the directory and the file
* [SPARK-4396] allow lookup by index in Python's Rating | Xiangrui Meng | 2014-11-18 | 1 | -11/+15

In PySpark, ALS can take an RDD of (user, product, rating) tuples as input. However, model.predict outputs an RDD of Rating. So on the input side, users can use r[0], r[1], r[2], while on the output side, users have to use r.user, r.product, r.rating. We should allow lookup by index in Rating by making Rating a namedtuple.

davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #3261 from mengxr/SPARK-4396 and squashes the following commits:

543aef0 [Xiangrui Meng] use named tuple to implement ALS
0b61bae [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4396
d3bd7d4 [Xiangrui Meng] allow lookup by index in Python's Rating
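
A short sketch of why a namedtuple addresses this, using only the standard library. The field names follow the description above; the `Rating` defined here is a local illustration, not an import from `pyspark.mllib`.

```python
from collections import namedtuple

# Local illustration only; field names taken from the description above.
Rating = namedtuple("Rating", ["user", "product", "rating"])

r = Rating(1, 2, 5.0)

# The same object supports both access styles.
print(r.user, r.product, r.rating)
print(r[0], r[1], r[2])

# It also unpacks like the plain tuples accepted on the input side.
user, product, rating = r
```

Because a namedtuple is a tuple subclass, code written against `(user, product, rating)` tuples keeps working on the output of `predict`.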
* [SPARK-4435] [MLlib] [PySpark] improve classification | Davies Liu | 2014-11-18 | 3 | -31/+108

This PR adds setThreshold() and clearThreshold() for LogisticRegressionModel and SVMModel, and also supports an RDD of vectors in LogisticRegressionModel.predict(), SVMModel.predict() and NaiveBayes.predict().

Author: Davies Liu <davies@databricks.com>

Closes #3305 from davies/setThreshold and squashes the following commits:

d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into setThreshold
e4acd76 [Davies Liu] address comments
2231a5f [Davies Liu] bugfix
7bd9009 [Davies Liu] address comments
0b0a8a7 [Davies Liu] address comments
c1e5573 [Davies Liu] improve classification
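
What a set/clear threshold pair typically does is sketched below. `ThresholdModel` is a hypothetical, Spark-free logistic model: the 0.5 default, the snake_case method names, and accepting a list of vectors in place of an RDD are all assumptions made for the example.

```python
import math


class ThresholdModel:
    """Toy logistic model with an optional decision threshold."""

    def __init__(self, weights, intercept=0.0):
        self.weights = weights
        self.intercept = intercept
        self._threshold = 0.5  # assumed default

    def set_threshold(self, value):
        self._threshold = value

    def clear_threshold(self):
        # With no threshold, predict() returns the raw score.
        self._threshold = None

    def predict(self, x):
        # A list of vectors stands in for "an RDD of vectors" here.
        if x and isinstance(x[0], (list, tuple)):
            return [self.predict(v) for v in x]
        margin = sum(w * v for w, v in zip(self.weights, x)) + self.intercept
        score = 1.0 / (1.0 + math.exp(-margin))
        if self._threshold is None:
            return score
        return 1 if score > self._threshold else 0


model = ThresholdModel([1.0, -1.0])
print(model.predict([[1.0, 0.0], [0.0, 1.0]]))
model.clear_threshold()
print(model.predict([1.0, 0.0]))
```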
* ALS implicit: added missing parameter alpha in doc string | Felix Maximilian Möller | 2014-11-18 | 1 | -2/+3

Author: Felix Maximilian Möller <felixmaximilian.moeller@immobilienscout24.de>

Closes #3343 from felixmaximilian/fix-documentation and squashes the following commits:

43dcdfb [Felix Maximilian Möller] Removed the information about the switch implicitPrefs. The parameter implicitPrefs cannot be set in this context because it is inherent true when calling the trainImplicit method.
7d172ba [Felix Maximilian Möller] added missing parameter alpha in doc string.
* SPARK-4466: Provide support for publishing Scala 2.11 artifacts to Maven | Patrick Wendell | 2014-11-17 | 2 | -34/+106

The maven release plug-in does not have support for publishing two separate sets of artifacts for a single release. Because of the way that Scala 2.11 support in Spark works, we have to write some customized code to do this. The good news is that the Maven release API is just a thin wrapper on doing git commits and pushing artifacts to the HTTP API of Apache's Sonatype server and this might overall make our deployment easier to understand.

This was already used for the 1.2 snapshot, so I think it is working well. One other nice thing is this could be pretty easily extended to publish nightly snapshots.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #3332 from pwendell/releases and squashes the following commits:

2fedaed [Patrick Wendell] Automate the opening and closing of Sonatype repos
e2a24bb [Patrick Wendell] Fixing issue where we overrode non-spark version numbers
9df3a50 [Patrick Wendell] Adding TODO
1cc1749 [Patrick Wendell] Don't build the thriftserver for 2.11
933201a [Patrick Wendell] Make tagging of release commit eager
d0388a6 [Patrick Wendell] Support Scala 2.11 build
4f4dc62 [Patrick Wendell] Change to 2.11 should not be included when committing new patch
bf742e1 [Patrick Wendell] Minor fixes
ffa1df2 [Patrick Wendell] Adding a Scala 2.11 package to test it
9ac4381 [Patrick Wendell] Addressing TODO
b3105ff [Patrick Wendell] Removing commented out code
d906803 [Patrick Wendell] Small fix
3f4d985 [Patrick Wendell] More work
fcd54c2 [Patrick Wendell] Consolidating use of keys
df2af30 [Patrick Wendell] Changes to release stuff
* [SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code | Cheng Lian | 2014-11-17 | 5 | -693/+161

While reviewing PR #3083 and #3161, I noticed that the Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generated the record filter, and then called `findExpression` to traverse the generated filter to find all pushed-down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. For this to work, we had to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filters, which complicated the code base a lot.

The basic idea of this PR is that we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.

Author: Cheng Lian <lian@databricks.com>

Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:

d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic
* [SPARK-4448] [SQL] unwrap for the ConstantObjectInspector | Cheng Hao | 2014-11-17 | 1 | -4/+32

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3308 from chenghao-intel/unwrap_constant_oi and squashes the following commits:

156b500 [Cheng Hao] rebase the master
c5b20ab [Cheng Hao] unwrap for the ConstantObjectInspector
* [SPARK-4443][SQL] Fix statistics for external table in spark sql hive | w00228970 | 2014-11-17 | 3 | -3/+12

The `totalSize` of an external table is always zero, which influences the join strategy (a broadcast join is always used for external tables).

Author: w00228970 <wangfei1@huawei.com>

Closes #3304 from scwf/statistics and squashes the following commits:

568f321 [w00228970] fix statistics for external table
* [SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types | Cheng Lian | 2014-11-17 | 4 | -114/+141

This PR is exactly the same as #3178 except it reverts the `FileStatus.isDir` to `FileStatus.isDirectory` change, since it doesn't compile with Hadoop 1.

Author: Cheng Lian <lian@databricks.com>

Closes #3298 from liancheng/date-for-thriftserver and squashes the following commits:

866037e [Cheng Lian] Revers isDirectory to isDir (it breaks Hadoop 1 profile)
6f71d0b [Cheng Lian] Makes toHiveString static
26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
a92882a [Cheng Lian] Updates HiveShim for 0.13.1
73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
* [SQL] Construct the MutableRow from an Array | Cheng Hao | 2014-11-17 | 1 | -2/+4

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3217 from chenghao-intel/mutablerow and squashes the following commits:

e8a10bd [Cheng Hao] revert the change of Row object
4681aea [Cheng Hao] Add toMutableRow method in object Row
a751838 [Cheng Hao] Construct the MutableRow from an existed row
* [SPARK-4425][SQL] Handle NaN or Infinity cast to Timestamp correctly. | Takuya UESHIN | 2014-11-17 | 2 | -2/+17

`Cast` from `NaN` or `Infinity` of `Double` or `Float` to `TimestampType` throws `NumberFormatException`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3283 from ueshin/issues/SPARK-4425 and squashes the following commits:

14def0c [Takuya UESHIN] Fix Cast to be able to handle NaN or Infinity to TimestampType.
* [SPARK-4420][SQL] Change nullability of Cast from DoubleType/FloatType to DecimalType. | Takuya UESHIN | 2014-11-17 | 2 | -2/+14

This is a follow-up of [SPARK-4390](https://issues.apache.org/jira/browse/SPARK-4390) (#3256).

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3278 from ueshin/issues/SPARK-4420 and squashes the following commits:

7fea558 [Takuya UESHIN] Add some tests.
cb2301a [Takuya UESHIN] Fix tests.
133bad5 [Takuya UESHIN] Change nullability of Cast from DoubleType/FloatType to DecimalType.
* [SQL] Makes conjunction pushdown more aggressive for in-memory table | Cheng Lian | 2014-11-17 | 2 | -5/+11

This is inspired by the [Parquet record filter generation code](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala#L387-L400).

Author: Cheng Lian <lian@databricks.com>

Closes #3318 from liancheng/aggresive-conj-pushdown and squashes the following commits:

78b69d2 [Cheng Lian] Makes conjunction pushdown more aggressive
* [SPARK-4180] [Core] Prevent creation of multiple active SparkContextsJosh Rosen2014-11-179-126/+347
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details). **The solution implemented here is only a partial fix.** A complete fix would have the following properties: 1. Only one SparkContext may ever be under construction at any given time. 2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped. 3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194). 4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts. This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release. ### The correct solution: I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object. Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.). Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor. 
For example: ```scala class SparkContext private (deps: SparkContextDependencies) { def this(conf: SparkConf) { this(SparkContext.getDeps(conf)) } } object SparkContext( private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized { if (anotherSparkContextIsActive) { throw Exception(...) } var dagScheduler: DAGScheduler = null try { dagScheduler = new DAGScheduler(...) [...] } catch { case e: Exception => Option(dagScheduler).foreach(_.stop()) [...] } SparkContextDependencies(dagScheduler, ....) } } ``` This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up. This indirection is necessary to maintain binary compatibility. In retrospect, it would have been nice if SparkContext had no private constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier. ### Alternative solutions: As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block. Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block. If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures. The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification. ### This PR's solution: - At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception. 
- If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt).
- At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context. If so, throw an exception.

This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor). If two threads race to construct SparkContexts, then one of them will win and the other will throw an exception. This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true`. The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts. I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3121 from JoshRosen/SPARK-4180 and squashes the following commits:

23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
d38251b [Josh Rosen] Address latest round of feedback.
c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods.
85a424a [Josh Rosen] Incorporate more review feedback.
372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
f5bb78c [Josh Rosen] Update mvn build, too.
d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts.
79a7e6f [Josh Rosen] Fix commented out test
a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
7ba6db8 [Josh Rosen] Add utility to set system properties in tests.
4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests.
ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests.
1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite
d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet.
c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging.
918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation.
afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts.
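The check-at-start / check-at-end scheme described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the actual SparkContext code: `Context`, `create`, `stop`, and `failOrWarn` are hypothetical names, and the `allowMultipleContexts` parameter stands in for the `spark.driver.allowMultipleContexts` setting.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for SparkContext; only the guard logic is modelled.
class Context private (val name: String)

object Context {
  // The single active context, or null if none is active (hypothetical field).
  private val active = new AtomicReference[Context](null)

  // Throw by default; downgrade to a warning when multiple contexts are allowed.
  private def failOrWarn(allowMultipleContexts: Boolean): Unit = {
    val msg = "Only one context may be active in this JVM"
    if (allowMultipleContexts) Console.err.println(s"WARN: $msg")
    else throw new IllegalStateException(msg)
  }

  def create(name: String, allowMultipleContexts: Boolean = false): Context = {
    // Check at the start of construction.
    if (active.get() != null) failOrWarn(allowMultipleContexts)
    val ctx = new Context(name) // expensive initialization would happen here
    // Check again at the end: another thread may have raced and won.
    if (!active.compareAndSet(null, ctx)) {
      failOrWarn(allowMultipleContexts)
      active.set(ctx) // only reached when multiple contexts are allowed
    }
    ctx
  }

  // Clear the active marker, but only if this context is the active one.
  def stop(ctx: Context): Unit = {
    active.compareAndSet(ctx, null)
    ()
  }
}
```

Because the final `compareAndSet` is atomic, at most one of two racing `create` calls can install itself as the active context; the loser either throws or, when multiple contexts are allowed, only warns.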
* [DOCS][SQL] Fix broken link to Row class scaladocAndy Konwinski2014-11-171-1/+1
| | | | | | | | Author: Andy Konwinski <andykonwinski@gmail.com> Closes #3323 from andyk/patch-2 and squashes the following commits: 4699fdc [Andy Konwinski] Fix broken link to Row class scaladoc
* Revert "[SPARK-4075] [Deploy] Jar url validation is not enough for Jar file"Andrew Or2014-11-172-16/+1
| | | | This reverts commit 098f83c7ccd7dad9f9228596da69fe5f55711a52.
* [SPARK-4444] Drop VD type parameter from EdgeRDDAnkur Dave2014-11-177-50/+40
| | | | | | | | | | | | | Due to vertex attribute caching, EdgeRDD previously took two type parameters: ED and VD. However, this is an implementation detail that should not be exposed in the interface, so this PR drops the VD type parameter. This requires removing the `filter` method from the EdgeRDD interface, because it depends on vertex attribute caching. Author: Ankur Dave <ankurdave@gmail.com> Closes #3303 from ankurdave/edgerdd-drop-tparam and squashes the following commits: 38dca9b [Ankur Dave] Leave EdgeRDD.fromEdges public fafeb51 [Ankur Dave] Drop VD type parameter from EdgeRDD
* SPARK-2811 upgrade algebird to 0.8.1Adam Pingel2014-11-173-7/+7
| | | | | | | | | Author: Adam Pingel <adam@axle-lang.org> Closes #3282 from adampingel/master and squashes the following commits: 70c8d3c [Adam Pingel] relocate the algebird example back to example/src 7a9d8be [Adam Pingel] SPARK-2811 upgrade algebird to 0.8.1
* SPARK-4445, Don't display storage level in toDebugString unless RDD is ↵Prashant Sharma2014-11-171-1/+1
| | | | | | | | | | persisted. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #3310 from ScrapCodes/SPARK-4445/rddDebugStringFix and squashes the following commits: 4e57c52 [Prashant Sharma] SPARK-4445, Don't display storage level in toDebugString unless RDD is persisted
* [SPARK-4410][SQL] Add support for external sortMichael Armbrust2014-11-164-6/+59
| | | | | | | | | | | | Adds a new operator that uses Spark's `ExternalSorter` class. It is off by default now, but we might consider making it the default if benchmarks show that it does not regress performance. Author: Michael Armbrust <michael@databricks.com> Closes #3268 from marmbrus/externalSort and squashes the following commits: 48b9726 [Michael Armbrust] comments b98799d [Michael Armbrust] Add test afd7562 [Michael Armbrust] Add support for external sort.
* [SPARK-4422][MLLIB]In some cases, Vectors.fromBreeze get wrong results.GuoQiang Li2014-11-162-1/+8
| | | | | | | | | | | | cc mengxr Author: GuoQiang Li <witgo@qq.com> Closes #3281 from witgo/SPARK-4422 and squashes the following commits: 5f1fa5e [GuoQiang Li] import order 50783bd [GuoQiang Li] review commits 7a10123 [GuoQiang Li] In some cases, Vectors.fromBreeze get wrong results.
* Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, ↵Michael Armbrust2014-11-164-142/+115
| | | | | | | | | | and fixes for complex types" Author: Michael Armbrust <michael@databricks.com> Closes #3292 from marmbrus/revert4309 and squashes the following commits: 808e96e [Michael Armbrust] Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types"
* [SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes ↵Cheng Lian2014-11-164-115/+142
| | | | | | | | | | | | | | | | | | | for complex types SPARK-4407 was detected while working on SPARK-4309. Merged these two into a single PR since 1.2.0 RC is approaching. Author: Cheng Lian <lian@databricks.com> Closes #3178 from liancheng/date-for-thriftserver and squashes the following commits: 6f71d0b [Cheng Lian] Makes toHiveString static 26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim a92882a [Cheng Lian] Updates HiveShim for 0.13.1 73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
* [SPARK-4393] Fix memory leak in ConnectionManager ACK timeout TimerTasks; ↵Josh Rosen2014-11-161-12/+35
| | | | | | | | | | | | | | | | | | | | use HashedWheelTimer This patch is intended to fix a subtle memory leak in ConnectionManager's ACK timeout TimerTasks: in the old code, each TimerTask held a reference to the message being sent and a cancelled TimerTask won't necessarily be garbage-collected until it's scheduled to run, so this caused huge buildups of messages that weren't garbage collected until their timeouts expired, leading to OOMs. This patch addresses this problem by capturing only the message ID in the TimerTask instead of the whole message, and by keeping a WeakReference to the promise in the TimerTask. I've also modified this code to use Netty's HashedWheelTimer, whose performance characteristics should be better for this use-case. Thanks to cristianopris for narrowing down this issue! Author: Josh Rosen <joshrosen@databricks.com> Closes #3259 from JoshRosen/connection-manager-timeout-bugfix and squashes the following commits: afcc8d6 [Josh Rosen] Address rxin's review feedback. 2a2e92d [Josh Rosen] Keep only WeakReference to promise in TimerTask; 0f0913b [Josh Rosen] Spelling fix: timout => timeout 3200c33 [Josh Rosen] Use Netty HashedWheelTimer f847dd4 [Josh Rosen] Don't capture entire message in ACK timeout task.
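The leak fix described above (capture only the message ID, and hold the promise through a `WeakReference`) can be sketched as follows. This is an illustration under assumptions, not the ConnectionManager code: `AckTimeout` and `timeoutTask` are hypothetical names, and a plain `java.util.TimerTask` is used in place of Netty's `HashedWheelTimer` to keep the sketch dependency-free.

```scala
import java.io.IOException
import java.lang.ref.WeakReference
import java.util.TimerTask
import scala.concurrent.Promise

object AckTimeout {
  // Builds a timeout task that retains only the message ID and a weak
  // reference to the promise, never the message payload itself.
  def timeoutTask(messageId: Int, promise: Promise[Unit]): TimerTask = {
    val promiseRef = new WeakReference(promise)
    new TimerTask {
      override def run(): Unit = {
        // If the promise has already been garbage-collected, there is nothing to fail.
        Option(promiseRef.get()).foreach { p =>
          p.tryFailure(new IOException(s"ACK timeout for message $messageId"))
        }
      }
    }
  }
}
```

Since the task references neither the message nor the promise strongly, a cancelled task that lingers in the timer queue until its deadline no longer keeps large message buffers reachable.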
* [SPARK-4426][SQL][Minor] The symbol of BitwiseOr is wrong, should not be '&'Kousuke Saruta2014-11-151-1/+1
| | | | | | | | | | The symbol of BitwiseOr is defined as '&' but I think it's wrong. It should be '|'. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3284 from sarutak/bitwise-or-symbol-fix and squashes the following commits: aff4be5 [Kousuke Saruta] Fixed symbol of BitwiseOr
* [SPARK-4419] Upgrade snappy-java to 1.1.1.6Josh Rosen2014-11-151-1/+1
| | | | | | | | | | | | This upgrades snappy-java to 1.1.1.6, which includes a patch that improves error messages when attempting to deserialize empty inputs using SnappyInputStream (see xerial/snappy-java#89). We previously tried to upgrade to 1.1.1.5 in #2911 but reverted that patch after discovering a memory leak in snappy-java. This leak should have been fixed in 1.1.1.6, though (see xerial/snappy-java#92). Author: Josh Rosen <joshrosen@databricks.com> Closes #3287 from JoshRosen/SPARK-4419 and squashes the following commits: 5d6f4cc [Josh Rosen] [SPARK-4419] Upgrade snappy-java to 1.1.1.6.
* [SPARK-2321] Several progress API improvements / refactoringsJosh Rosen2014-11-147-172/+269
| | | | | | | | | | | | | | | | | | | | This PR refactors / extends the status API introduced in #2696. - Change StatusAPI from a mixin trait to a class. Before, the new status API methods were directly accessible through SparkContext, whereas now they're accessed through a `sc.statusAPI` field. As long as we were going to add these methods directly to SparkContext, the mixin trait seemed like a good idea, but this might be simpler to reason about and may avoid pitfalls that I've run into while attempting to refactor other parts of SparkContext to use mixins (see #3071, for example). - Change the name from SparkStatusAPI to SparkStatusTracker. - Make `getJobIdsForGroup(null)` return ids for jobs that aren't associated with any job group. - Add `getActiveStageIds()` and `getActiveJobIds()` methods that return the ids of whatever's currently active in this SparkContext. This should simplify davies's progress bar code. Author: Josh Rosen <joshrosen@databricks.com> Closes #3197 from JoshRosen/progress-api-improvements and squashes the following commits: 30b0afa [Josh Rosen] Rename SparkStatusAPI to SparkStatusTracker. d1b08d8 [Josh Rosen] Add missing newlines 2cc7353 [Josh Rosen] Add missing file. d5eab1f [Josh Rosen] Add getActive[Stage|Job]Ids() methods. a227984 [Josh Rosen] getJobIdsForGroup(null) should return jobs for default group c47e294 [Josh Rosen] Remove StatusAPI mixin trait.
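The shape of this refactoring (a status object reached through a field instead of methods mixed into the context) can be sketched as follows. This is a simplified, hypothetical model and not the real SparkStatusTracker API: `StatusTracker`, `MiniContext`, and the constructor arguments are invented for illustration, and the return types are simplified.

```scala
// Hypothetical, simplified model of a status tracker.
class StatusTracker(jobGroups: Map[Int, Option[String]], activeJobs: Set[Int]) {
  // A null group selects the jobs that are not associated with any job group.
  def getJobIdsForGroup(group: String): Seq[Int] = {
    val wanted = Option(group) // Option(null) is None
    jobGroups.collect { case (id, g) if g == wanted => id }.toSeq.sorted
  }

  // IDs of whatever is currently active.
  def getActiveJobIds(): Seq[Int] = activeJobs.toSeq.sorted
}

// Hypothetical stand-in for SparkContext.
class MiniContext(jobGroups: Map[Int, Option[String]], activeJobs: Set[Int]) {
  // Exposed as a field rather than mixed into the context as a trait.
  val statusAPI = new StatusTracker(jobGroups, activeJobs)
}
```

Keeping the tracker behind a field leaves the context's own method surface unchanged when status methods are added later, which is one way to read the "simpler to reason about" argument above.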
* Added contains(key) to Metadatakai2014-11-142-0/+16
| | | | | | | | | | | Add contains(key) to org.apache.spark.sql.catalyst.util.Metadata to test the existence of a key. Otherwise, Metadata's get methods may throw a NoSuchElementException if the key does not exist. Test cases are added to MetadataSuite as well. Author: kai <kaizeng@eecs.berkeley.edu> Closes #3273 from kai-zeng/metadata-fix and squashes the following commits: 74b3d03 [kai] Added contains(key) to Metadata
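Why a `contains(key)` check helps can be illustrated with a small sketch. `SimpleMetadata` is a hypothetical stand-in backed by a plain `Map`, not the real `Metadata` class; it only assumes that the getters look keys up directly.

```scala
// Hypothetical, simplified stand-in for a metadata container.
class SimpleMetadata(private val map: Map[String, Any]) {
  // Lets callers test for a key before calling a getter.
  def contains(key: String): Boolean = map.contains(key)

  // Map.apply throws NoSuchElementException when the key is missing.
  def getLong(key: String): Long = map(key).asInstanceOf[Long]
}
```

A caller can then guard the lookup, for example `if (m.contains("maxLen")) m.getLong("maxLen") else 0L`, instead of catching the exception.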