aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-12204][SPARKR] Implement drop method for DataFrame in SparkR.Sun Rui2016-01-206-27/+101
| | | | | | Author: Sun Rui <rui.sun@intel.com> Closes #10201 from sun-rui/SPARK-12204.
* [SPARK-12910] Fixes : R version for installing sparkRShubhanshu Mishra2016-01-202-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Testing code: ``` $ ./install-dev.sh USING R_HOME = /usr/bin ERROR: this R is version 2.15.1, package 'SparkR' requires R >= 3.0 ``` Using the new argument: ``` $ ./install-dev.sh /content/username/SOFTWARE/R-3.2.3 USING R_HOME = /content/username/SOFTWARE/R-3.2.3/bin * installing *source* package ‘SparkR’ ... ** R ** inst ** preparing package for lazy loading Creating a new generic function for ‘colnames’ in package ‘SparkR’ Creating a new generic function for ‘colnames<-’ in package ‘SparkR’ Creating a new generic function for ‘cov’ in package ‘SparkR’ Creating a new generic function for ‘na.omit’ in package ‘SparkR’ Creating a new generic function for ‘filter’ in package ‘SparkR’ Creating a new generic function for ‘intersect’ in package ‘SparkR’ Creating a new generic function for ‘sample’ in package ‘SparkR’ Creating a new generic function for ‘transform’ in package ‘SparkR’ Creating a new generic function for ‘subset’ in package ‘SparkR’ Creating a new generic function for ‘summary’ in package ‘SparkR’ Creating a new generic function for ‘lag’ in package ‘SparkR’ Creating a new generic function for ‘rank’ in package ‘SparkR’ Creating a new generic function for ‘sd’ in package ‘SparkR’ Creating a new generic function for ‘var’ in package ‘SparkR’ Creating a new generic function for ‘predict’ in package ‘SparkR’ Creating a new generic function for ‘rbind’ in package ‘SparkR’ Creating a generic function for ‘lapply’ from package ‘base’ in package ‘SparkR’ Creating a generic function for ‘Filter’ from package ‘base’ in package ‘SparkR’ Creating a generic function for ‘alias’ from package ‘stats’ in package ‘SparkR’ Creating a generic function for ‘substr’ from package ‘base’ in package ‘SparkR’ Creating a generic function for ‘%in%’ from package ‘base’ in package ‘SparkR’ Creating a generic function for ‘mean’ from package ‘base’ in package ‘SparkR’ Creating a generic function for ‘unique’ from package ‘base’ in package ‘SparkR’ Creating a generic function for ‘nrow’ from package ‘base’ in package ‘SparkR’ Creating a generic function for ‘ncol’ from package ‘base’ in package ‘SparkR’ Creating a generic function for ‘head’ from package ‘utils’ in package ‘SparkR’ Creating a generic function for ‘factorial’ from package ‘base’ in package ‘SparkR’ Creating a generic function for ‘atan2’ from package ‘base’ in package ‘SparkR’ Creating a generic function for ‘ifelse’ from package ‘base’ in package ‘SparkR’ ** help No man pages found in package ‘SparkR’ *** installing help indices ** building package indices ** testing if installed package can be loaded * DONE (SparkR) ``` Author: Shubhanshu Mishra <smishra8@illinois.edu> Closes #10836 from napsternxg/master.
* [SPARK-8968] [SQL] [HOT-FIX] Fix scala 2.11 build.Yin Huai2016-01-201-1/+1
|
* [SPARK-8968][SQL] external sort by the partition clomns when dynamic ↵wangfei2016-01-202-99/+166
| | | | | | | | | | | | | | | | | partitioning to optimize the memory overhead Now the hash based writer dynamic partitioning show the bad performance for big data and cause many small files and high GC. This patch we do external sort first so that each time we only need open one writer. before this patch: ![gc](https://cloud.githubusercontent.com/assets/7018048/9149788/edc48c6e-3dec-11e5-828c-9995b56e4d65.PNG) after this patch: ![gc-optimize-externalsort](https://cloud.githubusercontent.com/assets/7018048/9149794/60f80c9c-3ded-11e5-8a56-7ae18ddc7a2f.png) Author: wangfei <wangfei_hello@126.com> Author: scwf <wangfei1@huawei.com> Closes #7336 from scwf/dynamic-optimize-basedon-apachespark.
* [SPARK-12797] [SQL] Generated TungstenAggregate (without grouping keys)Davies Liu2016-01-205-12/+111
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As discussed in #10786, the generated TungstenAggregate does not support imperative functions. For a query ``` sqlContext.range(10).filter("id > 1").groupBy().count() ``` The generated code will looks like: ``` /* 032 */ if (!initAgg0) { /* 033 */ initAgg0 = true; /* 034 */ /* 035 */ // initialize aggregation buffer /* 037 */ long bufValue2 = 0L; /* 038 */ /* 039 */ /* 040 */ // initialize Range /* 041 */ if (!range_initRange5) { /* 042 */ range_initRange5 = true; ... /* 071 */ } /* 072 */ /* 073 */ while (!range_overflow8 && range_number7 < range_partitionEnd6) { /* 074 */ long range_value9 = range_number7; /* 075 */ range_number7 += 1L; /* 076 */ if (range_number7 < range_value9 ^ 1L < 0) { /* 077 */ range_overflow8 = true; /* 078 */ } /* 079 */ /* 085 */ boolean primitive11 = false; /* 086 */ primitive11 = range_value9 > 1L; /* 087 */ if (!false && primitive11) { /* 092 */ // do aggregate and update aggregation buffer /* 099 */ long primitive17 = -1L; /* 100 */ primitive17 = bufValue2 + 1L; /* 101 */ bufValue2 = primitive17; /* 105 */ } /* 107 */ } /* 109 */ /* 110 */ // output the result /* 112 */ bufferHolder25.reset(); /* 114 */ rowWriter26.initialize(bufferHolder25, 1); /* 118 */ rowWriter26.write(0, bufValue2); /* 120 */ result24.pointTo(bufferHolder25.buffer, bufferHolder25.totalSize()); /* 121 */ currentRow = result24; /* 122 */ return; /* 124 */ } /* 125 */ ``` cc nongli Author: Davies Liu <davies@databricks.com> Closes #10840 from davies/gen_agg.
* [SPARK-12848][SQL] Change parsed decimal literal datatype from Double to DecimalHerman van Hovell2016-01-2028-59/+83
| | | | | | | | | | | | | | The current parser turns a decimal literal, for example ```12.1```, into a Double. The problem with this approach is that we convert an exact literal into a non-exact ```Double```. The PR changes this behavior, a Decimal literal is now converted into an extact ```BigDecimal```. The behavior for scientific decimals, for example ```12.1e01```, is unchanged. This will be converted into a Double. This PR replaces the ```BigDecimal``` literal by a ```Double``` literal, because the ```BigDecimal``` is the default now. You can use the double literal by appending a 'D' to the value, for instance: ```3.141527D``` cc davies rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10796 from hvanhovell/SPARK-12848.
* [SPARK-12888][SQL] benchmark the new hash expressionWenchen Fan2016-01-201-0/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Benchmark it on 4 different schemas, the result: ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For simple: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 31.47 266.54 1.00 X codegen version 64.52 130.01 0.49 X ``` ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For normal: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 4068.11 0.26 1.00 X codegen version 1175.92 0.89 3.46 X ``` ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For array: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 9276.70 0.06 1.00 X codegen version 14762.23 0.04 0.63 X ``` ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz Hash For map: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- interpreted version 58869.79 0.01 1.00 X codegen version 9285.36 0.06 6.34 X ``` Author: Wenchen Fan <wenchen@databricks.com> Closes #10816 from cloud-fan/hash-benchmark.
* [SPARK-12616][SQL] Making Logical Operator `Union` Support Arbitrary Number ↵gatorsmile2016-01-2020-122/+322
| | | | | | | | | | | | | | of Children The existing `Union` logical operator only supports two children. Thus, adding a new logical operator `Unions` which can have arbitrary number of children to replace the existing one. `Union` logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer to collapse all adjacent `Unions` into a single `Unions`. Note that this problem doesn't exist in physical plan, because the physical `Unions` already supports arbitrary number of children. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10577 from gatorsmile/unionAllMultiChildren.
* [SPARK-7799][SPARK-12786][STREAMING] Add "streaming-akka" projectShixiong Zhu2016-01-2022-185/+601
| | | | | | | | | | | | | Include the following changes: 1. Add "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream 2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream" 3. Update the ActorWordCount example and add the JavaActorWordCount example 4. Make "streaming-zeromq" depend on "streaming-akka" and update the codes accordingly Author: Shixiong Zhu <shixiong@databricks.com> Closes #10744 from zsxwing/streaming-akka-2.
* [SPARK-12847][CORE][STREAMING] Remove StreamingListenerBus and post all ↵Shixiong Zhu2016-01-2014-231/+269
| | | | | | | | | | | | | | | Streaming events to the same thread as Spark events Including the following changes: 1. Add StreamingListenerForwardingBus to WrappedStreamingListenerEvent process events in `onOtherEvent` to StreamingListener 2. Remove StreamingListenerBus 3. Merge AsynchronousListenerBus and LiveListenerBus to the same class LiveListenerBus 4. Add `logEvent` method to SparkListenerEvent so that EventLoggingListener can use it to ignore WrappedStreamingListenerEvents Author: Shixiong Zhu <shixiong@databricks.com> Closes #10779 from zsxwing/streaming-listener.
* [SPARK-10263][ML] Add @Since annotation to ml.param and ml.*Takahashi Hiroshi2016-01-202-5/+42
| | | | | | | | | Add Since annotations to ml.param and ml.* Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp> Author: Hiroshi Takahashi <takahashi.hiroshi@lab.ntt.co.jp> Closes #8935 from taishi-oss/issue10263.
* [SPARK-12898] Consider having dummyCallSite for HiveTableScanRajesh Balamohan2016-01-201-3/+10
| | | | | | | | Currently, HiveTableScan runs with getCallSite which is really expensive and shows up when scanning through large table with partitions (e.g TPC-DS) which slows down the overall runtime of the job. It would be good to consider having dummyCallSite in HiveTableScan. Author: Rajesh Balamohan <rbalamohan@apache.org> Closes #10825 from rajeshbalamohan/SPARK-12898.
* [SPARK-12925][SQL] Improve HiveInspectors.unwrap for StringObjectIns…Rajesh Balamohan2016-01-201-1/+3
| | | | | | | | Text is in UTF-8 and converting it via "UTF8String.fromString" incurs decoding and encoding, which turns out to be expensive and redundant. Profiler snapshot details is attached in the JIRA (ref:https://issues.apache.org/jira/secure/attachment/12783331/SPARK-12925_profiler_cpu_samples.png) Author: Rajesh Balamohan <rbalamohan@apache.org> Closes #10848 from rajeshbalamohan/SPARK-12925.
* [SPARK-12230][ML] WeightedLeastSquares.fit() should handle division by zero ↵Imran Younus2016-01-202-7/+83
| | | | | | | | | | properly if standard deviation of target variable is zero. This fixes the behavior of WeightedLeastSquars.fit() when the standard deviation of the target variable is zero. If the fitIntercept is true, there is no need to train. Author: Imran Younus <iyounus@us.ibm.com> Closes #10274 from iyounus/SPARK-12230_bug_fix_in_weighted_least_squares.
* [SPARK-11295][PYSPARK] Add packages to JUnit output for Python testsGábor Lipták2016-01-205-11/+19
| | | | | | | | | This is #9263 from gliptak (improving grouping/display of test case results) with a small fix of bisecting k-means unit test. Author: Gábor Lipták <gliptak@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #10850 from mengxr/SPARK-11295.
* [SPARK-6519][ML] Add spark.ml API for bisecting k-meansYu ISHIKAWA2016-01-202-0/+281
| | | | | | Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9604 from yu-iskw/SPARK-6519.
* [SPARK-12881] [SQL] subexpress elimination in mutable projectionDavies Liu2016-01-2011-27/+80
| | | | | | Author: Davies Liu <davies@databricks.com> Closes #10814 from davies/mutable_subexpr.
* [SPARK-12912][SQL] Add a test suite for EliminateSubQueriesReynold Xin2016-01-204-26/+103
| | | | | | | | Also updated documentation to explain why ComputeCurrentTime and EliminateSubQueries are in the optimizer rather than analyzer. Author: Reynold Xin <rxin@databricks.com> Closes #10837 from rxin/optimizer-analyzer-comment.
* [SPARK-12871][SQL] Support to specify the option for compression codec.hyukjinkwon2016-01-193-2/+70
| | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-12871 This PR added an option to support to specify compression codec. This adds the option `codec` as an alias `compression` as filed in [SPARK-12668 ](https://issues.apache.org/jira/browse/SPARK-12668). Note that I did not add configurations for Hadoop 1.x as this `CsvRelation` is using Hadoop 2.x API and I guess it is going to drop Hadoop 1.x support. Author: hyukjinkwon <gurwls223@gmail.com> Closes #10805 from HyukjinKwon/SPARK-12420.
* [SPARK-12232][SPARKR] New R API for read.table to avoid name conflictfelixcheung2016-01-195-27/+21
| | | | | | | | shivaram sorry it took longer to fix some conflicts, this is the change to add an alias for `table` Author: felixcheung <felixcheung_m@hotmail.com> Closes #10406 from felixcheung/readtable.
* Revert "[SPARK-11295] Add packages to JUnit output for Python tests"Xiangrui Meng2016-01-195-18/+10
| | | | This reverts commit c6f971b4aeca7265ab374fa46c5c452461d9b6a7.
* [SPARK-12337][SPARKR] Implement dropDuplicates() method of DataFrame in SparkR.Sun Rui2016-01-194-1/+75
| | | | | | Author: Sun Rui <rui.sun@intel.com> Closes #10309 from sun-rui/SPARK-12337.
* [SPARK-12168][SPARKR] Add automated tests for conflicted function in Rfelixcheung2016-01-192-1/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently this is reported when loading the SparkR package in R (probably would add is.nan) ``` Loading required package: methods Attaching package: ‘SparkR’ The following objects are masked from ‘package:stats’: cov, filter, lag, na.omit, predict, sd, var The following objects are masked from ‘package:base’: colnames, colnames<-, intersect, rank, rbind, sample, subset, summary, table, transform ``` Adding this test adds an automated way to track changes to masked method. Also, the second part of this test check for those functions that would not be accessible without namespace/package prefix. Incidentally, this might point to how we would fix those inaccessible functions in base or stats. Looking for feedback for adding this test. Author: felixcheung <felixcheung_m@hotmail.com> Closes #10171 from felixcheung/rmaskedtest.
* [SPARK-12770][SQL] Implement rules for branch elimination for CaseWhenReynold Xin2016-01-192-0/+55
| | | | | | | | | | | | The three optimization cases are: 1. If the first branch's condition is a true literal, remove the CaseWhen and use the value from that branch. 2. If a branch's condition is a false or null literal, remove that branch. 3. If only the else branch is left, remove the CaseWhen and use the value from the else branch. Author: Reynold Xin <rxin@databricks.com> Closes #10827 from rxin/SPARK-12770.
* [SPARK-9716][ML] BinaryClassificationEvaluator should accept Double ↵BenFradet2016-01-194-5/+58
| | | | | | | | | | prediction column This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10472 from BenFradet/SPARK-9716.
* [SPARK-2750][WEB UI] Add https support to the Web UIscwf2016-01-1922-93/+338
| | | | | | | | | Author: scwf <wangfei1@huawei.com> Author: Marcelo Vanzin <vanzin@cloudera.com> Author: WangTaoTheTonic <wangtao111@huawei.com> Author: w00228970 <wangfei1@huawei.com> Closes #10238 from vanzin/SPARK-2750.
* [BUILD] Runner for spark packagesMichael Armbrust2016-01-191-0/+15
| | | | | | | | | | | | | | This is a convenience method added to the SBT build for developers, though if people think its useful we could consider adding a official script that runs using the assembly instead of compiling on demand. It simply compiles spark (without requiring an assembly), and invokes Spark Submit to download / run the package. Example Usage: ``` $ build/sbt > sparkPackage com.databricks:spark-sql-perf_2.10:0.2.4 com.databricks.spark.sql.perf.RunBenchmark --help ``` Author: Michael Armbrust <michael@databricks.com> Closes #10834 from marmbrus/sparkPackageRunner.
* [SPARK-11295] Add packages to JUnit output for Python testsGábor Lipták2016-01-195-10/+18
| | | | | | | | | | SPARK-11295 Add packages to JUnit output for Python tests This improves grouping/display of test case results. Author: Gábor Lipták <gliptak@gmail.com> Closes #9263 from gliptak/SPARK-11295.
* [SPARK-12816][SQL] De-alias type when generating schemasJakob Odersky2016-01-192-1/+12
| | | | | | | | | | | | | | Call `dealias` on local types to fix schema generation for abstract type members, such as ```scala type KeyValue = (Int, String) ``` Add simple test Author: Jakob Odersky <jodersky@gmail.com> Closes #10749 from jodersky/aliased-schema.
* [SPARK-12560][SQL] SqlTestUtils.stripSparkFilter needs to copy utf8stringsImran Rashid2016-01-191-1/+1
| | | | | | | | | | See https://issues.apache.org/jira/browse/SPARK-12560 This isn't causing any problems currently because the tests for string predicate pushdown are currently disabled. I ran into this while trying to turn them back on with a different version of parquet. Figure it was good to fix now in any case. Author: Imran Rashid <irashid@cloudera.com> Closes #10510 from squito/SPARK-12560.
* [SPARK-12867][SQL] Nullability of Intersect can be strictergatorsmile2016-01-192-6/+33
| | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-12867 When intersecting one nullable column with one non-nullable column, the result will not contain any null. Thus, we can make nullability of `intersect` stricter. liancheng Could you please check if the code changes are appropriate? Also added test cases to verify the results. Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes #10812 from gatorsmile/nullabilityIntersect.
* [SPARK-12804][ML] Fix LogisticRegression with FitIntercept on all same label ↵Feynman Liang2016-01-192-95/+148
| | | | | | | | | | training data CC jkbradley mengxr dbtsai Author: Feynman Liang <feynman.liang@gmail.com> Closes #10743 from feynmanliang/SPARK-12804.
* [SPARK-12887] Do not expose var's in TaskMetricsAndrew Or2016-01-1927-246/+281
| | | | | | | | | | | | | | | | This is a step in implementing SPARK-10620, which migrates TaskMetrics to accumulators. TaskMetrics has a bunch of var's, some are fully public, some are `private[spark]`. This is bad coding style that makes it easy to accidentally overwrite previously set metrics. This has happened a few times in the past and caused bugs that were difficult to debug. Instead, we should have get-or-create semantics, which are more readily understandable. This makes sense in the case of TaskMetrics because these are just aggregated metrics that we want to collect throughout the task, so it doesn't matter who's incrementing them. Parent PR: #10717 Author: Andrew Or <andrew@databricks.com> Author: Josh Rosen <joshrosen@databricks.com> Author: andrewor14 <andrew@databricks.com> Closes #10815 from andrewor14/get-or-create-metrics.
* [SPARK-12870][SQL] better format bucket id in file nameWenchen Fan2016-01-194-7/+13
| | | | | | | | for normal parquet file without bucket, it's file name ends with a jobUUID which maybe all numbers and mistakeny regarded as bucket id. This PR improves the format of bucket id in file name by using a different seperator, `_`, so that the regex is more robust. Author: Wenchen Fan <wenchen@databricks.com> Closes #10799 from cloud-fan/fix-bucket.
* [SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k meansHolden Karau2016-01-193-5/+159
| | | | | | | | From the coverage issues for 1.6 : Add Python API for mllib.clustering.BisectingKMeans. Author: Holden Karau <holden@us.ibm.com> Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
* [MLLIB] Fix CholeskyDecomposition assertion's messageWojciech Jurczyk2016-01-191-1/+1
| | | | | | | | Change assertion's message so it's consistent with the code. The old message says that the invoked method was lapack.dports, where in fact it was lapack.dppsv method. Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com> Closes #10818 from wjur/wjur/rename_error_message.
* [SPARK-7683][PYSPARK] Confusing behavior of fold function of RDD in pysparkSean Owen2016-01-191-1/+1
| | | | | | | | | | | | Fix order of arguments that Pyspark RDD.fold passes to its op - should be (acc, obj) like other implementations. Obviously, this is a potentially breaking change, so can only happen for 2.x CC davies Author: Sean Owen <sowen@cloudera.com> Closes #10771 from srowen/SPARK-7683.
* [SQL][MINOR] Fix one little mismatched comment according to the codes in ↵proflin2016-01-191-1/+1
| | | | | | | | interface.scala Author: proflin <proflin.me@gmail.com> Closes #10824 from proflin/master.
* [SPARK-12668][SQL] Providing aliases for CSV options to be similar to Pandas ↵hyukjinkwon2016-01-183-6/+20
| | | | | | | | | | | | | | | and R https://issues.apache.org/jira/browse/SPARK-12668 Spark CSV datasource has been being merged (filed in [SPARK-12420](https://issues.apache.org/jira/browse/SPARK-12420)). This is a quicky PR that simply renames several CSV options to similar Pandas and R. - Alias for delimiter ­-> sep - charset -­> encoding Author: hyukjinkwon <gurwls223@gmail.com> Closes #10800 from HyukjinKwon/SPARK-12668.
* [HOT][BUILD] Changed the import ordergatorsmile2016-01-182-2/+2
| | | | | | | | | | | | | This PR is to fix the master's build break. The following tests failed due to the import order issues in the master. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49651/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49652/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49653/consoleFull Author: gatorsmile <gatorsmile@gmail.com> Closes #10823 from gatorsmile/importOrder.
* [SPARK-12885][MINOR] Rename 3 fields in ShuffleWriteMetricsAndrew Or2016-01-1824-114/+126
| | | | | | | | | | | | | | | | | | | | | | This is a small step in implementing SPARK-10620, which migrates TaskMetrics to accumulators. This patch is strictly a cleanup patch and introduces no change in functionality. It literally just renames 3 fields for consistency. Today we have: ``` inputMetrics.recordsRead outputMetrics.bytesWritten shuffleReadMetrics.localBlocksFetched ... shuffleWriteMetrics.shuffleRecordsWritten shuffleWriteMetrics.shuffleBytesWritten shuffleWriteMetrics.shuffleWriteTime ``` The shuffle write ones are kind of redundant. We can drop the `shuffle` part in the method names. I added backward compatible (but deprecated) methods with the old names. Parent PR: #10717 Author: Andrew Or <andrew@databricks.com> Closes #10811 from andrewor14/rename-things.
* [SPARK-12700] [SQL] embed condition into SMJ and BroadcastHashJoinDavies Liu2016-01-186-72/+96
| | | | | | | | | | | | Currently SortMergeJoin and BroadcastHashJoin do not support condition, the need a followed Filter for that, the result projection to generate UnsafeRow could be very expensive if they generate lots of rows and could be filtered mostly by condition. This PR brings the support of condition for SortMergeJoin and BroadcastHashJoin, just like other outer joins do. This could improve the performance of Q72 by 7x (from 120s to 16.5s). Author: Davies Liu <davies@databricks.com> Closes #10653 from davies/filter_join.
* [SPARK-12889][SQL] Rename ParserDialect -> ParserInterface.Reynold Xin2016-01-187-10/+10
| | | | | | | | Based on discussions in #10801, I'm submitting a pull request to rename ParserDialect to ParserInterface. Author: Reynold Xin <rxin@databricks.com> Closes #10817 from rxin/SPARK-12889.
* [SPARK-12894][DOCUMENT] Add deploy instructions for Python in Kinesis ↵Shixiong Zhu2016-01-181-2/+12
| | | | | | | | | | integration doc This PR added instructions to get Kinesis assembly jar for Python users in the Kinesis integration page like Kafka doc. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10822 from zsxwing/kinesis-doc.
* Revert "[SPARK-12829] Turn Java style checker on"Shixiong Zhu2016-01-181-1/+2
| | | | This reverts commit 591c88c9e2a6c2e2ca84f1b66c635f198a16d112. `lint-java` doesn't work on a machine with a clean Maven cache.
* [SPARK-12814][DOCUMENT] Add deploy instructions for Python in flume ↵Shixiong Zhu2016-01-182-4/+13
| | | | | | | | | | integration doc This PR added instructions to get flume assembly jar for Python users in the flume integration page like Kafka doc. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10746 from zsxwing/flume-doc.
* [SPARK-12882][SQL] simplify bucket tests and add more commentsWenchen Fan2016-01-182-46/+78
| | | | | | | | Right now, the bucket tests are kind of hard to understand, this PR simplifies them and add more commetns. Author: Wenchen Fan <wenchen@databricks.com> Closes #10813 from cloud-fan/bucket-comment.
* [SPARK-12841][SQL] fix cast in filterWenchen Fan2016-01-183-8/+18
| | | | | | | | In SPARK-10743 we wrap cast with `UnresolvedAlias` to give `Cast` a better alias if possible. However, for cases like `filter`, the `UnresolvedAlias` can't be resolved and actually we don't need a better alias for this case. This PR move the cast wrapping logic to `Column.named` so that we will only do it when we need a alias name. Author: Wenchen Fan <wenchen@databricks.com> Closes #10781 from cloud-fan/bug.
* [SPARK-12855][SQL] Remove parser dialect developer APIReynold Xin2016-01-1813-139/+16
| | | | | | | | This pull request removes the public developer parser API for external parsers. Given everything a parser depends on (e.g. logical plans and expressions) are internal and not stable, external parsers will break with every release of Spark. It is a bad idea to create the illusion that Spark actually supports pluggable parsers. In addition, this also reduces incentives for 3rd party projects to contribute parse improvements back to Spark. Author: Reynold Xin <rxin@databricks.com> Closes #10801 from rxin/SPARK-12855.
* [SPARK-10985][CORE] Avoid passing evicted blocks throughout BlockManagerJosh Rosen2016-01-1814-241/+170
| | | | | | | | This patch refactors portions of the BlockManager and CacheManager in order to avoid having to pass `evictedBlocks` lists throughout the code. It appears that these lists were only consumed by `TaskContext.taskMetrics`, so the new code now directly updates the metrics from the lower-level BlockManager methods. Author: Josh Rosen <joshrosen@databricks.com> Closes #10776 from JoshRosen/SPARK-10985.