Commit message (Author, Date; Files, Lines)
* [SPARK-14342][CORE][DOCS][TESTS] Remove straggler references to Tachyon (Liwei Lin, 2016-04-02; 6 files, -24/+24)

## What changes were proposed in this pull request?
Straggler references to Tachyon were removed:
- for docs, `tachyon` has been generalized as `off-heap memory`;
- for Mesos test suites, the key-value `tachyon:true`/`tachyon:false` has been changed to `os:centos`/`os:ubuntu`, since `os` is an example constraint used by the [Mesos official docs](http://mesos.apache.org/documentation/attributes-resources/).

## How was this patch tested?
Existing test suites.

Author: Liwei Lin <lwlin7@gmail.com>
Closes #12129 from lw-lin/tachyon-cleanup.

* [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code. (Dongjoon Hyun, 2016-04-02; 77 files, -747/+786)

## What changes were proposed in this pull request?
This PR aims to fix all Scala-style multiline comments into Java-style multiline comments in Scala code. (All comment-only changes over 77 files: +786 lines, -747 lines.)

## How was this patch tested?
Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.

* [SPARK-14338][SQL] Improve `SimplifyConditionals` rule to handle `null` in IF/CASEWHEN (Dongjoon Hyun, 2016-04-02; 2 files, -8/+21)

## What changes were proposed in this pull request?
Currently, `SimplifyConditionals` handles `true` and `false` to optimize branches. This PR improves `SimplifyConditionals` to take advantage of `null` conditions for `if` and `CaseWhen` expressions, too.

**Before**
```
scala> sql("SELECT IF(null, 1, 0)").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [if (null) 1 else 0 AS (IF(CAST(NULL AS BOOLEAN), 1, 0))#4]
:     +- INPUT
+- Scan OneRowRelation[]

scala> sql("select case when cast(null as boolean) then 1 else 2 end").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [CASE WHEN null THEN 1 ELSE 2 END AS CASE WHEN CAST(NULL AS BOOLEAN) THEN 1 ELSE 2 END#14]
:     +- INPUT
+- Scan OneRowRelation[]
```

**After**
```
scala> sql("SELECT IF(null, 1, 0)").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [0 AS (IF(CAST(NULL AS BOOLEAN), 1, 0))#4]
:     +- INPUT
+- Scan OneRowRelation[]

scala> sql("select case when cast(null as boolean) then 1 else 2 end").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [2 AS CASE WHEN CAST(NULL AS BOOLEAN) THEN 1 ELSE 2 END#4]
:     +- INPUT
+- Scan OneRowRelation[]
```

**Hive**
```
hive> select if(null,1,2);
OK
2
hive> select case when cast(null as boolean) then 1 else 2 end;
OK
2
```

## How was this patch tested?
Pass the Jenkins tests (including new extended test cases).

Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12122 from dongjoon-hyun/SPARK-14338.

* [HOTFIX] Disable StateStoreSuite.maintenance (Reynold Xin, 2016-04-02; 1 file, -1/+1)

* [MINOR] Typo fixes (Jacek Laskowski, 2016-04-02; 16 files, -49/+52)

## What changes were proposed in this pull request?
Typo fixes. No functional changes.

## How was this patch tested?
Built the sources and ran with samples.

Author: Jacek Laskowski <jacek@japila.pl>
Closes #11802 from jaceklaskowski/typo-fixes.

* [HOTFIX] Fix compilation break. (Reynold Xin, 2016-04-02; 2 files, -5/+4)

* [MINOR][SQL] Fix comment style and correct several styles and nits in CSV data source (hyukjinkwon, 2016-04-01; 4 files, -49/+48)

## What changes were proposed in this pull request?
While trying to create a PR (which was not an issue at the end), I just corrected some style nits. So, I removed the changes except for some coding style corrections.

- According to the [scala-style-guide#documentation-style](https://github.com/databricks/scala-style-guide#documentation-style), Scala style comments are discouraged.

>```scala
>/** This is a correct one-liner, short description. */
>
>/**
> * This is correct multi-line JavaDoc comment. And
> * this is my second line, and if I keep typing, this would be
> * my third line.
> */
>
>/** In Spark, we don't use the ScalaDoc style so this
> * is not correct.
> */
>```

- Double newlines between consecutive methods were removed. According to [scala-style-guide#blank-lines-vertical-whitespace](https://github.com/databricks/scala-style-guide#blank-lines-vertical-whitespace), a single newline appears when

>Between consecutive members (or initializers) of a class: fields, constructors, methods, nested classes, static initializers, instance initializers.

- Removed useless parentheses in tests
- Use `mapPartitions` instead of `mapPartitionsWithIndex()`.

## How was this patch tested?
Unit tests were used and `dev/run_tests` for style tests.

Author: hyukjinkwon <gurwls223@gmail.com>
Closes #12109 from HyukjinKwon/SPARK-14271.

* [SPARK-14285][SQL] Implement common type-safe aggregate functions (Reynold Xin, 2016-04-01; 9 files, -111/+342)

## What changes were proposed in this pull request?
In the Dataset API, it is fairly difficult for users to perform simple aggregations in a type-safe way at the moment because there are no aggregators that have been implemented. This pull request adds a few common aggregate functions in the expressions.scala.typed package, and also creates the expressions.java.typed package without implementation. The Java implementation should probably come as a separate pull request. One challenge there is to resolve the type difference between Scala primitive types and Java boxed types.

## How was this patch tested?
Added unit tests for them.

Author: Reynold Xin <rxin@databricks.com>
Closes #12077 from rxin/SPARK-14285.

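As an illustration of the Scala side of this API, here is a minimal usage sketch. It is written against the released Spark 2.x Dataset API, where the Scala aggregators ended up in `org.apache.spark.sql.expressions.scalalang.typed` (the PR text calls the package `expressions.scala.typed`); the `Sale` case class and its data are invented for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.scalalang.typed

// Hypothetical domain class for the example.
case class Sale(city: String, amount: Double)

object TypedAggExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed-agg").getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("SF", 10.0), Sale("SF", 5.0), Sale("NY", 7.0)).toDS()

    // Each aggregator takes a lambda over the object rather than a column name,
    // so type errors are caught at compile time.
    val perCity = sales
      .groupByKey(_.city)
      .agg(typed.sum[Sale](_.amount), typed.avg[Sale](_.amount), typed.count[Sale](_.amount))

    perCity.show()
    spark.stop()
  }
}
```
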
* [SPARK-14251][SQL] Add SQL command for printing out generated code for debugging (Dongjoon Hyun, 2016-04-01; 7 files, -31/+67)

## What changes were proposed in this pull request?
This PR implements the `EXPLAIN CODEGEN` SQL command, which returns generated code like `debugCodegen`. In `spark-shell`, we no longer need to import the `debug` module. In `spark-sql`, we can use this SQL command now.

**Before**
```
scala> import org.apache.spark.sql.execution.debug._
scala> sql("select 'a' as a group by 1").debugCodegen()
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
...
Generated code:
...
== Subtree 2 / 2 ==
...
Generated code:
...
```

**After**
```
scala> sql("explain extended codegen select 'a' as a group by 1").collect().foreach(println)
[Found 2 WholeStageCodegen subtrees.]
[== Subtree 1 / 2 ==]
...
[]
[Generated code:]
...
[]
[== Subtree 2 / 2 ==]
...
[]
[Generated code:]
...
```

## How was this patch tested?
Pass the Jenkins tests (including new test cases).

Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12099 from dongjoon-hyun/SPARK-14251.

* [SPARK-14138][SQL][MASTER] Fix generated SpecificColumnarIterator code that can exceed the JVM size limit for cached DataFrames (Kazuaki Ishizaki, 2016-04-01; 2 files, -5/+51)

## What changes were proposed in this pull request?
This PR reduces the Java bytecode size of methods in `SpecificColumnarIterator` by grouping a large number of `ColumnAccessor` instantiations or method calls (more than 200) into separate methods.

## How was this patch tested?
Added a new unit test, which includes a large number of instantiations and method calls, to `InMemoryColumnarQuerySuite`.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #12108 from kiszk/SPARK-14138-master.

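For illustration only, here is a generic sketch of the splitting technique: instead of emitting hundreds of calls into one generated method (which can blow past the JVM's 64KB-per-method bytecode limit), the calls are grouped into helper methods plus a small dispatcher. This is not the actual generated code; the `extractTo` accessor name, `MutableRow` parameter, and the group size of 100 are assumptions for the sketch.

```scala
object SplitLargeMethodSketch {
  /** Build generated-Java source that splits `numColumns` accessor calls into groups. */
  def generateAccessorCalls(numColumns: Int, groupSize: Int = 100): String = {
    val calls = (0 until numColumns).map(i => s"    accessor$i.extractTo(mutableRow, $i);")

    // One private helper per group instead of a single huge method body.
    val helpers = calls.grouped(groupSize).zipWithIndex.map { case (group, idx) =>
      s"""  private void extractGroup$idx(MutableRow mutableRow) {
         |${group.mkString("\n")}
         |  }""".stripMargin
    }

    // A small dispatcher that calls each helper in order.
    val numGroups = (numColumns + groupSize - 1) / groupSize
    val dispatcher = (0 until numGroups).map(i => s"    extractGroup$i(mutableRow);").mkString("\n")

    s"""${helpers.mkString("\n\n")}
       |
       |  public void extractAll(MutableRow mutableRow) {
       |$dispatcher
       |  }""".stripMargin
  }

  def main(args: Array[String]): Unit = {
    // With 250 cached columns this produces 3 helper methods plus the dispatcher.
    println(generateAccessorCalls(250))
  }
}
```
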
* [SPARK-14244][SQL] Don't use SizeBasedWindowFunction.n created on the executor side when evaluating window functions (Cheng Lian, 2016-04-01; 4 files, -9/+67)

## What changes were proposed in this pull request?
`SizeBasedWindowFunction.n` is a global singleton attribute created for evaluating size-based aggregate window functions like `CUME_DIST`. However, this attribute gets different expression IDs when created on the driver side and on the executor side. This PR adds a `withPartitionSize` method to `SizeBasedWindowFunction` so that we can easily rewrite `SizeBasedWindowFunction.n` on the executor side.

## How was this patch tested?
A test case is added in `HiveSparkSubmitSuite`, which supports launching multi-process clusters.

Author: Cheng Lian <lian@databricks.com>
Closes #12040 from liancheng/spark-14244-fix-sized-window-function.

* [SPARK-14308][ML][MLLIB] Remove unused mllib tree classes and move private classes to ML (sethah, 2016-04-01; 18 files, -409/+15)

## What changes were proposed in this pull request?
Decision tree helper classes will be migrated to ML. This patch moves those internal classes that are not part of the public API and removes ones that are no longer used, after [SPARK-12183](https://github.com/apache/spark/pull/11855). No functional changes are made.

Details:
* Bin.scala is removed as the ML implementation does not require bins
* mllib NodeIdCache is removed. It was only used by the mllib implementation previously, which no longer exists
* mllib TreePoint is removed. It was only used by the mllib implementation previously, which no longer exists
* BaggedPoint, DTStatsAggregator, DecisionTreeMetadata, BaggedPointSuite and TimeTracker are all moved to ML.

## How was this patch tested?
No functional changes are made. Existing unit tests ensure behavior is unchanged.

Author: sethah <seth.hendrickson16@gmail.com>
Closes #12097 from sethah/cleanup_mllib_tree.

* [SPARK-7425][ML] spark.ml Predictor should support other numeric types for label (BenFradet, 2016-04-01; 24 files, -49/+294)

Currently, the Predictor abstraction expects the input labelCol type to be DoubleType, but we should support other numeric types. This involves updating the PredictorParams.validateAndTransformSchema method.

Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10355 from BenFradet/SPARK-7425.

* [SPARK-13241][WEB UI] Added long values for dates in ApplicationAttemptInfo API (Alex Bozarth, 2016-04-01; 9 files, -2/+92)

## What changes were proposed in this pull request?
Adding long values for each Date in the ApplicationAttemptInfo API for easier use in code.

## How was this patch tested?
Tested with dev/run-tests.

Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes #11326 from ajbozarth/spark13241.

* [SPARK-12857][STREAMING] Standardize "records" and "events" on "records" (Liwei Lin, 2016-04-01; 4 files, -40/+41)

## What changes were proposed in this pull request?
Currently the Streaming tab in the web UI uses "records" and "events" interchangeably; this PR tries to standardize them on "records". "records" is chosen over "events" because:
- "records" is used extensively throughout the streaming documents, code, and comments
- "events" is used only in Streaming UI related code and comments

## How was this patch tested?
- existing test suites
- manually checking on the Streaming UI tab

Author: Liwei Lin <lwlin7@gmail.com>
Closes #12032 from lw-lin/streaming-events-to-records.

* [SPARK-13825][CORE] Upgrade to Scala 2.11.8 (Jacek Laskowski, 2016-04-01; 6 files, -23/+23)

## What changes were proposed in this pull request?
Upgrade to 2.11.8 (from the current 2.11.7).

## How was this patch tested?
A manual build.

Author: Jacek Laskowski <jacek@japila.pl>
Closes #11681 from jaceklaskowski/SPARK-13825-scala-2_11_8.

* [SPARK-14255][SQL] Streaming Aggregation (Michael Armbrust, 2016-04-01; 33 files, -305/+827)

This PR adds the ability to perform aggregations inside of a `ContinuousQuery`. In order to implement this feature, the planning of aggregation has been augmented with a new `StatefulAggregationStrategy`. Unlike batch aggregation, stateful aggregation uses the `StateStore` (introduced in #11645) to persist the results of partial aggregation across different invocations. The resulting physical plan performs the aggregation using the following progression:
- Partial Aggregation
- Shuffle
- Partial Merge (now there is at most 1 tuple per group)
- StateStoreRestore (now there is 1 tuple from this batch + optionally one from the previous)
- Partial Merge (now there is at most 1 tuple per group)
- StateStoreSave (saves the tuple for the next batch)
- Complete (output the current result of the aggregation)

The following refactoring was also performed to allow us to plug into existing code:
- The get/put implementation is taken from #12013
- The logic for breaking down and de-duping the physical execution of aggregation has been moved into a new pattern, `PhysicalAggregation`
- The `AttributeReference` used to identify the result of an `AggregateFunction` has been moved into the `AggregateExpression` container. This change moves the reference into the same object as the other intermediate references used in aggregation and eliminates the need to pass around a `Map[(AggregateFunction, Boolean), Attribute]`. Further clean up (using a different aggregation container for logical/physical plans) is deferred to a followup.
- Some planning logic is moved from the `SessionState` into the `QueryExecution` to make it easier to override in the streaming case.
- The ability to write a `StreamTest` that checks only the output of the last batch has been added to simulate the future addition of output modes.

Author: Michael Armbrust <michael@databricks.com>
Closes #12048 from marmbrus/statefulAgg.

* [SPARK-14316][SQL] StateStoreCoordinator should extend ThreadSafeRpcEndpoint (Shixiong Zhu, 2016-04-01; 2 files, -7/+5)

## What changes were proposed in this pull request?
RpcEndpoint is not thread safe and allows multiple messages to be processed at the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.

## How was this patch tested?
Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Closes #12100 from zsxwing/fix-StateStoreCoordinator.

* [SPARK-13992] Add support for off-heap caching (Josh Rosen, 2016-04-01; 20 files, -176/+345)

This patch adds support for caching blocks in the executor processes using direct / off-heap memory.

## User-facing changes

**Updated semantics of `OFF_HEAP` storage level**: In Spark 1.x, the `OFF_HEAP` storage level indicated that an RDD should be cached in Tachyon. Spark 2.x removed the external block store API that Tachyon caching was based on (see #10752 / SPARK-12667), so `OFF_HEAP` became an alias for `MEMORY_ONLY_SER`. As of this patch, `OFF_HEAP` means "serialized and cached in off-heap memory or on disk". Via the `StorageLevel` constructor, `useOffHeap` can be set if `serialized == true` and can be used to construct custom storage levels which support replication.

**Storage UI reporting**: the storage UI will now report whether in-memory blocks are stored on- or off-heap.

**Only supported by UnifiedMemoryManager**: for simplicity, this feature is only supported when the default UnifiedMemoryManager is used; applications which use the legacy memory manager (`spark.memory.useLegacyMode=true`) are not currently able to allocate off-heap storage memory, so using off-heap caching will fail with an error when legacy memory management is enabled. Given that we plan to eventually remove the legacy memory manager, this is not a significant restriction.

**Memory management policies**: the policies for dividing available memory between execution and storage are the same for both on- and off-heap memory. For off-heap memory, the total amount of memory available for use by Spark is controlled by `spark.memory.offHeap.size`, which is an absolute size. Off-heap storage memory obeys `spark.memory.storageFraction` in order to control the amount of unevictable storage memory. For example, if `spark.memory.offHeap.size` is 1 gigabyte and Spark uses the default `storageFraction` of 0.5, then up to 500 megabytes of off-heap cached blocks will be protected from eviction due to execution memory pressure. If necessary, we can split `spark.memory.storageFraction` into separate on- and off-heap configurations, but this doesn't seem necessary now and can be done later without any breaking changes.

**Use of off-heap memory does not imply use of off-heap execution (or vice-versa)**: for now, the settings controlling the use of off-heap execution memory (`spark.memory.offHeap.enabled`) and off-heap caching are completely independent, so Spark SQL can be configured to use off-heap memory for execution while continuing to cache blocks on-heap. If desired, we can change this in a followup patch so that `spark.memory.offHeap.enabled` affects the default storage level for cached SQL tables.

## Internal changes

- Rename `ByteArrayChunkOutputStream` to `ChunkedByteBufferOutputStream`
  - It now returns a `ChunkedByteBuffer` instead of an array of byte arrays.
  - Its constructor now accepts an `allocator` function which is called to allocate `ByteBuffer`s. This allows us to control whether it allocates regular ByteBuffers or off-heap DirectByteBuffers.
  - Because block serialization is now performed during the unroll process, a `ChunkedByteBufferOutputStream` which is configured with a `DirectByteBuffer` allocator will use off-heap memory for both unroll and storage memory.
- The `MemoryStore`'s MemoryEntries now track whether blocks are stored on- or off-heap.
- `evictBlocksToFreeSpace()` now accepts a `MemoryMode` parameter so that we don't try to evict off-heap blocks in response to on-heap memory pressure (or vice-versa).
- Make sure that off-heap buffers are properly de-allocated during MemoryStore eviction.
- The JVM limits the total size of allocated direct byte buffers using the `-XX:MaxDirectMemorySize` flag and the default tends to be fairly low (< 512 megabytes in some JVMs). To work around this limitation, this patch adds a custom DirectByteBuffer allocator which ignores this memory limit.

Author: Josh Rosen <joshrosen@databricks.com>
Closes #11805 from JoshRosen/off-heap-caching.

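A minimal usage sketch of the configuration described above. The RDD contents and sizes are placeholders; `spark.memory.offHeap.enabled`, `spark.memory.offHeap.size`, `spark.memory.storageFraction`, and `StorageLevel.OFF_HEAP` are the knobs the commit message refers to.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object OffHeapCachingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("off-heap-caching")
      // Off-heap memory is sized explicitly; it is not part of the executor heap.
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", "1g")
      // With the default storageFraction of 0.5, roughly 500 MB of off-heap cached
      // blocks are protected from eviction by execution memory pressure.
      .set("spark.memory.storageFraction", "0.5")

    val sc = new SparkContext(conf)
    val data = sc.parallelize(1 to 1000000).map(i => (i, i.toString))
    // OFF_HEAP now means "serialized and cached in off-heap memory or on disk".
    data.persist(StorageLevel.OFF_HEAP)
    println(data.count())
    sc.stop()
  }
}
```
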
* [SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n… (zhonghaihua, 2016-04-01; 4 files, -2/+29)

Currently, when the number of executor failures reaches `maxNumExecutorFailures`, the `ApplicationMaster` is killed and another one is re-registered. At that point a new instance of `YarnAllocator` is created, but the value of the property `executorIdCounter` in `YarnAllocator` resets to `0`, so the IDs of new executors start from `1`. This conflicts with executors created earlier and causes FetchFailedException. This situation only occurs in YARN client mode. For more details, see the [JIRA issue SPARK-12864](https://issues.apache.org/jira/browse/SPARK-12864).

This PR introduces a mechanism to initialize `executorIdCounter` after the `ApplicationMaster` is killed.

Author: zhonghaihua <793507405@qq.com>
Closes #10794 from zhonghaihua/initExecutorIdCounterAfterAMKilled.

* [SPARK-13674][SQL] Add whole-stage codegen support to Sample (Liang-Chi Hsieh, 2016-04-01; 6 files, -15/+104)

JIRA: https://issues.apache.org/jira/browse/SPARK-13674

## What changes were proposed in this pull request?
The Sample operator doesn't support whole-stage codegen now. This PR adds support for it.

## How was this patch tested?
A test is added into `BenchmarkWholeStageCodegen`. Besides, all tests should pass.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #11517 from viirya/add-wholestage-sample.

* [SPARK-14160] Time Windowing functions for Datasets (Burak Yavuz, 2016-04-01; 7 files, -0/+735)

## What changes were proposed in this pull request?
This PR adds the function `window` as a column expression. `window` can be used to bucket rows into time windows given a time column. With this expression, performing time series analysis on batch data, as well as streaming data, should become much simpler.

### Usage
Assume the following schema: `sensor_id, measurement, timestamp`

To average 5 minute data every 1 minute (window length of 5 minutes, slide duration of 1 minute), we will use:
```scala
df.groupBy(window("timestamp", "5 minutes", "1 minute"), "sensor_id")
  .agg(mean("measurement").as("avg_meas"))
```
This will generate windows such as:
```
09:00:00-09:05:00
09:01:00-09:06:00
09:02:00-09:07:00
...
```
Intervals will start at every `slideDuration` starting at the unix epoch (1970-01-01 00:00:00 UTC). To start intervals at a different point of time, e.g. 30 seconds after a minute, the `startTime` parameter can be used.
```scala
df.groupBy(window("timestamp", "5 minutes", "1 minute", "30 second"), "sensor_id")
  .agg(mean("measurement").as("avg_meas"))
```
This will generate windows such as:
```
09:00:30-09:05:30
09:01:30-09:06:30
09:02:30-09:07:30
...
```
Support for Python will be made in a follow up PR after this.

## How was this patch tested?
This patch has some basic unit tests for the `TimeWindow` expression testing that the parameters pass validation, and it also has some unit/integration tests testing the correctness of the windowing and usability in complex operations (multi-column grouping, multi-column projections, joins).

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #12008 from brkyvz/df-time-window.

* [SPARK-14070][SQL] Use ORC data source for SQL queries on ORC tables (Tejas Patil, 2016-04-01; 6 files, -75/+220)

## What changes were proposed in this pull request?
This patch enables use of OrcRelation for SQL queries which read data from Hive tables. Changes in this patch:
- Added a new rule `OrcConversions` which alters the plan to use `OrcRelation`. In this diff, the conversion is done only for reads.
- Added a new config `spark.sql.hive.convertMetastoreOrc` to control the conversion.

BEFORE
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None

== Analyzed Logical Plan ==
key: string, value: string
Project [key#171,value#172]
+- MetastoreRelation default, orc_table, None

== Optimized Logical Plan ==
MetastoreRelation default, orc_table, None

== Physical Plan ==
HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None
```

AFTER
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None

== Analyzed Logical Plan ==
key: string, value: string
Project [key#76,value#77]
+- SubqueryAlias orc_table
   +- Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>

== Optimized Logical Plan ==
Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>

== Physical Plan ==
WholeStageCodegen
:  +- Scan ORC part: struct<>, data: struct<key:string,value:string>[key#76,value#77] InputPaths: file:/user/hive/warehouse/orc_table
```

## How was this patch tested?
- Added a new unit test. Ran existing unit tests.
- Ran with production-like data.

## Performance gains
Ran on a production table in Facebook (note that the data was in DWRF file format, which is similar to ORC).

Best case: when there were no matching rows for the predicate in the query (everything is filtered out)
```
                     CPU time       Wall time    Total wall time across all tasks
================================================================
Without the change   541_515 sec    25.0 mins    165.8 hours
With change              407 sec     1.5 mins    15 mins
```

Average case: a subset of rows in the data match the query predicate
```
                     CPU time       Wall time    Total wall time across all tasks
================================================================
Without the change   624_630 sec    31.0 mins    199.0 h
With change           14_769 sec     5.3 mins    7.7 h
```

Author: Tejas Patil <tejasp@fb.com>
Closes #11891 from tejasapatil/orc_ppd.

* [SPARK-14191][SQL] Remove invalid Expand operator constraints (Liang-Chi Hsieh, 2016-04-01; 2 files, -1/+31)

The `Expand` operator currently uses its child plan's constraints as its valid constraints (i.e., the base of constraints). This is not correct because `Expand` will set its group-by attributes to null values, so the nullability of these attributes should be true.

E.g., for an `Expand` operator like:

    val input = LocalRelation('a.int, 'b.int, 'c.int).where('c.attr > 10 && 'a.attr < 5 && 'b.attr > 2)
    Expand(
      Seq(
        Seq('c, Literal.create(null, StringType), 1),
        Seq('c, 'a, 2)),
      Seq('c, 'a, 'gid.int),
      Project(Seq('a, 'c), input))

The `Project` operator has the constraints `IsNotNull('a)`, `IsNotNull('b)` and `IsNotNull('c)`. But the `Expand` should not have `IsNotNull('a)` in its constraints.

This PR is the first step for this issue and removes invalid constraints of the `Expand` operator. A test is added to `ConstraintPropagationSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #11995 from viirya/fix-expand-constraints.

* [SPARK-13995][SQL] Extract correct IsNotNull constraints for Expression (Liang-Chi Hsieh, 2016-04-01; 7 files, -37/+134)

## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-13995

We currently infer relative `IsNotNull` constraints from a logical plan's expressions in `constructIsNotNullConstraints`. However, we don't consider the case of (nested) `Cast`. For example:

    val tr = LocalRelation('a.int, 'b.long)
    val plan = tr.where('a.attr === 'b.attr).analyze

Then, the plan's constraints will have `IsNotNull(Cast(resolveColumn(tr, "a"), LongType))`, instead of `IsNotNull(resolveColumn(tr, "a"))`. This PR fixes it.

Besides, as `IsNotNull` constraints are most useful for `Attribute`, we should recurse through any `Expression` that is null intolerant and construct `IsNotNull` constraints for all `Attribute`s under these expressions. For example, consider the following constraints:

    val df = Seq((1,2,3)).toDF("a", "b", "c")
    df.where("a + b = c").queryExecution.analyzed.constraints

The inferred isnotnull constraints should be isnotnull(a), isnotnull(b), isnotnull(c), instead of isnotnull(a + c) and isnotnull(c).

## How was this patch tested?
A test is added to `ConstraintPropagationSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes #11809 from viirya/constraint-cast.

* [SPARK-14305][ML][PYSPARK] PySpark ml.clustering BisectingKMeans support export/import (Yanbo Liang, 2016-04-01; 1 file, -2/+15)

## What changes were proposed in this pull request?
PySpark ml.clustering BisectingKMeans support export/import.

## How was this patch tested?
doc test.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12112 from yanboliang/spark-14305.

* [SPARK-12343][YARN] Simplify Yarn client and client argument (jerryshao, 2016-04-01; 16 files, -342/+186)

## What changes were proposed in this pull request?
Currently in Spark on YARN, configurations can be passed through SparkConf, env and command arguments; some parts are duplicated, like client arguments and SparkConf. So here I propose to simplify the command arguments.

## How was this patch tested?
This patch is tested manually and with unit tests.

CC vanzin tgravescs, please help to review this proposal. The original purpose of this JIRA is to remove `ClientArguments`; through refactoring, some arguments like `--class` and `--arg` are not so easy to replace, so here I remove most of the command line arguments and only keep the minimal set.

Author: jerryshao <sshao@hortonworks.com>
Closes #11603 from jerryshao/SPARK-12343.

* [MINOR][SQL] Update usage of `debug` by removing `typeCheck` and adding `debugCodegen` (Dongjoon Hyun, 2016-04-01; 1 file, -2/+2)

## What changes were proposed in this pull request?
This PR updates the usage comments of `debug` according to the following commits.
- [SPARK-9754](https://issues.apache.org/jira/browse/SPARK-9754) removed `typeCheck`.
- [SPARK-14227](https://issues.apache.org/jira/browse/SPARK-14227) added `debugCodegen`.

## How was this patch tested?
Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12094 from dongjoon-hyun/minor_fix_debug_usage.

* [SPARK-14133][SQL] Throws exception for unsupported create/drop/alter index, and lock/unlock operations (sureshthalamati, 2016-04-01; 3 files, -6/+31)

## What changes were proposed in this pull request?
This PR throws an Unsupported Operation exception for create index, drop index, alter index, lock table, lock database, unlock table, and unlock database operations that are not supported in Spark SQL. Currently these operations are executed by Hive.

Error:

    spark-sql> drop index my_index on my_table;
    Error in query:
    Unsupported operation: drop index(line 1, pos 0)

## How was this patch tested?
Added test cases to HiveQuerySuite.

yhuai hvanhovell andrewor14

Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes #12069 from sureshthalamati/unsupported_ddl_spark-14133.

* [SPARK-14184][SQL] Support native execution of SHOW DATABASES command and fix SHOW TABLES to use table identifier pattern (Dilip Biswal, 2016-04-01; 8 files, -18/+163)

## What changes were proposed in this pull request?
This PR addresses the following:
1. Supports native execution of the SHOW DATABASES command
2. Fixes SHOW TABLES to apply the identifier_with_wildcards pattern if supplied.

SHOW TABLES syntax
```
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];
```
SHOW DATABASES syntax
```
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];
```

## How was this patch tested?
Tests added in SQLQuerySuite (both hive and sql contexts) and DDLCommandSuite.

Note: Since the table name pattern was not working, tests are added in both SQLQuerySuites to verify the application of the table pattern.

Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #11991 from dilipbiswal/dkb_show_database.

* [SPARK-14295][MLLIB][HOTFIX] Fixes Scala 2.10 compilation failure (Cheng Lian, 2016-04-01; 1 file, -1/+1)

## What changes were proposed in this pull request?
Fixes a compilation failure introduced in PR #12088 under Scala 2.10.

## How was this patch tested?
Compilation.

Author: Cheng Lian <lian@databricks.com>
Closes #12107 from liancheng/spark-14295-hotfix.

* [SPARK-14303][ML][SPARKR] Define and use KMeansWrapper for SparkR::kmeans (Yanbo Liang, 2016-03-31; 3 files, -80/+148)

## What changes were proposed in this pull request?
Define and use `KMeansWrapper` for `SparkR::kmeans`. It's only a code refactor of the original `KMeans` wrapper.

## How was this patch tested?
Existing tests.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12039 from yanboliang/spark-14059.

* [SPARK-11262][ML] Unit test for gradient, loss layers, memory management for multilayer perceptron (Alexander Ulanov, 2016-03-31; 9 files, -387/+601)

1. Implement LossFunction trait and implement squared error and cross entropy loss with it
2. Implement unit test for gradient and loss
3. Implement InPlace trait and in-place layer evaluation
4. Refactor interface for ActivationFunction
5. Update of Layer and LayerModel interfaces
6. Fix random weights assignment
7. Implement memory allocation by MLP model instead of individual layers

These features decreased the memory usage and increased flexibility of the internal API.

Author: Alexander Ulanov <nashb@yandex.ru>
Author: avulanov <avulanov@gmail.com>
Closes #9229 from avulanov/mlp-refactoring.

* [SPARK-14295][SPARK-14274][SQL] Implements buildReader() for LibSVM (Cheng Lian, 2016-03-31; 5 files, -34/+141)

## What changes were proposed in this pull request?
This PR implements `FileFormat.buildReader()` for the LibSVM data source. Besides that, a new interface method `prepareRead()` is added to `FileFormat`:

```scala
  def prepareRead(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Map[String, String] = options
```

After migrating from `buildInternalScan()` to `buildReader()`, we lost the opportunity to collect necessary global information, since `buildReader()` works in a per-partition manner. For example, LibSVM needs to infer the total number of features if the `numFeatures` data source option is not set. Any necessary collected global information should be returned using the data source options map. By default, this method just returns the original options untouched.

An alternative approach is to absorb `inferSchema()` into `prepareRead()`, since schema inference is also some kind of global information gathering. However, this approach wasn't chosen because schema inference is optional, while `prepareRead()` must be called whenever a `HadoopFsRelation` based data source relation is instantiated.

One unaddressed problem is that, when `numFeatures` is absent, now the input data will be scanned twice. The `buildInternalScan()` code path doesn't need to do this because it caches the raw parsed RDD in memory before computing the total number of features. However, with `FileScanRDD`, the raw parsed RDD is created in a different way (e.g. partitioning) from the final RDD.

## How was this patch tested?
Tested using existing test suites.

Author: Cheng Lian <lian@databricks.com>
Closes #12088 from liancheng/spark-14295-libsvm-build-reader.

* [SPARK-14242][CORE][NETWORK] Avoid copy in compositeBuffer for frame decoder (Zhang, Liye, 2016-03-31; 1 file, -1/+1)

## What changes were proposed in this pull request?
In this patch, we set the initial `maxNumComponents` to `Integer.MAX_VALUE` instead of the default size (which is 16) when allocating `compositeBuffer` in `TransportFrameDecoder`, because `compositeBuffer` introduces too many underlying memory copies if it is created with the default `maxNumComponents` when the frame size is large (which results in many transport messages). For details, please refer to [SPARK-14242](https://issues.apache.org/jira/browse/SPARK-14242).

## How was this patch tested?
Spark unit tests and manual tests. For manual tests, we can reproduce the performance issue with the following code:
`sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length`
It's easy to see the performance gain, both from the running time and CPU usage.

Author: Zhang, Liye <liye.zhang@intel.com>
Closes #12038 from liyezhang556520/spark-14242.

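As a standalone sketch of the Netty behavior behind this change (not Spark code; the component count and buffer sizes are arbitrary): a `CompositeByteBuf` created with a small `maxNumComponents` consolidates its components into a single buffer (a memory copy) whenever the limit is exceeded, while a very large limit keeps the components as-is.

```scala
import io.netty.buffer.{ByteBufAllocator, Unpooled}

object CompositeBufferSketch {
  def main(args: Array[String]): Unit = {
    val alloc = ByteBufAllocator.DEFAULT

    // Default-sized composite buffer: maxNumComponents is 16, so adding more
    // components triggers consolidation (an extra copy) as the frame grows.
    val small = alloc.compositeBuffer()
    // Large limit, as TransportFrameDecoder uses after this patch: components
    // are kept separate, so no extra copies when a frame spans many messages.
    val large = alloc.compositeBuffer(Int.MaxValue)

    (0 until 100).foreach { i =>
      small.addComponent(Unpooled.wrappedBuffer(Array.fill[Byte](1024)(i.toByte)))
      large.addComponent(Unpooled.wrappedBuffer(Array.fill[Byte](1024)(i.toByte)))
    }
    println(s"small components: ${small.numComponents()}, large components: ${large.numComponents()}")
    small.release(); large.release()
  }
}
```
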
* [SPARK-14267][SQL][PYSPARK] Execute multiple Python UDFs within single batch (Davies Liu, 2016-03-31; 8 files, -101/+233)

## What changes were proposed in this pull request?
This PR supports multiple Python UDFs within a single batch, and also improves the performance.

```python
>>> from pyspark.sql.types import IntegerType
>>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
>>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
>>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
+- OneRowRelation$

== Analyzed Logical Plan ==
double(add(1, 2)): int, add(double(2), 1): int
Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [double(add(1, 2))#14,add(double(2), 1)#15]
   +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
         +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- OneRowRelation$

== Optimized Logical Plan ==
Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
   +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- OneRowRelation$

== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
:     +- INPUT
+- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
   +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- Scan OneRowRelation[]
```

## How was this patch tested?
Added new tests.

Using the following script to benchmark 1, 2 and 3 UDFs:
```
df = sqlContext.range(1, 1 << 23, 1, 4)
double = F.udf(lambda x: x * 2, LongType())
print df.select(double(df.id)).count()
print df.select(double(df.id), double(df.id + 1)).count()
print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
```
Here are the results:

N | Before | After | speed up
---- | ------ | ----- | --------
1 | 22 s | 7 s | 3.1X
2 | 38 s | 13 s | 2.9X
3 | 58 s | 16 s | 3.6X

This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python processes before this patch and 4 processes after this patch. After this patch, it will use less memory for multiple UDFs than before (less buffering).

Author: Davies Liu <davies@databricks.com>
Closes #12057 from davies/multi_udfs.

* [SPARK-14277][CORE] Upgrade Snappy Java to 1.1.2.4 (Sital Kedia, 2016-03-31; 6 files, -6/+6)

## What changes were proposed in this pull request?
Upgrade snappy to 1.1.2.4 to improve snappy read/write performance.

## How was this patch tested?
Tested by running a job on the cluster and saw 7.5% cpu savings after this change.

Author: Sital Kedia <skedia@fb.com>
Closes #12096 from sitalkedia/snappyRelease.

* [SPARK-14281][TESTS] Fix java8-tests and simplify their build (Josh Rosen, 2016-03-31; 7 files, -105/+46)

This patch fixes a compilation / build break in Spark's `java8-tests` and refactors their POM to simplify the build. See individual commit messages for more details.

Author: Josh Rosen <joshrosen@databricks.com>
Closes #12073 from JoshRosen/fix-java8-tests.

* [SPARK-14264][PYSPARK][ML] Add feature importance for GBTs in pyspark (sethah, 2016-03-31; 2 files, -20/+46)

## What changes were proposed in this pull request?
Feature importances are exposed in the Python API for GBTs.

Other changes:
* Update the random forest feature importance documentation to not repeat the decision tree docstring and instead place a reference to it.

## How was this patch tested?
Python doc tests were updated to validate GBT feature importance.

Author: sethah <seth.hendrickson16@gmail.com>
Closes #12056 from sethah/Pyspark_GBT_feature_importance.

* [SPARK-14304][SQL][TESTS] Fix tests that don't create temp files in the `java.io.tmpdir` folder (Shixiong Zhu, 2016-03-31; 6 files, -22/+24)

## What changes were proposed in this pull request?
If I press `CTRL-C` when running these tests, the temp files will be left in the `sql/core` folder and I need to delete them manually. It's annoying. This PR just moves the temp files to the `java.io.tmpdir` folder and adds a name prefix for them.

## How was this patch tested?
Existing Jenkins tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Closes #12093 from zsxwing/temp-file.

* [SPARK-13710][SHELL][WINDOWS] Fix jline dependency on Windows (Michel Lemay, 2016-03-31; 1 file, -0/+4)

## What changes were proposed in this pull request?
Exclude jline from curator-recipes since it conflicts with Scala 2.11 when running spark-shell. This should not affect Scala 2.10, since jline is built in there.

## How was this patch tested?
Ran spark-shell manually.

Author: Michel Lemay <mlemay@gmail.com>
Closes #12043 from michellemay/spark-13710-fix-jline-on-windows.

* [SPARK-11327][MESOS] Dispatcher does not respect all args from the Submit request (Jo Voordeckers, 2016-03-31; 2 files, -0/+62)

Supersedes https://github.com/apache/spark/pull/9752

Author: Jo Voordeckers <jo.voordeckers@gmail.com>
Author: Iulian Dragos <jaguarul@gmail.com>
Closes #10370 from jayv/mesos_cluster_params.

* [SPARK-14069][SQL] Improve SparkStatusTracker to also track executor information (Wenchen Fan, 2016-03-31; 6 files, -16/+80)

## What changes were proposed in this pull request?
Track executor information like host and port, cache size, and running tasks.

TODO: tests

## How was this patch tested?
N/A

Author: Wenchen Fan <wenchen@databricks.com>
Closes #11888 from cloud-fan/status-tracker.

* [Docs] Update monitoring.md to accurately describe the history server (Michael Gummelt, 2016-03-31; 1 file, -29/+29)

It looks like the docs were recently updated to reflect the History Server's support for incomplete applications, but they still had wording that suggested only completed applications were viewable. This fixes that.

My editor also introduced several whitespace removal changes, that I hope are OK, as text files shouldn't have trailing whitespace. To verify they're purely whitespace changes, add `&w=1` to your browser address. If this isn't acceptable, let me know and I'll update the PR.

I also didn't think this required a JIRA. Let me know if I should create one.

Not tested.

Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes #12045 from mgummelt/update-history-docs.

* [SPARK-14243][CORE] Update task metrics when removing blocks (jeanlyn, 2016-03-31; 2 files, -2/+15)

## What changes were proposed in this pull request?
This PR tries to use `incUpdatedBlockStatuses` to update `updatedBlockStatuses` when removing blocks, making sure `BlockManager` correctly updates `updatedBlockStatuses`.

## How was this patch tested?
test("updated block statuses") in BlockManagerSuite.scala

Author: jeanlyn <jeanlyn92@gmail.com>
Closes #12091 from jeanlyn/updateBlock.

* [SPARK-14182][SQL] Parse DDL Command: Alter View (gatorsmile, 2016-03-31; 4 files, -55/+177)

This PR provides native parsing support for the DDL command `ALTER VIEW`. Since its AST trees are highly similar to `ALTER TABLE`, both implementations are integrated into the same one. Based on the Hive DDL documents: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL and https://cwiki.apache.org/confluence/display/Hive/PartitionedViews

**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- to change the name of a view to a different name

**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- to add metadata to a view

**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- to remove metadata from a view

**Syntax:**
```SQL
ALTER VIEW view_name ADD [IF NOT EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
- to add the partitioning metadata for a view
- the syntax of the partition spec in `ALTER VIEW` is identical to `ALTER TABLE`, **EXCEPT** that it is **ILLEGAL** to specify a `LOCATION` clause.

**Syntax:**
```SQL
ALTER VIEW view_name DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
- to drop the related partition metadata for a view

Added the related test cases to `DDLCommandSuite`.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes #11987 from gatorsmile/parseAlterView.

* [SPARK-13796] Redirect error message to logWarning (Nishkam Ravi, 2016-03-31; 1 file, -1/+1)

## What changes were proposed in this pull request?
Redirect error message to logWarning.

## How was this patch tested?
Unit tests, manual tests.

JoshRosen

Author: Nishkam Ravi <nishkamravi@gmail.com>
Closes #12052 from nishkamravi2/master_warning.

* [SPARK-14278][SQL] Initialize columnar batch with proper memory mode (Sameer Agarwal, 2016-03-31; 1 file, -1/+1)

## What changes were proposed in this pull request?
Fixes a minor bug in the record reader constructor that was possibly introduced during refactoring.

## How was this patch tested?
N/A

Author: Sameer Agarwal <sameer@databricks.com>
Closes #12070 from sameeragarwal/vectorized-rr.

* [SPARK-14263][SQL] Benchmark Vectorized HashMap for GroupBy Aggregates (Sameer Agarwal, 2016-03-31; 2 files, -10/+142)

## What changes were proposed in this pull request?
This PR proposes a new data structure based on a vectorized hashmap that can potentially be codegened in `TungstenAggregate` to speed up aggregates with group by. Micro-benchmarks show a 10x improvement over the current `BytesToBytes` aggregation map.

## How was this patch tested?
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz

    BytesToBytesMap:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    hash                                 108 /  119         96.9          10.3       1.0X
    fast hash                             63 /   70        166.2           6.0       1.7X
    arrayEqual                            70 /   73        150.8           6.6       1.6X
    Java HashMap (Long)                  141 /  200         74.3          13.5       0.8X
    Java HashMap (two ints)              145 /  185         72.3          13.8       0.7X
    Java HashMap (UnsafeRow)             499 /  524         21.0          47.6       0.2X
    BytesToBytesMap (off Heap)           483 /  548         21.7          46.0       0.2X
    BytesToBytesMap (on Heap)            485 /  562         21.6          46.2       0.2X
    Vectorized Hashmap                    54 /   60        193.7           5.2       2.0X

Author: Sameer Agarwal <sameer@databricks.com>
Closes #12055 from sameeragarwal/vectorized-hashmap.

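To show the general idea behind a vectorized (columnar) aggregation hash map, here is a toy sketch only, not Spark's implementation: keys and aggregation buffers live in flat parallel arrays probed with open addressing, which is far more cache-friendly than a `java.util.HashMap` of boxed objects. The capacity, hash function, and the single sum aggregate are arbitrary choices for this sketch.

```scala
object ToyColumnarHashMap {
  private val capacity = 1 << 16            // must be a power of two
  private val mask = capacity - 1
  private val keys = new Array[Long](capacity)
  private val sums = new Array[Long](capacity)
  private val used = new Array[Boolean](capacity)

  /** Add `value` to the running sum for `key`, inserting the key if needed. */
  def addToSum(key: Long, value: Long): Unit = {
    var pos = (java.lang.Long.hashCode(key) & Int.MaxValue) & mask
    while (used(pos) && keys(pos) != key) {
      pos = (pos + 1) & mask                // linear probing
    }
    if (!used(pos)) { used(pos) = true; keys(pos) = key; sums(pos) = 0L }
    sums(pos) += value
  }

  def main(args: Array[String]): Unit = {
    (0L until 1000000L).foreach(i => addToSum(i % 1000, 1L))
    println(s"distinct keys: ${used.count(identity)}")   // expect 1000
  }
}
```
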
* [SPARK-11892][ML] Model export/import for spark.ml: OneVsRest (Xusen Yin, 2016-03-31; 3 files, -18/+223)

## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-11892

Add save/load for spark ml.OneVsRest and its model. Also add OneVsRest and OneVsRestModel in MetaAlgorithmReadWrite.

## How was this patch tested?
Test with Scala unit test.

Author: Xusen Yin <yinxusen@gmail.com>
Closes #9934 from yinxusen/SPARK-11892.