* [SPARK-11723][ML][DOC] Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame (Yanbo Liang, 2015-11-13, 26 files, -130/+79)
  Use the LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrames. This includes:
  * Use the libSVM data source for all example code under examples/ml, and remove unused imports.
  * Use the libSVM data source for the user guides under ml-*** that were omitted by #8697.
  * Fix bug: we should use ```sqlContext.read().format("libsvm").load(path)``` on the Java side, but the API doc and user guides incorrectly use ```sqlContext.read.format("libsvm").load(path)```.
  * Code cleanup.
  mengxr
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9690 from yanboliang/spark-11723.
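  For reference, a minimal sketch of the Scala-side usage described above (assuming an existing SparkContext `sc` and the bundled sample data path):
  ```scala
  import org.apache.spark.sql.SQLContext

  // a minimal sketch, assuming an existing SparkContext `sc`; the loaded
  // DataFrame has "label" and "features" columns
  val sqlContext = new SQLContext(sc)
  val df = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
  df.show(5)
  ```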
* [SPARK-11445][DOCS] Replaced example code in mllib-ensembles.md using include_example (Rishabh Bhardwaj, 2015-11-13, 13 files, -514/+885)
  I have made the required changes and tested. Kindly review the changes.
  Author: Rishabh Bhardwaj <rbnext29@gmail.com>
  Closes #9407 from rishabhbhardwaj/SPARK-11445.
* [SPARK-11678][SQL] Partition discovery should stop at the root path of the table. (Yin Huai, 2015-11-13, 10 files, -51/+235)
  https://issues.apache.org/jira/browse/SPARK-11678
  This PR passes the root paths of the table to the partition discovery logic, so that partition discovery stops at those root paths instead of walking all the way up to the root path of the file system.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #9651 from yhuai/SPARK-11678.
* [SPARK-11706][STREAMING] Fix the bug that Streaming Python tests cannot report failures (Shixiong Zhu, 2015-11-13, 5 files, -18/+30)
  This PR just checks the test results and returns 1 if a test fails, so that `run-tests.py` can mark it as failed.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #9669 from zsxwing/streaming-python-tests.
* [SPARK-8029] Robust shuffle writer (Davies Liu, 2015-11-12, 16 files, -52/+402)
  Currently, all shuffle writers write to the target path directly, so the file could be corrupted by another attempt at the same partition on the same executor. They should write to a temporary file and then rename it to the target path, as we do in the output committer. In order to make the rename atomic, the temporary file should be created in the same local directory (FileSystem). This PR is based on #9214, thanks to squito.
  Closes #9214
  Author: Davies Liu <davies@databricks.com>
  Closes #9610 from davies/safe_shuffle.
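  The underlying pattern, as a minimal sketch in plain JDK terms (not Spark's actual shuffle writer code):
  ```scala
  import java.io.File
  import java.nio.file.{Files, StandardCopyOption}

  // a minimal sketch of the write-to-temp-then-rename pattern described above,
  // not Spark's actual shuffle code: the temp file is created in the same
  // directory as the target so the final move can be atomic
  def writeAtomically(target: File)(write: File => Unit): Unit = {
    val tmp = File.createTempFile(target.getName, ".tmp", target.getParentFile)
    try {
      write(tmp)
      Files.move(tmp.toPath, target.toPath, StandardCopyOption.ATOMIC_MOVE)
    } finally {
      tmp.delete() // no-op if the move already succeeded
    }
  }
  ```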
* [SPARK-11629][ML][PYSPARK][DOC] Python example code for Multilayer Perceptron Classification (Yanbo Liang, 2015-11-12, 4 files, -66/+206)
  Add Python example code for Multilayer Perceptron Classification, and make the example code in the user guide document testable. mengxr
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9594 from yanboliang/spark-11629.
* [SPARK-11717] Ignore R session and history files from git (Lewuathe, 2015-11-12, 1 file, -0/+4)
  See: https://issues.apache.org/jira/browse/SPARK-11717
  SparkR generates R session data and history files under the current directory. It is useful to ignore these files even when running SparkR from the Spark directory for testing or development.
  Author: Lewuathe <lewuathe@me.com>
  Closes #9681 from Lewuathe/SPARK-11717.
* [SPARK-11263][SPARKR] lintr Throws Warnings on Commented Code in Documentation (felixcheung, 2015-11-12, 8 files, -1512/+1539)
  Clean out hundreds of `style: Commented code should be removed.` warnings from lintr, like these:
  ```
  /opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:513:3: style: Commented code should be removed.
  # sc <- sparkR.init()
  ^~~~~~~~~~~~~~~~~~~
  /opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:514:3: style: Commented code should be removed.
  # sqlContext <- sparkRSQL.init(sc)
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /opt/spark-1.6.0-bin-hadoop2.6/R/pkg/R/DataFrame.R:515:3: style: Commented code should be removed.
  # path <- "path/to/file.json"
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~
  ```
  Tried without export or rdname; neither worked. Instead, added `#' noRd` to suppress .Rd file generation. Also updated `family` for DataFrame functions to longer descriptive text instead of `dataframe_funcs`.
  ![image](https://cloud.githubusercontent.com/assets/8969467/10933937/17bf5b1e-8291-11e5-9777-40fc632105dc.png)
  This covers *most* of the 'Commented code' warnings, but I left out a few that look legitimate.
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #9463 from felixcheung/rlintr.
* [SPARK-11672][ML] flaky spark.ml read/write tests (Xiangrui Meng, 2015-11-12, 5 files, -5/+7)
  We set `sqlContext = null` in `afterAll`. However, this doesn't change `SQLContext.activeContext`, and `SQLContext.getOrCreate` might then use the `SparkContext` from a previous test suite, which causes the error. This PR calls `clearActive` in `beforeAll` and `afterAll` to avoid using an old context from other test suites. cc: yhuai
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9677 from mengxr/SPARK-11672.2.
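  The pattern described above, as a minimal sketch; note that `SQLContext.clearActive()` is internal to Spark's own sql package, so this shape applies to Spark's in-tree test suites rather than user code:
  ```scala
  import org.apache.spark.sql.SQLContext
  import org.scalatest.{BeforeAndAfterAll, FunSuite}

  // a minimal sketch of the beforeAll/afterAll pattern described above;
  // SQLContext.clearActive() is an internal Spark API, so treat this as
  // illustrative rather than something user code can compile as-is
  abstract class ContextIsolatedSuite extends FunSuite with BeforeAndAfterAll {
    override def beforeAll(): Unit = {
      super.beforeAll()
      SQLContext.clearActive() // don't inherit an active context from a prior suite
    }

    override def afterAll(): Unit = {
      SQLContext.clearActive() // don't leak our context to the next suite
      super.afterAll()
    }
  }
  ```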
* [SPARK-11681][STREAMING] Correctly update state timestamp even when state is not updated (Tathagata Das, 2015-11-12, 2 files, -49/+192)
  Bug: the timestamp is not updated if there is data but the corresponding state is not updated. This is wrong, since timeout is defined as "no data for a while", not "no state update for a while".
  Fix: update the timestamp when a timeout is specified; otherwise there is no need. Also refactored the code for better testability and added unit tests.
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #9648 from tdas/SPARK-11681.
* [SPARK-11419][STREAMING] Parallel recovery for FileBasedWriteAheadLog + minor recovery tweaks (Burak Yavuz, 2015-11-12, 7 files, -37/+268)
  The support for closing WriteAheadLog files after writes was just merged in. Closing every file after a write is a very expensive operation, as it creates many small files on S3. It's not necessary to enable it on HDFS anyway. However, when you have many small files on S3, recovery takes very long. In addition, files start stacking up pretty quickly, and deletes may not be able to keep up, so deletes can also be parallelized. This PR adds support for the two parallelization steps mentioned above, and fixes a couple more failures I encountered during recovery.
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #9373 from brkyvz/par-recovery.
* [SPARK-11663][STREAMING] Add Java API for trackStateByKey (Shixiong Zhu, 2015-11-12, 12 files, -52/+485)
  TODO
  - [x] Add Java API
  - [x] Add API tests
  - [x] Add a function test
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #9636 from zsxwing/java-track.
* [SPARK-11654][SQL] add reduce to GroupedDataset (Michael Armbrust, 2015-11-12, 15 files, -197/+309)
  This PR adds a new method, `reduce`, to `GroupedDataset`, which allows operations similar to `reduceByKey` on a traditional `PairRDD`.
  ```scala
  val ds = Seq("abc", "xyz", "hello").toDS()
  ds.groupBy(_.length).reduce(_ + _).collect()  // not actually commutative :P

  res0: Array(3 -> "abcxyz", 5 -> "hello")
  ```
  While implementing this method and its test cases, several more deficiencies were found in our encoder handling. Specifically, in order to support positional resolution, named resolution, and tuple composition, it is important to keep the unresolved encoder around and to use it when constructing new `Datasets` with the same object type but different output attributes. We now divide the encoder lifecycle into three phases (that mirror the lifecycle of standard expressions) and have checks at various boundaries:
  - Unresolved Encoders: all user-facing encoders (those constructed by implicits, static methods, or tuple composition) are unresolved, meaning they have only `UnresolvedAttributes` for named fields and `BoundReferences` for fields accessed by ordinal.
  - Resolved Encoders: internal to a `[Grouped]Dataset`, the encoder is resolved, meaning all input has been resolved to a specific `AttributeReference`. Any encoders that are placed into a logical plan for use in object construction should be resolved.
  - BoundEncoders: constructed by physical plans, right before the actual row -> object conversion is performed.
  It is left to future work to add explicit checks for resolution and provide good error messages when it fails. We might also consider enforcing the above constraints in the type system (i.e. `fromRow` only exists on a `ResolvedEncoder`), but we should probably wait before spending too much time on this.
  Author: Michael Armbrust <michael@databricks.com>
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9673 from marmbrus/pr/9628.
* [SPARK-11712][ML] Make spark.ml LDAModel be abstract (Joseph K. Bradley, 2015-11-12, 2 files, -88/+96)
  Per discussion in the initial Pipelines LDA PR [https://github.com/apache/spark/pull/9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases. CC feynmanliang mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9678 from jkbradley/lda-pipelines-2.
* [SPARK-11709] include creation site info in SparkContext.assertNotStopped error message (Xiangrui Meng, 2015-11-12, 1 file, -1/+17)
  This helps debug issues caused by multiple SparkContext instances. JoshRosen andrewor14
  ~~~
  scala> sc.stop()

  scala> sc.parallelize(0 until 10)
  java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
  This stopped SparkContext was created at:

  org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
  org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
  $iwC$$iwC.<init>(<console>:9)
  $iwC.<init>(<console>:18)
  <init>(<console>:20)
  .<init>(<console>:24)
  .<clinit>(<console>)
  .<init>(<console>:7)
  .<clinit>(<console>)
  $print(<console>)
  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  java.lang.reflect.Method.invoke(Method.java:606)
  org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
  org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
  org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
  org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
  org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
  org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)

  The active context was created at:

  (No active SparkContext.)
  ~~~
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9675 from mengxr/SPARK-11709.
* [SPARK-11658] simplify documentation for PySpark combineByKey (Chris Snow, 2015-11-12, 1 file, -1/+0)
  Author: Chris Snow <chsnow123@gmail.com>
  Closes #9640 from snowch/patch-3.
* [SPARK-11667] Update dynamic allocation docs to reflect supported cluster managers (Andrew Or, 2015-11-12, 1 file, -28/+27)
  Author: Andrew Or <andrew@databricks.com>
  Closes #9637 from andrewor14/update-da-docs.
* [SPARK-11670] Fix incorrect kryo buffer default value in docs (Andrew Or, 2015-11-12, 1 file, -2/+2)
  Screenshot: https://cloud.githubusercontent.com/assets/2133137/11108261/35d183d4-889a-11e5-9572-85e9d6cebd26.png
  Author: Andrew Or <andrew@databricks.com>
  Closes #9638 from andrewor14/fix-kryo-docs.
* [SPARK-2533] Add locality levels on stage summary view (Jean-Baptiste Onofré, 2015-11-12, 1 file, -1/+20)
  Author: Jean-Baptiste Onofré <jbonofre@apache.org>
  Closes #9487 from jbonofre/SPARK-2533-2.
* [SPARK-11671] documentation code example typo (Chris Snow, 2015-11-12, 1 file, -1/+1)
  The documentation example for creating a DataFrame from a pandas.DataFrame has a typo (`sqlContext.createDataDrame`).
  Author: Chris Snow <chsnow123@gmail.com>
  Closes #9639 from snowch/patch-2.
* [SPARK-11290][STREAMING][TEST-MAVEN] Fix the test for maven build (Shixiong Zhu, 2015-11-12, 1 file, -3/+9)
  We should not create a SparkContext in the constructor of `TrackStateRDDSuite`. This is a follow-up PR for #9256 to fix the test for the maven build.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #9668 from zsxwing/hotfix.
* [SPARK-11655][CORE] Fix deadlock in handling of launcher stop(). (Marcelo Vanzin, 2015-11-12, 4 files, -13/+39)
  The stop() callback was trying to close the launcher connection in the same thread that handles connection data, which ended up causing a deadlock. So avoid that by dispatching the stop() request in its own thread.
  On top of that, add some exception safety to a few parts of the code, and use "destroyForcibly" from Java 8 if it's available, to force-kill the child process. The flip side is that "kill()" may not actually work when running Java 7.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #9633 from vanzin/SPARK-11655.
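  The core idea, as a minimal sketch with hypothetical names (not the actual launcher code): never close a connection from the thread that services it.
  ```scala
  import java.io.Closeable

  // a minimal sketch of the fix's idea with hypothetical names, not Spark's
  // actual launcher code: hand the close() to a separate thread so the
  // connection-handling thread never blocks on it
  def dispatchStop(connection: Closeable): Unit = {
    val closer = new Thread(new Runnable {
      override def run(): Unit =
        try connection.close()
        catch { case e: java.io.IOException => e.printStackTrace() }
    }, "launcher-connection-closer")
    closer.setDaemon(true)
    closer.start()
  }
  ```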
* [SPARK-11420] Updating Stddev support via Imperative Aggregate (JihongMa, 2015-11-12, 10 files, -115/+52)
  Switched stddev support from DeclarativeAggregate to ImperativeAggregate.
  Author: JihongMa <linlin200605@gmail.com>
  Closes #9380 from JihongMA/SPARK-11420.
* [SPARK-10113][SQL] Explicit error message for unsigned Parquet logical type (hyukjinkwon, 2015-11-12, 2 files, -0/+31)
  Parquet supports some unsigned datatypes. However, since Spark does not support unsigned datatypes, it needs to emit an exception with a clear message rather than one saying "illegal datatype".
  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #9646 from HyukjinKwon/SPARK-10113.
* [SPARK-11191][SQL] Looks up temporary function using execution Hive client (Cheng Lian, 2015-11-12, 3 files, -5/+56)
  When looking up Hive temporary functions, we should always use the `SessionState` within the execution Hive client, since temporary functions are registered there.
  Author: Cheng Lian <lian@databricks.com>
  Closes #9664 from liancheng/spark-11191.fix-temp-function.
* Fixed error in scaladoc of convertToCanonicalEdges (Gaurav Kumar, 2015-11-12, 1 file, -1/+1)
  convertToCanonicalEdges ensures that srcIds are smaller than dstIds, but the scaladoc suggested otherwise. This fixes it.
  Author: Gaurav Kumar <gauravkumar37@gmail.com>
  Closes #9666 from gauravkumar37/patch-1.
* [BUILD][MINOR] Remove non-exist yarnStable module in Sbt project (jerryshao, 2015-11-12, 1 file, -4/+2)
  Removes some old YARN-related build code. Please review, thanks a lot.
  Author: jerryshao <sshao@hortonworks.com>
  Closes #9625 from jerryshao/remove-old-module.
* [SPARK-11673][SQL] Remove the normal Project physical operator (and keep TungstenProject) (Reynold Xin, 2015-11-12, 27 files, -287/+80)
  Also makes full outer join able to produce UnsafeRows.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9643 from rxin/SPARK-11673.
* [SPARK-11661][SQL] Still pushdown filters returned by unhandledFilters. (Yin Huai, 2015-11-12, 5 files, -24/+71)
  https://issues.apache.org/jira/browse/SPARK-11661
  Author: Yin Huai <yhuai@databricks.com>
  Closes #9634 from yhuai/unhandledFilters.
* [SPARK-11674][ML] add private val after @transient in Word2VecModel (Xiangrui Meng, 2015-11-11, 1 file, -1/+1)
  This causes a compile failure with Scala 2.11; see https://issues.scala-lang.org/browse/SI-8813. (Jenkins won't test Scala 2.11. I tested the compile locally.) JoshRosen
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9644 from mengxr/SPARK-11674.
* [SPARK-11396] [SQL] add native implementation of datetime function to_unix_timestamp (Daoyuan Wang, 2015-11-11, 4 files, -5/+77)
  `to_unix_timestamp` is the deterministic version of `unix_timestamp`, as it accepts at least one parameter. Since the behavior here is quite similar to `unix_timestamp`, I think a DataFrame API is not necessary here.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #9347 from adrian-wang/to_unix_timestamp.
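  For reference, a minimal sketch of calling the function through SQL (assuming an existing `sqlContext`); because the timestamp argument is explicit, the result does not depend on when the query is evaluated:
  ```scala
  // a minimal sketch, assuming an existing SQLContext `sqlContext`; the
  // explicit input makes the expression deterministic, unlike a
  // zero-argument unix_timestamp()
  sqlContext.sql(
    "SELECT to_unix_timestamp('2015-11-11 00:00:00', 'yyyy-MM-dd HH:mm:ss') AS ts"
  ).show()
  ```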
* [SPARK-11675][SQL] Remove shuffle hash joins. (Reynold Xin, 2015-11-11, 12 files, -717/+357)
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9645 from rxin/SPARK-11675.
* [SPARK-8992][SQL] Add pivot to dataframe api (Andrew Ray, 2015-11-11, 6 files, -10/+255)
  This adds a pivot method to the dataframe api. Following the lead of cube and rollup, this adds a Pivot operator that is translated into an Aggregate by the analyzer.
  Currently the syntax is like:
  ~~courseSales.pivot(Seq($"year"), $"course", Seq("dotNET", "Java"), sum($"earnings"))~~
  ~~Would we be interested in the following syntax also/alternatively?~~
  courseSales.groupBy($"year").pivot($"course", "dotNET", "Java").agg(sum($"earnings"))
  // or
  courseSales.groupBy($"year").pivot($"course").agg(sum($"earnings"))
  Later we can add it to `SQLParser`, but as Hive doesn't support it we can't add it there, right? ~~Also, what would be the suggested Java-friendly method signature for this?~~
  Author: Andrew Ray <ray.andrew@gmail.com>
  Closes #7841 from aray/sql-pivot.
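  As a minimal end-to-end sketch following the syntax quoted in the PR description above (the merged API may differ in details; assumes an existing SparkContext `sc`):
  ```scala
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.functions.sum

  // a minimal sketch following the pivot syntax quoted above; each course
  // becomes its own column, holding the summed earnings per year
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val courseSales = Seq(
    (2012, "dotNET", 10000), (2012, "Java", 20000),
    (2013, "dotNET", 48000), (2013, "Java", 30000)
  ).toDF("year", "course", "earnings")

  courseSales
    .groupBy($"year")
    .pivot($"course", "dotNET", "Java") // listing values skips a pass to discover them
    .agg(sum($"earnings"))
    .show()
  ```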
* [SPARK-11672][ML] disable spark.ml read/write tests (Xiangrui Meng, 2015-11-11, 4 files, -5/+5)
  Saw several failures on Jenkins, e.g., https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/. This is the first failure in the master build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3982/
  I cannot reproduce it locally, so I'm temporarily disabling the tests and will look into the issue under the same JIRA. I'm going to merge the PR after Jenkins passes compile.
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9641 from mengxr/SPARK-11672.
* [SPARK-10827] replace volatile with Atomic* in AppClient.scala. (Reynold Xin, 2015-11-11, 1 file, -33/+35)
  This is a follow-up for #9317 to replace volatile fields with AtomicBoolean and AtomicReference.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9611 from rxin/SPARK-10827.
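  A minimal sketch of the volatile-to-Atomic* pattern with hypothetical names (not AppClient's actual fields): `compareAndSet` gives an atomic read-modify-write that a plain `@volatile var` cannot.
  ```scala
  import java.util.concurrent.atomic.{AtomicBoolean, AtomicReference}

  // a minimal sketch, not AppClient's actual code: only the first caller of
  // markRegistered wins, even under concurrent calls
  class Registration {
    private val registered = new AtomicBoolean(false)
    private val endpoint = new AtomicReference[String](null)

    def markRegistered(url: String): Boolean = {
      if (registered.compareAndSet(false, true)) { // atomic check-then-act
        endpoint.set(url)
        true
      } else false
    }
  }
  ```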
* [SPARK-11647] Attempt to reduce time/flakiness of Thriftserver CLI and SparkSubmit tests (Josh Rosen, 2015-11-11, 4 files, -18/+38)
  This patch aims to reduce the test time and flakiness of HiveSparkSubmitSuite, SparkSubmitSuite, and CliSuite. Key changes:
  - Disable IO synchronization calls for Derby writes, since durability doesn't matter for tests. This was done for HiveCompatibilitySuite in #6651 and resulted in huge test speedups.
  - Add a few missing `--conf`s to disable various Spark UIs. The CliSuite, in particular, never disabled these UIs, leaving it prone to port-contention-related flakiness.
  - Fix two instances where tests defined `beforeAll()` methods which were never called because the appropriate traits were not mixed in. I updated these test suites to extend `BeforeAndAfterEach` so that they play nicely with our `ResetSystemProperties` trait.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #9623 from JoshRosen/SPARK-11647.
* [SPARK-11335][STREAMING] update kafka direct python docs on how to get the offset ranges for a KafkaRDD (Nick Evans, 2015-11-11, 1 file, -1/+14)
  tdas koeninger
  This updates the Spark Streaming + Kafka Integration Guide doc with a working method to access the offsets of a `KafkaRDD` through Python.
  Author: Nick Evans <me@nicolasevans.org>
  Closes #9289 from manygrams/update_kafka_direct_python_docs.
* [SPARK-11645][SQL] Remove OpenHashSet for the old aggregate. (Reynold Xin, 2015-11-11, 5 files, -316/+5)
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9621 from rxin/SPARK-11645.
* [SPARK-11644][SQL] Remove the option to turn off unsafe and codegen. (Reynold Xin, 2015-11-11, 27 files, -494/+257)
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9618 from rxin/SPARK-11644.
* [SPARK-11639][STREAMING][FLAKY-TEST] Implement BlockingWriteAheadLog for testing the BatchedWriteAheadLog (Burak Yavuz, 2015-11-11, 2 files, -47/+80)
  Several elements could be drained if the main thread is not fast enough. zsxwing warned me about a similar problem, but missed it here :( Submitting the fix using a waiter. cc tdas
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #9605 from brkyvz/fix-flaky-test.
* [SPARK-6152] Use shaded ASM5 to support closure cleaning of Java 8 compiled classes (Josh Rosen, 2015-11-11, 12 files, -57/+125)
  This patch modifies Spark's closure cleaner (and a few other places) to use ASM 5, which is necessary in order to support cleaning closures that were compiled by Java 8.
  In order to avoid ASM dependency conflicts, Spark excludes ASM from all of its dependencies and uses a shaded version of ASM 4 that comes from `reflectasm` (see [SPARK-782](https://issues.apache.org/jira/browse/SPARK-782) and #232). This patch updates Spark to use a shaded version of ASM 5.0.4 that was published by the Apache XBean project; the POM used to create the shaded artifact can be found at https://github.com/apache/geronimo-xbean/blob/xbean-4.4/xbean-asm5-shaded/pom.xml.
  http://movingfulcrum.tumblr.com/post/80826553604/asm-framework-50-the-missing-migration-guide was a useful resource while upgrading the code to use the new ASM5 opcodes.
  I also added new regression tests in the `java8-tests` subproject; the existing tests were insufficient to catch this bug, which only affected Scala 2.11 user code compiled targeting Java 8.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #9512 from JoshRosen/SPARK-6152.
* [SQL][MINOR] remove newLongEncoder in functions (Wenchen Fan, 2015-11-11, 1 file, -4/+2)
  It may shadow the one from implicits in some cases.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9629 from cloud-fan/minor.
* [SPARK-11564][SQL][FOLLOW-UP] clean up java tuple encoder (Wenchen Fan, 2015-11-11, 14 files, -113/+65)
  We need to support custom classes like java beans and combine them into tuples, and it's very hard to do with the TypeTag-based approach. We should keep only the compose-based way to create tuple encoders. This PR also moves `Encoder` to `org.apache.spark.sql`.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9567 from cloud-fan/java.
* [SPARK-11656][SQL] support typed aggregate in project list (Wenchen Fan, 2015-11-11, 2 files, -4/+27)
  Insert `aEncoder` like we do in `agg`.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9630 from cloud-fan/select.
* [SQL][MINOR] rename present to finish in Aggregator (Wenchen Fan, 2015-11-11, 3 files, -5/+5)
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9617 from cloud-fan/tmp.
* [SPARK-11646] WholeTextFileRDD should return Text rather than String (Reynold Xin, 2015-11-11, 5 files, -44/+69)
  If it returns Text, we can reuse it in Spark SQL to provide a WholeTextFile data source and directly convert the Text into UTF8String without extra string decoding and encoding.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9622 from rxin/SPARK-11646.
* [SPARK-11626][ML] ml.feature.Word2Vec.transform() function very slow (Yuming Wang, 2015-11-11, 1 file, -18/+16)
  org.apache.spark.ml.feature.Word2Vec.transform() is very slow: we should not read the broadcast variable for every sentence.
  Author: Yuming Wang <q79969786@gmail.com>
  Author: yuming.wang <q79969786@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9592 from 979969786/master.
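  The fix's idea, as a minimal sketch with hypothetical names (not the actual Word2Vec code): dereference the broadcast once per partition rather than once per sentence.
  ```scala
  import org.apache.spark.broadcast.Broadcast
  import org.apache.spark.rdd.RDD

  // a minimal sketch with hypothetical names, not the actual Word2Vec code:
  // bcVectors.value is read once per partition, not once per record
  def transform(sentences: RDD[Seq[String]],
                bcVectors: Broadcast[Map[String, Double]]): RDD[Double] = {
    sentences.mapPartitions { iter =>
      val vectors = bcVectors.value // one broadcast read per partition
      iter.map(sentence => sentence.flatMap(vectors.get).sum)
    }
  }
  ```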
* [SPARK-10371][SQL][FOLLOW-UP] fix code style (Wenchen Fan, 2015-11-11, 3 files, -33/+30)
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9627 from cloud-fan/follow.
* [SPARK-11500][SQL] Not deterministic order of columns when using merging schemas. (hyukjinkwon, 2015-11-11, 3 files, -17/+55)
  https://issues.apache.org/jira/browse/SPARK-11500
  As filed in SPARK-11500, if merging schemas is enabled, the order in which files are touched can affect the ordering of the output columns. This was mostly because of the use of `Set` and `Map`, so I replaced them with `LinkedHashSet` and `LinkedHashMap` to keep the insertion order.
  Also, I changed `reduceOption` to `reduceLeftOption`, and changed the order of `filesToTouch` from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` to `needMerged ++ metadataStatuses ++ commonMetadataStatuses` in order to touch the part-files first, which always have the schema in their footers, whereas the others might not exist.
  One nit: if merging schemas is not enabled but multiple files are given, there is no guarantee of the output order, since there might not be a summary file for the first file, which ends up putting the columns of the other files ahead. However, this should be okay, since disabling merging schemas assumes all the files have the same schemas.
  In addition, in the test code for this, I only checked the names of fields.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #9517 from HyukjinKwon/SPARK-11500.
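  The collection difference driving the fix, as a minimal self-contained sketch: a plain hash-based set gives no iteration-order guarantee, while LinkedHashSet iterates in insertion order, which keeps the merged schema's column order deterministic.
  ```scala
  import scala.collection.mutable

  // a minimal sketch of the Set-vs-LinkedHashSet difference behind the fix
  val unordered = mutable.HashSet("c", "a", "b")
  val ordered = mutable.LinkedHashSet("c", "a", "b")

  println(unordered.toSeq) // hash order: may be any permutation
  println(ordered.toSeq)   // always List(c, a, b): insertion order preserved
  ```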
* [SPARK-11290][STREAMING] Basic implementation of trackStateByKey (Tathagata Das, 2015-11-10, 10 files, -19/+2125)
  The current updateStateByKey provides stateful processing in Spark Streaming. It allows the user to maintain per-key state and manage that state using an updateFunction. The updateFunction is called for each key, and it uses new data and the existing state of the key to generate an updated state. However, based on community feedback, we have learned the following lessons:
  * Need for more optimized state management that does not scan every key
  * Need to make it easier to implement common use cases: (a) timeout of idle data, (b) returning items other than state
  The high-level ideas of this PR (see the usage sketch below):
  * Introduce a new API, trackStateByKey, that allows the user to update per-key state and emit arbitrary records. The new API is necessary, as it will have significantly different semantics than the existing updateStateByKey API. This API will have direct support for timeouts.
  * Internally, the system will keep the state data as a map/list within the partitions of the state RDDs. The new data RDDs will be partitioned appropriately, and for all the key-value data, it will look up the map/list in the state RDD partition and create a new list/map of updated state data. The new state RDD partition will be created based on the update data and, if necessary, with old data.
  Here is the detailed design doc. Please take a look and provide feedback as comments: https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em
  This is still WIP. Major things left to be done:
  - [x] Implement basic functionality of state tracking, with initial RDD and timeouts
  - [x] Unit tests for state tracking
  - [x] Unit tests for initial RDD and timeout
  - [ ] Unit tests for TrackStateRDD
    - [x] state creating, updating, removing
    - [ ] emitting
    - [ ] checkpointing
  - [x] Misc unit tests for State, TrackStateSpec, etc.
  - [x] Update docs and experimental tags
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #9256 from tdas/trackStateByKey.
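  A hypothetical usage sketch based on the PR description above (exact names and signatures may differ from the merged code; the API was later renamed mapWithState): keep a running count per word and emit the updated pair as the record.
  ```scala
  import org.apache.spark.streaming.{State, StateSpec}

  // a hypothetical sketch, not verified against the merged signatures:
  // the tracking function reads new data plus existing state and emits a
  // record that is not the state itself
  val trackingFunc = (word: String, one: Option[Int], state: State[Int]) => {
    val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum) // state survives across batches
    (word, sum)       // emitted record
  }

  // wordCounts: a DStream[(String, Int)], e.g. words.map(w => (w, 1))
  val tracked = wordCounts.trackStateByKey(StateSpec.function(trackingFunc))
  ```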