aboutsummaryrefslogtreecommitdiff
path: root/examples
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-12481][CORE][STREAMING][SQL] Remove usage of Hadoop deprecated APIs ↵Sean Owen2016-01-022-6/+2
| | | | | | | | | | and reflection that supported 1.x Remove use of deprecated Hadoop APIs now that 2.2+ is required Author: Sean Owen <sowen@cloudera.com> Closes #10446 from srowen/SPARK-12481.
* [SPARK-12588] Remove HttpBroadcast in Spark 2.0.Reynold Xin2015-12-301-5/+3
| | | | | | | | We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0. Author: Reynold Xin <rxin@databricks.com> Closes #10531 from rxin/SPARK-12588.
* [SPARK-12429][STREAMING][DOC] Add Accumulator and Broadcast example for ↵Shixiong Zhu2015-12-223-10/+157
| | | | | | | | | | Streaming This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10385 from zsxwing/accumulator-broadcast-example.
* [SPARK-11808] Remove Bagel.Reynold Xin2015-12-191-6/+0
| | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #10395 from rxin/SPARK-11808.
* Bump master version to 2.0.0-SNAPSHOT.Reynold Xin2015-12-191-1/+1
| | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #10387 from rxin/version-bump.
* [SPARK-9057][STREAMING] Twitter example joining to static RDD of word ↵Jeff L2015-12-183-0/+353
| | | | | | | | | | sentiment values Example of joining a static RDD of word sentiments to a streaming RDD of Tweets in order to demo the usage of the transform() method. Author: Jeff L <sha0lin@alumni.carnegiemellon.edu> Closes #8431 from Agent007/SPARK-9057.
* [SPARK-12410][STREAMING] Fix places that use '.' and '|' directly in splitShixiong Zhu2015-12-171-1/+1
| | | | | | | | String.split accepts a regular expression, so we should escape "." and "|". Author: Shixiong Zhu <shixiong@databricks.com> Closes #10361 from zsxwing/reg-bug.
* [SPARK-12364][ML][SPARKR] Add ML example for SparkRYanbo Liang2015-12-161-0/+54
| | | | | | | | | | We have DataFrame example for SparkR, we also need to add ML example under ```examples/src/main/r```. cc mengxr jkbradley shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10324 from yanboliang/spark-12364.
* [SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for ↵Yu ISHIKAWA2015-12-162-0/+129
| | | | | | | | | | | bisecting k-means This PR includes only an example code in order to finish it quickly. I'll send another PR for the docs soon. Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9952 from yu-iskw/SPARK-6518.
* [SPARK-12215][ML][DOC] User guide section for KMeans in spark.mlYu ISHIKAWA2015-12-162-28/+29
| | | | | | | | cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #10244 from yu-iskw/SPARK-12215.
* [MINOR][ML] Rename weights to coefficients for examples/DeveloperApiExampleYanbo Liang2015-12-152-19/+19
| | | | | | | | | | Rename ```weights``` to ```coefficients``` for examples/DeveloperApiExample. cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #10280 from yanboliang/spark-coefficients.
* [SPARK-12199][DOC] Follow-up: Refine example code in ml-features.mdXusen Yin2015-12-123-4/+4
| | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-12199 Follow-up PR of SPARK-11551. Fix some errors in ml-features.md mengxr Author: Xusen Yin <yinxusen@gmail.com> Closes #10193 from yinxusen/SPARK-12199.
* [SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to ↵Yanbo Liang2015-12-112-26/+38
| | | | | | | | | | | | | | dataframe_example.py Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to avoid confusion. #9873 finished the work of Scala example, here we focus on the Python one. Move dataset_example.py to ```examples/ml``` and rename to ```dataframe_example.py```. BTW, fix minor missing issues of #9873. cc mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9957 from yanboliang/SPARK-11978.
* [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input filesYanbo Liang2015-12-111-1/+1
| | | | | | | | | | | | | | | | | * ```jsonFile``` should support multiple input files, such as: ```R jsonFile(sqlContext, c(“path1”, “path2”)) # character vector as arguments jsonFile(sqlContext, “path1,path2”) ``` * Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed at Spark 2.0. So we mark ```jsonFile``` deprecated and use ```read.json``` at SparkR side. * Replace all ```jsonFile``` with ```read.json``` at test_sparkSQL.R, but still keep jsonFile test case. * If this PR is accepted, we should also make almost the same change for ```parquetFile```. cc felixcheung sun-rui shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10145 from yanboliang/spark-12146.
* [SPARK-11713] [PYSPARK] [STREAMING] Initial RDD updateStateByKey for PySparkBryan Cutler2015-12-101-1/+4
| | | | | | | | Adding ability to define an initial state RDD for use with updateStateByKey PySpark. Added unit test and changed stateful_network_wordcount example to use initial RDD. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #10082 from BryanCutler/initial-rdd-updateStateByKey-SPARK-11713.
* [SPARK-12244][SPARK-12245][STREAMING] Rename trackStateByKey to mapWithState ↵Tathagata Das2015-12-092-14/+14
| | | | | | | | | | | | | | | | | | | and change tracking function signature SPARK-12244: Based on feedback from early users and personal experience attempting to explain it, the name trackStateByKey had two problem. "trackState" is a completely new term which really does not give any intuition on what the operation is the resultant data stream of objects returned by the function is called in docs as the "emitted" data for the lack of a better. "mapWithState" makes sense because the API is like a mapping function like (Key, Value) => T with State as an additional parameter. The resultant data stream is "mapped data". So both problems are solved. SPARK-12245: From initial experiences, not having the key in the function makes it hard to return mapped stuff, as the whole information of the records is not there. Basically the user is restricted to doing something like mapValue() instead of map(). So adding the key as a parameter. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #10224 from tdas/rename.
* [SPARK-11551][DOC] Replace example code in ml-features.md using include_exampleXusen Yin2015-12-0951-0/+2769
| | | | | | | | | PR on behalf of somideshmukh, thanks! Author: Xusen Yin <yinxusen@gmail.com> Author: somideshmukh <somilde@us.ibm.com> Closes #10219 from yinxusen/SPARK-11551.
* [SPARK-12159][ML] Add user guide section for IndexToString transformerBenFradet2015-12-083-0/+180
| | | | | | | | Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10166 from BenFradet/SPARK-12159.
* [SPARK-11605][MLLIB] ML 1.6 QA: API: Java compatibility, docsYuhao Yang2015-12-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-11605 Check Java compatibility for MLlib for this release. fix: 1. `StreamingTest.registerStream` needs java friendly interface. 2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` has java compatibility issue. Mark them as `developerAPI`. TBD: [updated] no fix for now per discussion. `org.apache.spark.mllib.classification.LogisticRegressionModel` `public scala.Option<java.lang.Object> getThreshold();` has wrong return type for Java invocation. `SVMModel` has the similar issue. Yet adding a `scala.Option<java.util.Double> getThreshold()` would result in an overloading error due to the same function signature. And adding a new function with different name seems to be not necessary. cc jkbradley feynmanliang Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10102 from hhbyyh/javaAPI.
* [SPARK-10393] use ML pipeline in LDA exampleYuhao Yang2015-12-081-113/+40
| | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-10393 Since the logic of the text processing part has been moved to ML estimators/transformers, replace the related code in LDA Example with the ML pipeline. Author: Yuhao Yang <hhbyyh@gmail.com> Author: yuhaoyang <yuhao@zhanglipings-iMac.local> Closes #8551 from hhbyyh/ldaExUpdate.
* [SPARK-11551][DOC][EXAMPLE] Revert PR #10002Cheng Lian2015-12-0851-2755/+0
| | | | | | | | | | This reverts PR #10002, commit 78209b0ccaf3f22b5e2345dfb2b98edfdb746819. The original PR wasn't tested on Jenkins before being merged. Author: Cheng Lian <lian@databricks.com> Closes #10200 from liancheng/revert-pr-10002.
* [SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example codeYanbo Liang2015-12-073-0/+144
| | | | | | | | Add ```SQLTransformer``` user guide, example code and make Scala API doc more clear. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10006 from yanboliang/spark-11958.
* [SPARK-11551][DOC][EXAMPLE] Replace example code in ml-features.md using ↵somideshmukh2015-12-0751-0/+2755
| | | | | | | | | | | | | include_example Made new patch contaning only markdown examples moved to exmaple/folder. Ony three java code were not shfted since they were contaning compliation error ,these classes are 1)StandardScale 2)NormalizerExample 3)VectorIndexer Author: Xusen Yin <yinxusen@gmail.com> Author: somideshmukh <somilde@us.ibm.com> Closes #10002 from somideshmukh/SomilBranch1.33.
* [SPARK-11963][DOC] Add docs for QuantileDiscretizerXusen Yin2015-12-072-0/+120
| | | | | | | | https://issues.apache.org/jira/browse/SPARK-11963 Author: Xusen Yin <yinxusen@gmail.com> Closes #9962 from yinxusen/SPARK-11963.
* [SPARK-12084][CORE] Fix codes that uses ByteBuffer.array incorrectlyShixiong Zhu2015-12-041-1/+4
| | | | | | | | | | `ByteBuffer` doesn't guarantee all contents in `ByteBuffer.array` are valid. E.g, a ByteBuffer returned by `ByteBuffer.slice`. We should not use the whole content of `ByteBuffer` unless we know that's correct. This patch fixed all places that use `ByteBuffer.array` incorrectly. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10083 from zsxwing/bytebuffer-array.
* [SPARK-12112][BUILD] Upgrade to SBT 0.13.9Josh Rosen2015-12-052-4/+6
| | | | | | | | | | We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin). I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations. Author: Josh Rosen <joshrosen@databricks.com> Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
* [SPARK-6990][BUILD] Add Java linting script; fix minor warningsDmitry Erastov2015-12-048-20/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces https://github.com/apache/spark/pull/9696 Invoke Checkstyle and print any errors to the console, failing the step. Use Google's style rules modified according to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide Some important checks are disabled (see TODOs in `checkstyle.xml`) due to multiple violations being present in the codebase. Suggest fixing those TODOs in a separate PR(s). More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/). Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles): > Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions. > [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1 Also fix some of the minor violations that didn't require sweeping changes. Apologies for the previous botched PRs - I finally figured out the issue. cr: JoshRosen, pwendell > I state that the contribution is my original work, and I license the work to the project under the project's open source license. Author: Dmitry Erastov <derastov@gmail.com> Closes #9867 from dskrvk/master.
* [SPARK-11961][DOC] Add docs of ChiSqSelectorXusen Yin2015-12-012-0/+128
| | | | | | | | https://issues.apache.org/jira/browse/SPARK-11961 Author: Xusen Yin <yinxusen@gmail.com> Closes #9965 from yinxusen/SPARK-11961.
* [SPARK-11960][MLLIB][DOC] User guide for streaming testsFeynman Liang2015-11-301-0/+2
| | | | | | | | CC jkbradley mengxr josepablocam Author: Feynman Liang <feynman.liang@gmail.com> Closes #10005 from feynmanliang/streaming-test-user-guide.
* [SPARK-11975][ML] Remove duplicate mllib example (DT/RF/GBT in Java/Python)Yanbo Liang2015-11-306-692/+0
| | | | | | | | | | Remove duplicate mllib example (DT/RF/GBT in Java/Python). Since we have tutorial code for DT/RF/GBT classification/regression in Scala/Java/Python and example applications for DT/RF/GBT in Scala, so we mark these as duplicated and remove them. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9954 from yanboliang/SPARK-11975.
* [SPARK-11689][ML] Add user guide and example code for LDA under spark.mlYuhao Yang2015-11-302-0/+174
| | | | | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-11689 Add simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include example code in the user guide markdown. Check SPARK-11606 for instructions. Original PR is reverted due to document build error. https://github.com/apache/spark/pull/9722 mengxr feynmanliang yinxusen Sorry for the troubling. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9974 from hhbyyh/ldaMLExample.
* [SPARK-11952][ML] Remove duplicate ml examplesYanbo Liang2015-11-243-235/+0
| | | | | | | | Remove duplicate ml examples (only for ml). mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9933 from yanboliang/SPARK-11685.
* [SPARK-11920][ML][DOC] ML LinearRegression should use correct dataset in ↵Yanbo Liang2015-11-233-3/+5
| | | | | | | | | | | | examples and user guide doc ML ```LinearRegression``` use ```data/mllib/sample_libsvm_data.txt``` as dataset in examples and user guide doc, but it's actually classification dataset rather than regression dataset. We should use ```data/mllib/sample_linear_regression_data.txt``` instead. The deeper causes is that ```LinearRegression``` with "normal" solver can not solve this dataset correctly, may be due to the ill condition and unreasonable label. This issue has been reported at [SPARK-11918](https://issues.apache.org/jira/browse/SPARK-11918). It will confuse users if they run the example code but get exception, so we should make this change which can clearly illustrate the usage of ```LinearRegression``` algorithm. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9905 from yanboliang/spark-11920.
* [SPARK-11895][ML] rename and refactor DatasetExample under mllib/examplesXiangrui Meng2015-11-221-45/+26
| | | | | | | | | | We used the name `Dataset` to refer to `SchemaRDD` in 1.2 in ML pipelines and created this example file. Since `Dataset` has a new meaning in Spark 1.6, we should rename it to avoid confusion. This PR also removes support for dense format to simplify the example code. cc: yinxusen Author: Xiangrui Meng <meng@databricks.com> Closes #9873 from mengxr/SPARK-11895.
* Revert "[SPARK-11689][ML] Add user guide and example code for LDA under ↵Xiangrui Meng2015-11-202-171/+0
| | | | | | spark.ml" This reverts commit e359d5dcf5bd300213054ebeae9fe75c4f7eb9e7.
* [SPARK-11549][DOCS] Replace example code in mllib-evaluation-metrics.md ↵Vikas Nelamangala2015-11-2015-0/+1304
| | | | | | | | using include_example Author: Vikas Nelamangala <vikasnelamangala@Vikass-MacBook-Pro.local> Closes #9689 from vikasnp/master.
* [SPARK-11689][ML] Add user guide and example code for LDA under spark.mlYuhao Yang2015-11-202-0/+171
| | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-11689 Add simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include example code in the user guide markdown. Check SPARK-11606 for instructions. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9722 from hhbyyh/ldaMLExample.
* [SPARK-11816][ML] fix some style issue in ML/MLlib examplesYuhao Yang2015-11-185-3/+5
| | | | | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-11816 Currently I only fixed some obvious comments issue like // scalastyle:off println on the bottom. Yet the style in examples is not quite consistent, like only half of the examples are with // Example usage: ./bin/run-example mllib.FPGrowthExample \, Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9808 from hhbyyh/exampleStyle.
* [SPARK-11495] Fix potential socket / file handle leaks that were found via ↵Josh Rosen2015-11-181-15/+16
| | | | | | | | | | static analysis The HP Fortify Opens Source Review team (https://www.hpfod.com/open-source-review-project) reported a handful of potential resource leaks that were discovered using their static analysis tool. We should fix the issues identified by their scan. Author: Josh Rosen <joshrosen@databricks.com> Closes #9455 from JoshRosen/fix-potential-resource-leaks.
* rmse was wrongly calculatedViveka Kulharia2015-11-181-1/+1
| | | | | | | | It was multiplying with U instaed of dividing by U Author: Viveka Kulharia <vivkul@iitk.ac.in> Closes #9771 from vivkul/patch-1.
* [SPARK-11728] Replace example code in ml-ensembles.md using include_exampleXusen Yin2015-11-1714-0/+1056
| | | | | | | | | | JIRA issue https://issues.apache.org/jira/browse/SPARK-11728. The ml-ensembles.md file contains `OneVsRestExample`. Instead of writing new code files of two `OneVsRestExample`s, I use two existing files in the examples directory, they are `OneVsRestExample.scala` and `JavaOneVsRestExample.scala`. Author: Xusen Yin <yinxusen@gmail.com> Closes #9716 from yinxusen/SPARK-11728.
* [MINOR] Correct comments in JavaDirectKafkaWordCountRohan Bhanderi2015-11-171-4/+4
| | | | | | Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu> Closes #9781 from RohanBhanderi/patch-3.
* [SPARK-11729] Replace example code in ml-linear-methods.md using include_exampleXusen Yin2015-11-178-0/+483
| | | | | | | | JIRA link: https://issues.apache.org/jira/browse/SPARK-11729 Author: Xusen Yin <yinxusen@gmail.com> Closes #9713 from yinxusen/SPARK-11729.
* [EXAMPLE][MINOR] Add missing awaitTermination in click stream examplejerryshao2015-11-161-2/+2
| | | | | | Author: jerryshao <sshao@hortonworks.com> Closes #9730 from jerryshao/clickstream-fix.
* Typo in comment: use 2 seconds instead of 1Rohan Bhanderi2015-11-141-1/+1
| | | | | | | | Use 2 seconds batch size as duration specified in JavaStreamingContext constructor is 2000 ms Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu> Closes #9714 from RohanBhanderi/patch-2.
* [SPARK-11723][ML][DOC] Use LibSVM data source rather than ↵Yanbo Liang2015-11-1321-110/+65
| | | | | | | | | | | | | | | | MLUtils.loadLibSVMFile to load DataFrame Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame, include: * Use libSVM data source for all example codes under examples/ml, and remove unused import. * Use libSVM data source for user guides under ml-*** which were omitted by #8697. * Fix bug: We should use ```sqlContext.read().format("libsvm").load(path)``` at Java side, but the API doc and user guides misuse as ```sqlContext.read.format("libsvm").load(path)```. * Code cleanup. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9690 from yanboliang/spark-11723.
* [SPARK-11445][DOCS] Replaced example code in mllib-ensembles.md using ↵Rishabh Bhardwaj2015-11-1312-0/+873
| | | | | | | | | | | include_example I have made the required changes and tested. Kindly review the changes. Author: Rishabh Bhardwaj <rbnext29@gmail.com> Closes #9407 from rishabhbhardwaj/SPARK-11445.
* [SPARK-11629][ML][PYSPARK][DOC] Python example code for Multilayer ↵Yanbo Liang2015-11-123-0/+201
| | | | | | | | | | Perceptron Classification Add Python example code for Multilayer Perceptron Classification, and make example code in user guide document testable. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9594 from yanboliang/spark-11629.
* [SPARK-11663][STREAMING] Add Java API for trackStateByKeyShixiong Zhu2015-11-122-25/+22
| | | | | | | | | | | TODO - [x] Add Java API - [x] Add API tests - [x] Add a function test Author: Shixiong Zhu <shixiong@databricks.com> Closes #9636 from zsxwing/java-track.
* [SPARK-11290][STREAMING] Basic implementation of trackStateByKeyTathagata Das2015-11-101-15/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | Current updateStateByKey provides stateful processing in Spark Streaming. It allows the user to maintain per-key state and manage that state using an updateFunction. The updateFunction is called for each key, and it uses new data and existing state of the key, to generate an updated state. However, based on community feedback, we have learnt the following lessons. * Need for more optimized state management that does not scan every key * Need to make it easier to implement common use cases - (a) timeout of idle data, (b) returning items other than state The high level idea that of this PR * Introduce a new API trackStateByKey that, allows the user to update per-key state, and emit arbitrary records. The new API is necessary as this will have significantly different semantics than the existing updateStateByKey API. This API will have direct support for timeouts. * Internally, the system will keep the state data as a map/list within the partitions of the state RDDs. The new data RDDs will be partitioned appropriately, and for all the key-value data, it will lookup the map/list in the state RDD partition and create a new list/map of updated state data. The new state RDD partition will be created based on the update data and if necessary, with old data. Here is the detailed design doc. Please take a look and provide feedback as comments. https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em This is still WIP. Major things left to be done. - [x] Implement basic functionality of state tracking, with initial RDD and timeouts - [x] Unit tests for state tracking - [x] Unit tests for initial RDD and timeout - [ ] Unit tests for TrackStateRDD - [x] state creating, updating, removing - [ ] emitting - [ ] checkpointing - [x] Misc unit tests for State, TrackStateSpec, etc. - [x] Update docs and experimental tags Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #9256 from tdas/trackStateByKey.