path: root/mllib/src
Commit message | Author | Date | Files | Lines
* [MINOR][MLLIB] move setCheckpointInterval to non-expert setters | Xiangrui Meng | 2016-06-21 | 1 | -1/+1
## What changes were proposed in this pull request?
The `checkpointInterval` is a non-expert param. This PR moves its setter to the non-expert group.
Author: Xiangrui Meng <meng@databricks.com>
Closes #13813 from mengxr/checkpoint-non-expert.
* [SPARK-15177][.1][R] make SparkR model params and default values consistent with MLlib | Xiangrui Meng | 2016-06-21 | 2 | -6/+6
## What changes were proposed in this pull request?
This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.
Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0
## How was this patch tested?
Existing unit tests.
Author: Xiangrui Meng <meng@databricks.com>
Closes #13801 from mengxr/SPARK-15177.1.
* [SPARK-10258][DOC][ML] Add @Since annotations to ml.feature | Nick Pentreath | 2016-06-21 | 27 | -63/+357
This PR adds missing `Since` annotations to the `ml.feature` package. Closes #8505.
## How was this patch tested?
Existing tests.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #13641 from MLnick/add-since-annotations.
* [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API | Xiangrui Meng | 2016-06-20 | 3 | -0/+93
## What changes were proposed in this pull request?
Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simplify the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection.
## How was this patch tested?
Unit tests in Scala and Java.
Author: Xiangrui Meng <meng@databricks.com>
Closes #13789 from mengxr/SPARK-16074.
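A minimal sketch of what this enables, assuming the public entry point is `SQLDataTypes.VectorType` as added here (the helper object and column names are illustrative):
```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{StructField, StructType}

object VectorSchemaCheck {
  // A third-party transformSchema can now validate vector columns
  // against the public type instead of reaching for reflection.
  def transformSchema(schema: StructType, inputCol: String, outputCol: String): StructType = {
    val actual = schema(inputCol).dataType
    require(actual == SQLDataTypes.VectorType,
      s"Column $inputCol must be a vector column but was $actual.")
    StructType(schema.fields :+ StructField(outputCol, SQLDataTypes.VectorType, nullable = false))
  }
}
```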
* [SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame (Python) | Xiangrui Meng | 2016-06-17 | 1 | -0/+14
## What changes were proposed in this pull request?
This PR implements Python wrappers for #13662 to convert old/new vector columns in a DataFrame.
## How was this patch tested?
doctest in Python
cc: yanboliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #13731 from mengxr/SPARK-15946.
* [SPARK-16008][ML] Remove unnecessary serialization in logistic regression | sethah | 2016-06-17 | 1 | -28/+29
JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008)
## What changes were proposed in this pull request?
`LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are serialized before the combine op, unnecessarily. This results in the shuffle write being ~3x larger than it should be (for multiclass logistic regression this factor will go up; in MLlib, for instance, it is 3x smaller). This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters, which avoids the serialization.
## How was this patch tested?
I tested this locally and verified the serialization reduction.
![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)
Additionally, I ran some tests on a 4 node cluster (4x48 cores, 4x128 GB RAM). A data set of 2M rows and 10k features showed >2x iteration speedup.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #13729 from sethah/lr_improvement.
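A simplified sketch of the pattern (the class name and the gradient arithmetic are placeholders, not the actual `LogisticAggregator` code): the large read-only arrays live in broadcast variables and are passed into `add` per call, so only the mutable state is serialized with the aggregator.
```scala
class GradientAggregator(numFeatures: Int) extends Serializable {
  // Mutable state that genuinely needs to travel with the aggregator.
  private val gradientSum = new Array[Double](numFeatures)
  private var count = 0L

  // coefficients and featuresStd come from broadcast variables on the
  // executor; taking them as parameters keeps them out of this object's
  // serialized form during the shuffle.
  def add(features: Array[Double], coefficients: Array[Double],
      featuresStd: Array[Double]): this.type = {
    var i = 0
    while (i < numFeatures) {
      // Placeholder arithmetic standing in for the real gradient update.
      if (featuresStd(i) != 0.0) {
        gradientSum(i) += features(i) * coefficients(i) / featuresStd(i)
      }
      i += 1
    }
    count += 1L
    this
  }
}
```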
* [SPARK-15922][MLLIB] `toIndexedRowMatrix` should consider the case `cols < offset+colsPerBlock` | Dongjoon Hyun | 2016-06-16 | 2 | -1/+6
## What changes were proposed in this pull request?
SPARK-15922 reports the following scenario throwing an exception due to the mismatched vector sizes. This PR handles the exceptional case, `cols < (offset + colsPerBlock)`.
**Before**
```scala
scala> import org.apache.spark.mllib.linalg.distributed._
scala> import org.apache.spark.mllib.linalg._
scala> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))) :: IndexedRow(2L, new DenseVector(Array(1,2,3))) :: Nil
scala> val rdd = sc.parallelize(rows)
scala> val matrix = new IndexedRowMatrix(rdd, 3, 3)
scala> val bmat = matrix.toBlockMatrix
scala> val imat = bmat.toIndexedRowMatrix
scala> imat.rows.collect
...
// java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length!
```
**After**
```scala
...
scala> imat.rows.collect
res0: Array[org.apache.spark.mllib.linalg.distributed.IndexedRow] = Array(IndexedRow(0,[1.0,2.0,3.0]), IndexedRow(1,[1.0,2.0,3.0]), IndexedRow(2,[1.0,2.0,3.0]))
```
## How was this patch tested?
Pass the Jenkins tests (including the above case).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13643 from dongjoon-hyun/SPARK-15922.
* [SPARK-15983][SQL] Removes FileFormat.prepareRead | Cheng Lian | 2016-06-16 | 1 | -16/+17
## What changes were proposed in this pull request?
Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source. However, the semantics of this interface method aren't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean.
## How was this patch tested?
Existing tests.
Author: Cheng Lian <lian@databricks.com>
Closes #13698 from liancheng/remove-prepare-read.
* [SPARK-15979][SQL] Rename various Parquet support classes. | Reynold Xin | 2016-06-15 | 1 | -5/+1
## What changes were proposed in this pull request?
This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons:
1. These are not optimizer related (i.e. Catalyst) classes.
2. We are in the Spark code base, and as a result it'd be more clear to call out that these are Parquet support classes, rather than some Spark classes.
## How was this patch tested?
Renamed test cases as well.
Author: Reynold Xin <rxin@databricks.com>
Closes #13696 from rxin/parquet-rename.
* [DOCS] Fix Gini and Entropy scaladocs in context of multiclass classification | Wojciech Jurczyk | 2016-06-15 | 2 | -3/+2
The PR changes outdated scaladocs for the Gini and Entropy classes. Since PR #886 Spark supports multiclass classification, but the docs only describe binary classification.
Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com>
Closes #11252 from wjur/wjur/docs_multiclass.
* [SPARK-15945][MLLIB] Conversion between old/new vector columns in a DataFrame (Scala/Java) | Xiangrui Meng | 2016-06-14 | 3 | -8/+218
## What changes were proposed in this pull request?
This PR provides conversion utils between old/new vector columns in a DataFrame, so users can use them to migrate their datasets and pipelines manually. The methods are implemented under `MLUtils` and called `convertVectorColumnsToML` and `convertVectorColumnsFromML`. Both take a DataFrame and a list of vector columns to be converted. It is a no-op on vector columns that are already converted. A warning message is logged if actual conversion happens. This is the first sub-task under SPARK-15944 to make it easier to migrate existing pipelines to Spark 2.0.
## How was this patch tested?
Unit tests in Scala and Java.
cc: yanboliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #13662 from mengxr/SPARK-15945.
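A short usage sketch (assuming `df` is a DataFrame holding an old-style `mllib.linalg.Vector` column named `features`):
```scala
import org.apache.spark.mllib.util.MLUtils

// Convert the listed columns to the new ml.linalg.Vector type; columns
// that are already converted are passed through unchanged.
val mlDF = MLUtils.convertVectorColumnsToML(df, "features")

// And back, e.g. for code paths still on the RDD-based API.
val mllibDF = MLUtils.convertVectorColumnsFromML(mlDF, "features")
```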
* [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python | Liang-Chi Hsieh | 2016-06-13 | 3 | -241/+364
## What changes were proposed in this pull request?
Now we have PySpark picklers for the new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers for the new vector/matrix under `spark.ml.python` instead.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes #13219 from viirya/pyspark-pickler-ml.
* [SPARK-15892][ML] Incorrectly merged AFTAggregator with zero total count | hyukjinkwon | 2016-06-12 | 2 | -1/+13
## What changes were proposed in this pull request?
Currently, `AFTAggregator` is not being merged correctly. For example, if there is any single empty partition in the data, this creates an `AFTAggregator` with zero total count, which causes the exception below:
```
IllegalArgumentException: u'requirement failed: The number of instances should be greater than 0.0, but got 0.'
```
Please see [AFTSurvivalRegression.scala#L573-L575](https://github.com/apache/spark/blob/6ecedf39b44c9acd58cdddf1a31cf11e8e24428c/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala#L573-L575) as well. To be clear, the Python example `aft_survival_regression.py` uses 5 rows, so if there are more than 5 partitions, it throws the exception above, since the empty partitions result in an incorrectly merged `AFTAggregator`. Executing `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py` on a machine with more than 5 CPU cores fails because it creates tasks with some empty partitions under default configurations (AFAIK, the parallelism level is set to the number of CPU cores).
## How was this patch tested?
A unit test in `AFTSurvivalRegressionSuite.scala` and manual testing with `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py`.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes #13619 from HyukjinKwon/SPARK-15892.
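A simplified sketch of the fix pattern (an illustrative class, not the actual `AFTAggregator`): merges skip aggregators that saw no instances, so empty partitions can no longer corrupt the combined state.
```scala
class LossAggregator extends Serializable {
  var count = 0L
  var lossSum = 0.0

  def add(loss: Double): this.type = {
    count += 1L
    lossSum += loss
    this
  }

  // Guard: an aggregator produced from an empty partition has count 0
  // and carries no information, so it must not affect the merge result.
  def merge(other: LossAggregator): this.type = {
    if (other.count != 0L) {
      count += other.count
      lossSum += other.lossSum
    }
    this
  }
}
```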
* [SPARK-15654][SQL] fix non-splitable files for text based file formats | Davies Liu | 2016-06-10 | 1 | -1/+1
## What changes were proposed in this pull request?
Currently, we always split files bigger than maxSplitBytes, but Hadoop's LineRecordReader does not respect the splits for compressed files correctly. We should have an API for FileFormat to check whether a file can be split or not, as sketched below. This PR is based on #13442; closes #13442.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes #13531 from davies/fix_split.
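A sketch of the kind of check such an API enables (the signature is simplified here; the real `FileFormat` hook also receives Spark-specific arguments): a file is splittable when it is uncompressed or uses a splittable codec such as bzip2.
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

object SplitCheck {
  // Gzip yields a non-null, non-splittable codec, so a file like
  // data.json.gz is read as a single partition; plain data.json splits.
  def isSplitable(hadoopConf: Configuration, path: Path): Boolean = {
    val codec = new CompressionCodecFactory(hadoopConf).getCodec(path)
    codec == null || codec.isInstanceOf[SplittableCompressionCodec]
  }
}
```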
* [SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0 | wangyang | 2016-06-10 | 1 | -1/+1
## What changes were proposed in this pull request?
In Scala, immutable.List.length is an expensive operation, so we should avoid using Seq.length == 0 or Seq.length > 0 and use Seq.isEmpty and Seq.nonEmpty instead.
## How was this patch tested?
Existing tests.
Author: wangyang <wangyang@haizhi.com>
Closes #13601 from yangw1234/isEmpty.
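For example:
```scala
val xs: Seq[Int] = List.fill(1000000)(0)

// length traverses the entire linked list: O(n).
val slow = xs.length > 0

// nonEmpty/isEmpty only look at the head: O(1).
val fast = xs.nonEmpty
```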
* [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to Scala API | Bryan Cutler | 2016-06-10 | 2 | -2/+14
## What changes were proposed in this pull request?
Adds __str__ to RFormula and its model, showing the set formula param and the resolved formula. This is present in the Scala API and was found missing in PySpark during the Spark 2.0 coverage review.
## How was this patch tested?
Ran pyspark-ml tests locally.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
* [SPARK-15793][ML] Add maxSentenceLength for ml.Word2Vec | yinxusen | 2016-06-08 | 2 | -0/+20
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-15793
Word2Vec in the ML package should have a maxSentenceLength method for feature parity.
## How was this patch tested?
Tested with a Spark unit test.
Author: yinxusen <yinxusen@gmail.com>
Closes #13536 from yinxusen/SPARK-15793.
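Usage of the new param on the estimator (column names are illustrative); sentences longer than `maxSentenceLength` words are divided into chunks of up to that size before training:
```scala
import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(100)
  .setMaxSentenceLength(500) // chunk sentences longer than 500 words
```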
* [SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference | Yanbo Liang | 2016-06-07 | 3 | -1/+22
## What changes were proposed in this pull request?
When fitting a ```LinearRegressionModel``` (with the "l-bfgs" solver) or a ```LogisticRegressionModel``` without intercept on a dataset with a constant nonzero column, spark.ml produces the same model as R glmnet but a different one from LIBSVM. When fitting an ```AFTSurvivalRegressionModel``` without intercept on a dataset with a constant nonzero column, spark.ml produces a different model compared with R survival::survreg. We should output a warning message and clarify this condition in the documentation.
## How was this patch tested?
Documentation change, no unit test.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12731 from yanboliang/spark-13590.
* [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public | Joseph K. Bradley | 2016-06-06 | 1 | -3/+41
## What changes were proposed in this pull request?
Made DefaultParamsReadable and DefaultParamsWritable public. Also added relevant docs and annotations. Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable/Writable.
## How was this patch tested?
Wrote an example making use of the now-public APIs. Compiled and ran it locally.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #13461 from jkbradley/defaultparamswritable.
* [MINOR] Fix Typos 'an -> a' | Zheng RuiFeng | 2016-06-06 | 9 | -10/+10
## What changes were proposed in this pull request?
`an -> a`
Used commands like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and reviewed them one by one.
## How was this patch tested?
Manual tests.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #13515 from zhengruifeng/an_a.
* [SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics | Josh Rosen | 2016-06-05 | 1 | -1/+1
`PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten`, because it performs many unnecessary object allocations. Simply replacing this `foldLeft` with a `flatMap` yields decent performance gains when constructing PartitionStatistics instances for tables with many columns. This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #13491 from JoshRosen/foldleft-to-flatmap.
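A small illustration of the change:
```scala
val lists: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3), Seq(4, 5))

// Before: every ++ copies the accumulated prefix, so flattening n
// lists is quadratic in the total number of elements.
val slow = lists.foldLeft(Seq.empty[Int])(_ ++ _)

// After: a single pass with no intermediate concatenations.
val fast = lists.flatten // equivalently, lists.flatMap(identity)
```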
* [SPARK-15770][ML] Annotation audit for Experimental and DeveloperApi | Zheng RuiFeng | 2016-06-05 | 13 | -3/+50
## What changes were proposed in this pull request?
1. Remove `:: Experimental ::` comments from non-experimental APIs.
2. Add `:: Experimental ::` comments to experimental APIs.
3. Add `:: DeveloperApi ::` comments to DeveloperApi APIs.
## How was this patch tested?
Manual tests.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #13514 from zhengruifeng/del_experimental.
* [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score | Ruifeng Zheng | 2016-06-04 | 2 | -8/+6
## What changes were proposed in this pull request?
1. Remove precision and recall from `ml.MulticlassClassificationEvaluator`.
2. Update the user guide for `mllib.weightedFMeasure`.
## How was this patch tested?
Local build.
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Closes #13390 from zhengruifeng/clarify_f1.
* [SPARK-15494][SQL] encoder code cleanup | Wenchen Fan | 2016-06-03 | 1 | -1/+1
## What changes were proposed in this pull request?
Our encoder framework has evolved a lot. This PR cleans up the code to make it more readable and to emphasise the concept that an encoder should be used as a container of serde expressions.
1. Move validation logic to the analyzer instead of the encoder.
2. Only have a `resolveAndBind` method in the encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
3. `Dataset` doesn't need to keep a resolved encoder, as there is no such concept anymore. A bound encoder is still needed to do serialization outside of the query framework.
4. Using `BoundReference` to represent an unresolved field in a deserializer expression is kind of weird; this PR adds a `GetColumnByOrdinal` for this purpose. (Serializer expressions still use `BoundReference`; we can replace it with `GetColumnByOrdinal` in follow-ups.)
## How was this patch tested?
Existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes #13269 from cloud-fan/clean-encoder.
* [SPARK-15740][MLLIB] ignore big model load / save in Word2VecSuite | Xiangrui Meng | 2016-06-02 | 1 | -1/+1
## What changes were proposed in this pull request?
andrewor14 noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because the JVM was OOMed. This PR disables the test. I will leave the JIRA open for a proper fix.
## How was this patch tested?
No new features.
Author: Xiangrui Meng <meng@databricks.com>
Closes #13478 from mengxr/SPARK-15740.
* [SPARK-15668][ML] ml.feature: update check schema to avoid confusion when users use MLlib.Vector as input type | Yuhao Yang | 2016-06-02 | 4 | -36/+25
## What changes were proposed in this pull request?
ml.feature: update the schema checks to avoid confusion when users use an MLlib Vector as the input type.
## How was this patch tested?
Existing unit tests.
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes #13411 from hhbyyh/schemaCheck.
* [MINOR] clean up style for storage param setters in ALS | Nick Pentreath | 2016-06-02 | 1 | -6/+2
Clean up the style of the param setter methods in ALS to match the standard style and the other setters in the class (this is an artefact of one of my previous PRs that wasn't cleaned up).
## How was this patch tested?
Existing tests - no functionality change.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #13480 from MLnick/als-param-minor-cleanup.
* [SPARK-15587][ML] ML 2.0 QA: Scala APIs audit for ml.feature | Yanbo Liang | 2016-06-01 | 4 | -16/+10
## What changes were proposed in this pull request?
ML 2.0 QA: Scala APIs audit for ml.feature. Main changes:
* Remove seed for ```QuantileDiscretizer```, since we use ```approxQuantile``` to produce bins and ```seed``` is useless.
* Scala API docs update.
* Sync Scala and Python API docs for these changes.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13410 from yanboliang/spark-15587.
* [SPARK-15664][MLLIB] Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib | Lianhui Wang | 2016-06-01 | 5 | -20/+31
## What changes were proposed in this pull request?
If the SparkContext's checkpoint dir is set to a directory on a FileSystem other than the default one, removing checkpoint files in MLlib throws an exception. We should always get the FileSystem from the Path to avoid the wrong-FS problem, as sketched below.
## How was this patch tested?
N/A
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes #13408 from lianhuiwang/SPARK-15664.
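The fix pattern, in a hedged sketch (the path is illustrative):
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val checkpointFile = new Path("hdfs://other-nn:8020/checkpoints/rdd-42")

// Before: FileSystem.get resolves the default FileSystem from the conf;
// deleting a path that lives on a different FS then fails with "Wrong FS".
val defaultFs = FileSystem.get(hadoopConf)

// After: resolve the FileSystem from the path itself, so any scheme
// (another HDFS, s3a, file, ...) works.
val fs = checkpointFile.getFileSystem(hadoopConf)
fs.delete(checkpointFile, true)
```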
* [SPARK-15618][SQL][MLLIB] Use SparkSession.builder.sparkContext if applicable. | Dongjoon Hyun | 2016-05-31 | 22 | -51/+49
## What changes were proposed in this pull request?
This PR changes the function `SparkSession.builder.sparkContext(..)` from **private[sql]** to **private[spark]**, and uses it where applicable, like the following:
```
- val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
+ val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
```
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13365 from dongjoon-hyun/SPARK-15618.
* [MINOR] Resolve a number of miscellaneous build warnings | Sean Owen | 2016-05-29 | 2 | -1/+5
## What changes were proposed in this pull request?
This change resolves a number of build warnings that have accumulated before 2.x. It does not address a large number of deprecation warnings, especially those related to the Accumulator API. That will happen separately.
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes #13377 from srowen/BuildWarnings.
* [SPARK-15610][ML] update error message for k in pca | Zheng RuiFeng | 2016-05-27 | 2 | -4/+3
## What changes were proposed in this pull request?
Fix the wrong bound on `k` in `PCA`: `require(k <= sources.first().size, ...)` -> `require(k < sources.first().size, ...)`. BTW, remove an unused import in `ml.ElementwiseProduct`.
## How was this patch tested?
Manual tests.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #13356 from zhengruifeng/fix_pca.
* [SPARK-15413][ML][MLLIB] Change `toBreeze` to `asBreeze` in Vector and Matrix | DB Tsai | 2016-05-27 | 41 | -125/+125
## What changes were proposed in this pull request?
We're using `asML` to convert the mllib vector/matrix to the ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underlying data structure. As a result, in this PR, `toBreeze` will be changed to `asBreeze`. This is a private API; as a result, it will not affect any user's application.
## How was this patch tested?
Unit tests.
Author: DB Tsai <dbt@netflix.com>
Closes #13198 from dbtsai/minor.
* [SPARK-11959][SPARK-15484][DOC][ML] Document WLS and IRLS | Yanbo Liang | 2016-05-27 | 1 | -1/+1
## What changes were proposed in this pull request?
* Document ```WeightedLeastSquares``` (normal equation) and ```IterativelyReweightedLeastSquares```.
* Copy the ```L-BFGS``` documents from ```spark.mllib``` to ```spark.ml```.
Since the section ```Optimization of linear methods``` is aimed at developers, I think we should provide a brief introduction to each optimization method, the necessary references, and how it is implemented in Spark. It's not necessary to paste all the mathematical formulas and derivations here; if developers/users want to learn more, they can follow the references.
## How was this patch tested?
Documentation update, no tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13262 from yanboliang/spark-15484.
* [HOTFIX] Scala 2.10 compile GaussianMixtureModel | Andrew Or | 2016-05-27 | 1 | -1/+1
* [SPARK-15584][SQL] Abstract duplicate code: `spark.sql.sources.` properties | Dongjoon Hyun | 2016-05-27 | 1 | -1/+2
## What changes were proposed in this pull request?
This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables.
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13349 from dongjoon-hyun/SPARK-15584.
* [SPARK-15603][MLLIB] Replace SQLContext with SparkSession in ML/MLLib | Dongjoon Hyun | 2016-05-27 | 23 | -195/+160
## What changes were proposed in this pull request?
This PR replaces all deprecated `SQLContext` occurrences with `SparkSession` in the `ML/MLLib` module, except for the following two classes, which use `SQLContext` in their function signatures:
- ReadWrite.scala
- TreeModels.scala
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13352 from dongjoon-hyun/SPARK-15603.
* [MINOR] Fix Typos 'a -> an' | Zheng RuiFeng | 2016-05-26 | 7 | -8/+8
## What changes were proposed in this pull request?
`a` -> `an`
I used a regex to generate potential error lines, `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`, and reviewed them line by line.
## How was this patch tested?
Local build; `lint-java` checking.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #13317 from zhengruifeng/a_an.
* [SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use SparkSession.builder.getOrCreate | Yin Huai | 2016-05-26 | 1 | -1/+1
## What changes were proposed in this pull request?
This PR changes SQLContext/HiveContext's public constructors to use SparkSession.builder.getOrCreate and removes isRootContext from SQLContext.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes #13310 from yhuai/SPARK-15532.
* [SPARK-15457][MLLIB][ML] Eliminate some warnings from MLlib about deprecations | Sean Owen | 2016-05-26 | 7 | -39/+44
## What changes were proposed in this pull request?
Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
* WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
* Use in PythonMLlibAPI: Change to using private constructors.
* Streaming algs: No warnings after we un-deprecate the classes.
* Examples: Deprecate or change ones which use deprecated APIs.
* MulticlassMetrics fields (precision, etc.)
* LinearRegressionSummary.model field
## How was this patch tested?
Existing tests. Checked for warnings manually.
Author: Sean Owen <sowen@cloudera.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #13314 from jkbradley/warning-cleanups.
* [SPARK-15543][SQL] Rename DefaultSources to make them more self-describing | Reynold Xin | 2016-05-25 | 2 | -3/+7
## What changes were proposed in this pull request?
This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names. They are now named:
- LibSVMFileFormat
- CSVFileFormat
- JdbcRelationProvider
- JsonFileFormat
- ParquetFileFormat
- TextFileFormat
Backward compatibility is maintained through aliasing.
## How was this patch tested?
Updated relevant test cases too.
Author: Reynold Xin <rxin@databricks.com>
Closes #13311 from rxin/SPARK-15543.
* Log warnings for numIterations * miniBatchFraction < 1.0 | Gio Borje | 2016-05-25 | 1 | -0/+5
## What changes were proposed in this pull request?
Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then 3 iterations will occur, each sampling approximately 20 examples. In the best case, all 60 sampled examples are distinct; hence at most 60/100 of the examples are used. This may be counter-intuitive to most users and led to an issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not require the full training data set, it would be easier and more intuitive to use `RDD.sample` directly. A sketch of the added check follows below.
## How was this patch tested?
`build/mvn -DskipTests clean package` build succeeds.
Author: Gio Borje <gborje@linkedin.com>
Closes #13265 from Hydrotoast/master.
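A sketch of the added check (variable names are illustrative):
```scala
val numExamples = 100
val numIterations = 3
val miniBatchFraction = 0.2

// The expected fraction of distinct examples ever sampled is at most
// numIterations * miniBatchFraction = 0.6, i.e. at most ~60 of the
// 100 examples can be touched during optimization.
if (numIterations * miniBatchFraction < 1.0) {
  println("Not all examples will be used if numIterations * miniBatchFraction < 1.0")
}
```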
* [SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS | Nick Pentreath | 2016-05-25 | 1 | -2/+2
Remove "Default: MEMORY_AND_DISK" from the `Param` doc field in the ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice. We can revisit this if [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.
Tests N/A.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
* [MINOR][MLLIB][STREAMING][SQL] Fix typos | lfzCarlosC | 2016-05-25 | 2 | -2/+2
Fixed typos in source code for the [mllib], [streaming] and [SQL] components. All changes are minor and obvious.
Author: lfzCarlosC <lfz.carlos@gmail.com>
Closes #13298 from lfzCarlosC/master.
* [SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark QuantileDiscretizer | Nick Pentreath | 2016-05-24 | 1 | -5/+8
This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala. Also cleaned up a duplication of `numBuckets`, where the param was both a class and an instance attribute (I removed the instance attribute to match the style of params throughout `ml`). Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`.
## How was this patch tested?
A little doctest, and built API docs locally to check HTML doc generation.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #13228 from MLnick/SPARK-15442-py-relerror-param.
* [SPARK-15339][ML] ML 2.0 QA: Scala APIs and code audit for regression | Yanbo Liang | 2016-05-19 | 5 | -47/+58
## What changes were proposed in this pull request?
* ```GeneralizedLinearRegression``` API docs enhancement.
* The default value of ```GeneralizedLinearRegression``` ```linkPredictionCol``` is not set rather than empty. This is consistent with other similar params such as ```weightCol```.
* Make some methods more private.
* Fix a minor bug of LinearRegression.
* Fix some other issues.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13129 from yanboliang/spark-15339.
* [SPARK-15075][SPARK-15345][SQL] Clean up SparkSession builder and propagate config options to existing sessions if specified | Reynold Xin | 2016-05-19 | 1 | -1/+1
## What changes were proposed in this pull request?
Currently SparkSession.Builder uses SQLContext.getOrCreate. It should probably be the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that. This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.
## How was this patch tested?
Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.
Author: Reynold Xin <rxin@databricks.com>
Closes #13200 from rxin/SPARK-15075.
* [SPARK-15296][MLLIB] Refactor All Java Tests that use SparkSession | Sandeep Singh | 2016-05-19 | 58 | -1147/+206
## What changes were proposed in this pull request?
Refactor all Java tests that use SparkSession to extend SharedSparkSession.
## How was this patch tested?
Existing tests.
Author: Sandeep Singh <sandeep@techaddict.me>
Closes #13101 from techaddict/SPARK-15296.
* [MINOR][ML][PYSPARK] ml.evaluation Scala and Python API sync | Yanbo Liang | 2016-05-19 | 1 | -1/+1
## What changes were proposed in this pull request?
```ml.evaluation``` Scala and Python API sync.
## How was this patch tested?
Only API docs change, no new tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13195 from yanboliang/evaluation-doc.
* [SPARK-15341][DOC][ML] Add documentation for "model.write" to clarify "summary" was not saved | Yanbo Liang | 2016-05-19 | 5 | -2/+23
## What changes were proposed in this pull request?
Currently in ```model.write```, we don't save ```summary``` (if applicable). We should add documentation to clarify this. We also fixed the incorrect link ```[[MLWriter]]```, changing it to ```[[org.apache.spark.ml.util.MLWriter]]```.
## How was this patch tested?
Documentation update, no unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13131 from yanboliang/spark-15341.