| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes #13377 from srowen/BuildWarnings.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Fix the wrong bound of `k` in `PCA`
`require(k <= sources.first().size, ...` -> `require(k < sources.first().size`
BTW, remove unused import in `ml.ElementwiseProduct`
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #13356 from zhengruifeng/fix_pca.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
We're using `asML` to convert the mllib vector/matrix to ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underline data structure. As a result, in this PR, `toBreeze` will be changed to `asBreeze`. This is a private API, as a result, it will not affect any user's application.
## How was this patch tested?
unit tests
Author: DB Tsai <dbt@netflix.com>
Closes #13198 from dbtsai/minor.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
* Document ```WeightedLeastSquares```(normal equation) and ```IterativelyReweightedLeastSquares```.
* Copy ```L-BFGS``` documents from ```spark.mllib``` to ```spark.ml```.
Due to the session ```Optimization of linear methods``` is used for developers, I think we should provide the brief introduction of the optimization method, necessary references and how it implements in Spark. It's not necessary to paste all mathematical formula and derivation here. If developers/users want to learn more, they can track reference.
## How was this patch tested?
Document update, no tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13262 from yanboliang/spark-15484.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables.
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13349 from dongjoon-hyun/SPARK-15584.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This PR replaces all deprecated `SQLContext` occurrences with `SparkSession` in `ML/MLLib` module except the following two classes. These two classes use `SQLContext` in their function signatures.
- ReadWrite.scala
- TreeModels.scala
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13352 from dongjoon-hyun/SPARK-15603.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
`a` -> `an`
I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.
## How was this patch tested?
local build
`lint-java` checking
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #13317 from zhengruifeng/a_an.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
SparkSession.build.getOrCreate
## What changes were proposed in this pull request?
This PR changes SQLContext/HiveContext's public constructor to use SparkSession.build.getOrCreate and removes isRootContext from SQLContext.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes #13310 from yhuai/SPARK-15532.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
* WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
* Use in PythonMLlibAPI: Change to using private constructors
* Streaming algs: No warnings after we un-deprecate the classes
* Examples: Deprecate or change ones which use deprecated APIs
* MulticlassMetrics fields (precision, etc.)
* LinearRegressionSummary.model field
## How was this patch tested?
Existing tests. Checked for warnings manually.
Author: Sean Owen <sowen@cloudera.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #13314 from jkbradley/warning-cleanups.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
See https://issues.apache.org/jira/browse/SPARK-15523
This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes.
## How was this patch tested?
1. Executed `mvn clean package` in `mllib` directory
2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory.
Author: Villu Ruusmann <villu.ruusmann@gmail.com>
Closes #13297 from vruusmann/update-jpmml.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.
They are now named:
- LibSVMFileFormat
- CSVFileFormat
- JdbcRelationProvider
- JsonFileFormat
- ParquetFileFormat
- TextFileFormat
Backward compatibility is maintained through aliasing.
## How was this patch tested?
Updated relevant test cases too.
Author: Reynold Xin <rxin@databricks.com>
Closes #13311 from rxin/SPARK-15543.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Add a warning log for the case that `numIterations * miniBatchFraction <1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then, 3 iterations will occur each sampling approximately 6 examples each. In the best case, each of the 6 examples are unique; hence 18/100 examples are used.
This may be counter-intuitive to most users and led to the issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not require the training data set, it would be easier and more intuitive to use `RDD.sample`.
## How was this patch tested?
`build/mvn -DskipTests clean package` build succeeds
Author: Gio Borje <gborje@linkedin.com>
Closes #13265 from Hydrotoast/master.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.
We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.
Tests N/A.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
|
|
|
|
|
|
|
|
|
|
| |
fixed typos for source code for components [mllib] [streaming] and [SQL]
None and obvious.
Author: lfzCarlosC <lfz.carlos@gmail.com>
Closes #13298 from lfzCarlosC/master.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
QuantileDiscretizer
This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala.
Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`).
Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`.
## How was this patch tested?
A little doctest and built API docs locally to check HTML doc generation.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #13228 from MLnick/SPARK-15442-py-relerror-param.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
* ```GeneralizedLinearRegression``` API docs enhancement.
* The default value of ```GeneralizedLinearRegression``` ```linkPredictionCol``` is not set rather than empty. This will consistent with other similar params such as ```weightCol```
* Make some methods more private.
* Fix a minor bug of LinearRegression.
* Fix some other issues.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13129 from yanboliang/spark-15339.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
config options to existing sessions if specified
## What changes were proposed in this pull request?
Currently SparkSession.Builder use SQLContext.getOrCreate. It should probably the the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that.
This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.
## How was this patch tested?
Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.
Author: Reynold Xin <rxin@databricks.com>
Closes #13200 from rxin/SPARK-15075.
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Refactor All Java Tests that use SparkSession, to extend SharedSparkSesion
## How was this patch tested?
Existing Tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes #13101 from techaddict/SPARK-15296.
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
```ml.evaluation``` Scala and Python API sync.
## How was this patch tested?
Only API docs change, no new tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13195 from yanboliang/evaluation-doc.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
"summary" was not saved
## What changes were proposed in this pull request?
Currently in ```model.write```, we don't save ```summary```(if applicable). We should add documentation to clarify it.
We fixed the incorrect link ```[[MLWriter]]``` to ```[[org.apache.spark.ml.util.MLWriter]]``` BTW.
## How was this patch tested?
Documentation update, no unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13131 from yanboliang/spark-15341.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Open up APIs for converting between new, old linear algebra types (in spark.mllib.linalg):
`Sparse`/`Dense` X `Vector`/`Matrices` `.asML` and `.fromML`
## How was this patch tested?
Existing Tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes #13202 from techaddict/SPARK-15414.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Audit Scala API for ml.clustering.
Fix some wrong API documentations and update outdated one.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13148 from yanboliang/spark-15361.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Add since to ml.stat.MultivariateOnlineSummarizer.scala
## How was this patch tested?
unit tests
Author: DB Tsai <dbt@netflix.com>
Closes #13197 from dbtsai/cleanup.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Audit Scala API for classification, almost all issues were related ```MultilayerPerceptronClassifier``` in this section.
* Fix one wrong param getter function: ```getOptimizer``` -> ```getSolver```
* Add missing setter function for ```solver``` and ```stepSize```.
* Make ```GD``` solver take effect.
* Update docs, annotations and fix other minor issues.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13076 from yanboliang/spark-15292.
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
[SPARK-14646](https://issues.apache.org/jira/browse/SPARK-14646) makes ```KMeansModel``` store the cluster centers one per row. ```KMeansModel.load()``` method needs to be updated in order to load models saved with Spark 1.6.
## How was this patch tested?
Since ```save/load``` is ```Experimental``` for 1.6, I think offline test for backwards compatibility is enough.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13149 from yanboliang/spark-15362.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
I reviewed Scala and Python APIs for ml.feature and corrected discrepancies.
## How was this patch tested?
Built docs locally, ran style checks
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #13159 from BryanCutler/ml.feature-api-sync.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema.
## How was this patch tested?
new tests in `DatasetSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes #13008 from cloud-fan/row-encoder.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR adds schema validation to `ml`'s ALS and ALSModel. Currently, no schema validation was performed as `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, due to no schema validation, if users passed in Long (or Float etc) ids, they would be silently cast to Int with no warning or error thrown.
With this PR, ALS now supports all numeric types for `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently) - however for user/item, the cast throws an error if the value is outside integer range. Behavior for rating col is unchanged (as it is not an issue).
## How was this patch tested?
New test cases in `ALSSuite`.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #12762 from MLnick/SPARK-14891-als-validate-schema.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
mateiz srowen
I state that the contribution is my original work and that I license the work to the project under the project's open source license
There's some format problems with my last PR, with HyukjinKwon 's help I read the guidance, re-check my code and PR, then run the tests, finally re-submit the PR request here.
The related JIRA issue though marked as resolved, this change may relate to it I think.
## Proposed Change
After picking each new initial centers, it's unnecessary to compute the distances between all the points and the old ones.
Instead this change keeps the distance between all the points and their closest centers, and compare to the distance of them with the new center then update them.
## Test result
One can find an easy test way in (https://issues.apache.org/jira/browse/SPARK-6706)
I test the KMeans++ method for a small dataset with 16k points, and the whole KMeans|| with a large one with 240k points.
The data has 4096 features and I tunes K from 100 to 500.
The test environment was on my 4 machine cluster, I also tested a 3M points data on a larger cluster with 25 machines and got similar results, which I would not draw the detail curve. The result of the first two exps are shown below
### Local KMeans++ test:
Dataset:4m_ini_center
Data_size:16234
Dimension:4096
Lloyd's Iteration = 10
The y-axis is time in sec, the x-axis is tuning the K.
![image](https://cloud.githubusercontent.com/assets/10915169/15175831/d0c92b82-179a-11e6-8b68-4e165fc2fdff.png)
![local_total](https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg)
### On a larger dataset
An improve show in the graph but not commit in this file: In this experiment I also have an improvement for calculation in normalization data (the distance is convert to the cosine distance). As if the data is normalized into (0,1), one improvement in the original vesion for util.MLUtils.fastSauaredDistance would have no effect (the precisionBound 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON) will never less then precision in this case). Therefore I design an early terminal method when comparing two distance (used for findClosest). But I don't include this improve in this file, you may only refer to the curves without "normalize" for comparing the results.
Dataset:4k24
Data_size:243960
Dimension:4096
Normlize Enlarge Initialize Lloyd's_Iteration
NO 1 3 5
YES 10000 3 5
Notice: the normlized data is enlarged to ensure precision
The cost time: x-for value of K, y-for time in sec
![4k24_total](https://cloud.githubusercontent.com/assets/10915169/15176635/9a54c0bc-179e-11e6-81c5-238e0c54bce2.jpg)
SE for unnormalized data between two version, to ensure the correctness
![4k24_unnorm_se](https://cloud.githubusercontent.com/assets/10915169/15176661/b85dabc8-179e-11e6-9269-fe7d2101dd48.jpg)
Here is the SE between normalized data just for reference, it's also correct.
![4k24_norm_se](https://cloud.githubusercontent.com/assets/10915169/15176742/1fbde940-179f-11e6-8290-d24b0dd4a4f7.jpg)
Author: DLucky <mouendless@gmail.com>
Closes #13133 from mouendless/patch-2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
accumulatorV2 in spark project
## What changes were proposed in this pull request?
I use Intellj-IDEA to search usage of deprecate SparkContext.accumulator in the whole spark project, and update the code.(except those test code for accumulator method itself)
## How was this patch tested?
Exisiting unit tests
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #13112 from WeichenXu123/update_accuV2_in_mllib.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
dataset.registerTempTable
## What changes were proposed in this pull request?
Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.
## How was this patch tested?
This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.
Author: Sean Zhong <seanzhong@databricks.com>
Closes #13098 from clockfly/spark-15171-remove-deprecation.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
algorithms
## What changes were proposed in this pull request?
Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.
## How was this patch tested?
Unit tests
Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #12627 from dbtsai/SPARK-14615-NewML.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
According to the recent change, this PR replaces all the remaining `sqlContext` usage with `spark` in ScalaDoc/JavaDoc (.scala/.java files) except `SQLContext.scala`, `SparkPlan.scala', and `DatasetHolder.scala`.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13125 from dongjoon-hyun/minor_doc_sparksession.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
spark-tags
## What changes were proposed in this pull request?
(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)
Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes #13074 from srowen/SPARK-15290.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
1,Rename matrix args in BreezeUtil to upper to match the doc
2,Fix several typos in ML and SQL
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #13078 from zhengruifeng/fix_ann.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
fail if no valid labels are found
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
Throw better exception when numClasses is empty and empty.max is thrown.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Add a new unit test, which calls histogram with empty numClasses.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #12969 from wangmiao1981/logisticR.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
default compression for ORC
## What changes were proposed in this pull request?
Currently, Parquet, JSON and CSV data sources have a class for thier options, (`ParquetOptions`, `JSONOptions` and `CSVOptions`).
It is convenient to manage options for sources to gather options into a class. Currently, `JDBC`, `Text`, `libsvm` and `ORC` datasources do not have this class. This might be nicer if these options are in a unified format so that options can be added and
This PR refactors the options in Spark internal data sources adding new classes, `OrcOptions`, `TextOptions`, `JDBCOptions` and `LibSVMOptions`.
Also, this PR change the default compression codec for ORC from `NONE` to `SNAPPY`.
## How was this patch tested?
Existing tests should cover this for refactoring and unittests in `OrcHadoopFsRelationSuite` for changing the default compression codec for ORC.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #13048 from HyukjinKwon/SPARK-15267.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
Add accuracy to MulticlassMetrics class and add corresponding code in MulticlassClassificationEvaluator.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Scala Unit tests in ml.evaluation
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #12882 from wangmiao1981/accuracy.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
numeric types for label
## What changes were proposed in this pull request?
Made ChiSqSelector and RFormula accept all numeric types for label
## How was this patch tested?
Unit tests
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #12467 from BenFradet/SPARK-13961.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This patch adds a python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs.
## How was this patch tested?
Added a unit test to `pyspark.ml.tests`
Author: sethah <seth.hendrickson16@gmail.com>
Closes #12961 from sethah/GLR_summary.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView.
## How was this patch tested?
Unit tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes #12945 from clockfly/spark-15171.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
We have a private `UDTRegistration` API to register user defined type. Currently `JavaTypeInference` can't work with it. So `SparkSession.createDataFrame` from a bean class will not correctly infer the schema of the bean class.
## How was this patch tested?
`VectorUDTSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes #13046 from viirya/fix-udt-registry-javatypeinference.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Scala/Java TestSuites
## What changes were proposed in this pull request?
Use SparkSession instead of SQLContext in Scala/Java TestSuites
as this PR already very big working Python TestSuites in a diff PR.
## How was this patch tested?
Existing tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes #12907 from techaddict/SPARK-15037.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
size mismatch happened in LogisticRegression
## What changes were proposed in this pull request?
Explicitly tell user initial coefficients is ignored if its size doesn't match expected size in LogisticRegression
## How was this patch tested?
local build
Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>
Closes #12948 from dding3/master.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14814
fix a java compatibility function in mllib DecisionTreeModel. As synced in jira, other compatibility issues don't need fixes.
## How was this patch tested?
existing ut
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #12971 from hhbyyh/javacompatibility.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
We need to use `requiredSchema` in `LibSVMRelation` to project the fetch required columns when loading data from this data source. Otherwise, when users try to select `features` column, it will cause failure.
## How was this patch tested?
`LibSVMRelationSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes #12986 from viirya/fix-libsvmrelation.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Stop Words Remover
## What changes were proposed in this pull request?
This PR continues the work from #11871 with the following changes:
* load English stopwords as default
* covert stopwords to list in Python
* update some tests and doc
## How was this patch tested?
Unit tests.
Closes #11871
cc: burakkose srowen
Author: Burak Köse <burakks41@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Burak KOSE <burakks41@gmail.com>
Closes #12843 from mengxr/SPARK-14050.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Minor doc and code style fixes
## How was this patch tested?
local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes #12928 from jaceklaskowski/SPARK-15152.
|