aboutsummaryrefslogtreecommitdiff
path: root/mllib
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-14314][SPARK-14315][ML][SPARKR] Model persistence in SparkR (glm & ↵Yanbo Liang2016-04-295-64/+188
| | | | | | | | | | | | | | | kmeans) SparkR ```glm``` and ```kmeans``` model persistence. Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Author: Gayathri Murali <gayathri.m.softie@gmail.com> Closes #12778 from yanboliang/spark-14311. Closes #12680 Closes #12683
* [SPARK-14571][ML] Log instrumentation in ALSwm624@hotmail.com2016-04-292-0/+12
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add log instrumentation for parameters: rank, numUserBlocks, numItemBlocks, implicitPrefs, alpha, userCol, itemCol, ratingCol, predictionCol, maxIter, regParam, nonnegative, checkpointInterval, seed Add log instrumentation for numUserFeatures and numItemFeatures ## How was this patch tested? Manual test: Set breakpoint in intellij and run def testALS(). Single step debugging and check the log method is called. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #12560 from wangmiao1981/log.
* [SPARK-14969][MLLIB] Remove duplicate implementation of compute in ↵dding32016-04-292-13/+0
| | | | | | | | | | | | | | | | LogisticGradient ## What changes were proposed in this pull request? This PR removes duplicate implementation of compute in LogisticGradient class ## How was this patch tested? unit tests Author: dding3 <dingding@dingding-ubuntu.sh.intel.com> Closes #12747 from dding3/master.
* [SPARK-14886][MLLIB] RankingMetrics.ndcgAt throw ↵Sean Owen2016-04-292-6/+22
| | | | | | | | | | | | | | | | java.lang.ArrayIndexOutOfBoundsException ## What changes were proposed in this pull request? Handle case where number of predictions is less than label set, k in nDCG computation ## How was this patch tested? New unit test; existing tests Author: Sean Owen <sowen@cloudera.com> Closes #12756 from srowen/SPARK-14886.
* [SPARK-14829][MLLIB] Deprecate GLM APIs using SGDZheng RuiFeng2016-04-284-0/+12
| | | | | | | | | | | | ## What changes were proposed in this pull request? According to the [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829), deprecate API of LogisticRegression and LinearRegression using SGD ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12596 from zhengruifeng/deprecate_sgd.
* Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in ↵Yin Huai2016-04-281-1/+1
| | | | | | spark-mllib-local" This reverts commit dae538a4d7c36191c1feb02ba87ffc624ab960dc.
* [SPARK-14862][ML] Updated Classifiers to not require labelCol metadataJoseph K. Bradley2016-04-288-31/+245
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Updated Classifier, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier to not require input column metadata. * They first check for metadata. * If numClasses is not specified in metadata, they identify the largest label value (up to a limit). This functionality is implemented in a new Classifier.getNumClasses method. Also * Updated Classifier.extractLabeledPoints to (a) check label values and (b) include a second version which takes a numClasses value for validity checking. ## How was this patch tested? * Unit tests in ClassifierSuite for helper methods * Unit tests for DecisionTreeClassifier, RandomForestClassifier, GBTClassifier with toy datasets lacking label metadata Author: Joseph K. Bradley <joseph@databricks.com> Closes #12663 from jkbradley/trees-no-metadata.
* [SPARK-14613][ML] Add @Since into the matrix and vector classes in ↵Pravin Gadakh2016-04-281-1/+1
| | | | | | | | | | | | | | | | spark-mllib-local ## What changes were proposed in this pull request? This PR adds `since` tag into the matrix and vector classes in spark-mllib-local. ## How was this patch tested? Scala-style checks passed. Author: Pravin Gadakh <prgadakh@in.ibm.com> Closes #12416 from pravingadakh/SPARK-14613.
* [SPARK-14916][MLLIB] A more friendly tostring for FreqItemset in mllib.fpmYuhao Yang2016-04-281-0/+4
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-14916 FreqItemset as the result of FPGrowth should have a more friendly toString(), to help users and developers. sample: {a, b}: 5 {x, y, z}: 4 ## How was this patch tested? existing unit tests. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #12698 from hhbyyh/freqtos.
* [SPARK-14852][ML] refactored GLM summary into training, non-training summariesJoseph K. Bradley2016-04-282-55/+115
| | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This splits GeneralizedLinearRegressionSummary into 2 summary types: * GeneralizedLinearRegressionSummary, which does not store info from fitting (diagInvAtWA) * GeneralizedLinearRegressionTrainingSummary, which is a subclass of GeneralizedLinearRegressionSummary and stores info from fitting This also add a method evaluate() which can produce a GeneralizedLinearRegressionSummary on a new dataset. The summary no longer provides the model itself as a public val. Also: * Fixes bug where GeneralizedLinearRegressionTrainingSummary was created with model, not summaryModel. * Adds hasSummary method. * Renames findSummaryModelAndPredictionCol -> getSummaryModel and simplifies that method. * In summary, extract values from model immediately in case user later changes those (e.g., predictionCol). * Pardon the style fixes; that is IntelliJ being obnoxious. ## How was this patch tested? Existing unit tests + updated test for evaluate and hasSummary Author: Joseph K. Bradley <joseph@databricks.com> Closes #12624 from jkbradley/model-summary-api.
* [SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType ↵Liang-Chi Hsieh2016-04-284-0/+289
| | | | | | | | | | | | | | | | | | | | annotation ## What changes were proposed in this pull request? Currently we use `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add Spark dependency to user classes. For some user classes, it is unnecessary to add such dependency that will increase deployment difficulty. We should provide alternative approach to register UDTs for user classes without `SQLUserDefinedType` annotation. ## How was this patch tested? `UserDefinedTypeSuite` Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12259 from viirya/improve-sql-usertype.
* [SPARK-14671][ML] Pipeline setStages should handle subclasses of PipelineStageJoseph K. Bradley2016-04-272-2/+12
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Pipeline.setStages failed for some code examples which worked in 1.5 but fail in 1.6. This tends to occur when using a mix of transformers from ml.feature. It is because Java Arrays are non-covariant and the addition of MLWritable to some transformers means the stages0/1 arrays above are not of type Array[PipelineStage]. This PR modifies the following to accept subclasses of PipelineStage: * Pipeline.setStages() * Params.w() ## How was this patch tested? Unit test which fails to compile before this fix. Author: Joseph K. Bradley <joseph@databricks.com> Closes #12430 from jkbradley/pipeline-setstages.
* [SPARK-14899][ML][PYSPARK] Remove spark.ml HashingTF hashingAlg optionYanbo Liang2016-04-273-55/+22
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks behavior of ```HashingTF```, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone. ## How was this patch tested? Unit tests. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12702 from yanboliang/spark-14899.
* [SPARK-9656][MLLIB][PYTHON] Add missing methods to PySpark's Distributed ↵Mike Dusenberry2016-04-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Linear Algebra Classes This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows: * `RowMatrix` <sup>**[1]**</sup> 1. `computeGramianMatrix` 2. `computeCovariance` 3. `computeColumnSummaryStatistics` 4. `columnSimilarities` 5. `tallSkinnyQR` <sup>**[2]**</sup> * `IndexedRowMatrix` <sup>**[3]**</sup> 1. `computeGramianMatrix` * `CoordinateMatrix` 1. `transpose` * `BlockMatrix` 1. `validate` 2. `cache` 3. `persist` 4. `transpose` **[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227. **[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion. **[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227. Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
* [SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian ↵Joseph K. Bradley2016-04-262-37/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | in mllib-local ## What changes were proposed in this pull request? Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API. This was added after 1.6, so we can modify this API without breaking APIs. This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes: * Renamed fields to match numpy, scipy: mu => mean, sigma => cov This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves: * Modifying the constructor * Adding a computeProbabilities method Also: * Added EPSILON to mllib-local for use in MultivariateGaussian ## How was this patch tested? Existing unit tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12593 from jkbradley/sparkml-gmm-fix.
* [SPARK-12301][ML] Made all tree and ensemble classes not finalJoseph K. Bradley2016-04-268-16/+16
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? There have been continuing requests (e.g., SPARK-7131) for allowing users to extend and modify MLlib models and algorithms. This PR makes tree and ensemble classes, Node types, and Split types in spark.ml no longer final. This matches most other spark.ml algorithms. Constructors for models are still private since we may need to refactor how stats are maintained in tree nodes. ## How was this patch tested? Existing unit tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12711 from jkbradley/final-trees.
* [SPARK-14907][MLLIB] Use repartition in GLMRegressionModel.saveDongjoon Hyun2016-04-261-4/+1
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR changes `GLMRegressionModel.save` function like the following code that is similar to other algorithms' parquet write. ``` - val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF() - // TODO: repartition with 1 partition after SPARK-5532 gets fixed - dataRDD.write.parquet(Loader.dataPath(path)) + sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path)) ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12676 from dongjoon-hyun/SPARK-14907.
* [SPARK-11559][MLLIB] Make `runs` no effect in mllib.KMeansYanbo Liang2016-04-262-32/+11
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? We deprecated ```runs``` of mllib.KMeans in Spark 1.6 (SPARK-11358). In 2.0, we will make it no effect (with warning messages). We did not remove ```setRuns/getRuns``` for better binary compatibility. This PR change `runs` which are appeared at the public API. Usage inside of ```KMeans.runAlgorithm()``` will be resolved at #10806. ## How was this patch tested? Existing unit tests. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12608 from yanboliang/spark-11559.
* [MINOR] Follow-up to #12625Andrew Or2016-04-261-4/+4
| | | | | | | | | | ## What changes were proposed in this pull request? That patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes. Author: Andrew Or <andrew@databricks.com> Closes #12686 from andrewor14/visibility.
* [SPARK-14912][SQL] Propagate data source options to Hadoop configurationReynold Xin2016-04-261-5/+5
| | | | | | | | | | | | ## What changes were proposed in this pull request? We currently have no way for users to propagate options to the underlying library that rely in Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set it. This patch propagates the user-specified options also into Hadoop Configuration. ## How was this patch tested? Used a mock data source implementation to test both the read path and the write path. Author: Reynold Xin <rxin@databricks.com> Closes #12688 from rxin/SPARK-14912.
* [SPARK-14313][ML][SPARKR] AFTSurvivalRegression model persistence in SparkRYanbo Liang2016-04-262-3/+51
| | | | | | | | | | | | ## What changes were proposed in this pull request? ```AFTSurvivalRegressionModel``` supports ```save/load``` in SparkR. ## How was this patch tested? Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #12685 from yanboliang/spark-14313.
* [SPARK-13962][ML] spark.ml Evaluators should support other numeric types for ↵BenFradet2016-04-268-51/+88
| | | | | | | | | | | | | | | | label ## What changes were proposed in this pull request? Made BinaryClassificationEvaluator, MulticlassClassificationEvaluator and RegressionEvaluator accept all numeric types for label ## How was this patch tested? Unit tests Author: BenFradet <benjamin.fradet@gmail.com> Closes #12500 from BenFradet/SPARK-13962.
* [SPARK-14861][SQL] Replace internal usages of SQLContext with SparkSessionAndrew Or2016-04-2514-51/+52
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore. In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait. **Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little. ## How was this patch tested? No change in functionality intended. Author: Andrew Or <andrew@databricks.com> Closes #12625 from andrewor14/spark-session-refactor.
* [SPARK-14312][ML][SPARKR] NaiveBayes model persistence in SparkRYanbo Liang2016-04-252-3/+94
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? SparkR ```NaiveBayesModel``` supports ```save/load``` by the following API: ``` df <- createDataFrame(sqlContext, infert) model <- naiveBayes(education ~ ., df, laplace = 0) ml.save(model, path) model2 <- ml.load(path) ``` ## How was this patch tested? Add unit tests. cc mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #12573 from yanboliang/spark-14312.
* [SPARK-10574][ML][MLLIB] HashingTF supports MurmurHash3Yanbo Liang2016-04-253-21/+129
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? As the discussion at [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and make it as the default hash algorithm. We should also expose set/get API for ```hashAlgorithm```, then users can choose the hash method. Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in the followup work. ## How was this patch tested? unit tests. cc jkbradley MLnick Author: Yanbo Liang <ybliang8@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #12498 from yanboliang/spark-10574.
* [SPARK-14433][PYSPARK][ML] PySpark ml GaussianMixturewm624@hotmail.com2016-04-251-2/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add Python API in ML for GaussianMixture ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Add doctest and test cases are the same as mllib Python tests ./dev/lint-python PEP8 checks passed. rm -rf _build/* pydoc checks passed. ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log Will test against the following Python executables: ['python2.7'] Will test the following Python modules: ['pyspark-ml'] Finished test(python2.7): pyspark.ml.evaluation (18s) Finished test(python2.7): pyspark.ml.clustering (40s) Finished test(python2.7): pyspark.ml.classification (49s) Finished test(python2.7): pyspark.ml.recommendation (44s) Finished test(python2.7): pyspark.ml.feature (64s) Finished test(python2.7): pyspark.ml.regression (45s) Finished test(python2.7): pyspark.ml.tuning (30s) Finished test(python2.7): pyspark.ml.tests (56s) Tests passed in 106 seconds Author: wm624@hotmail.com <wm624@hotmail.com> Closes #12402 from wangmiao1981/gmm.
* [SPARK-14758][ML] Add checking for StepSize and TolZheng RuiFeng2016-04-251-2/+4
| | | | | | | | | | | | ## What changes were proposed in this pull request? add the checking for StepSize and Tol in sharedParams ## How was this patch tested? Unit tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12530 from zhengruifeng/ml_args_checking.
* [SPARK-14868][BUILD] Enable NewLineAtEofChecker in checkstyle and fix ↵Dongjoon Hyun2016-04-243-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | lint-java errors ## What changes were proposed in this pull request? Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java code also comply with the rule. This PR aims to enforce the same rule `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java errors since SPARK-14465. The followings are the items. - Adds a new line at the end of the files (19 files) - Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder) ## How was this patch tested? After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins dose not run lint-java.) ```bash $ dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12632 from dongjoon-hyun/SPARK-14868.
* [MINOR][ML][MLLIB] Remove unused importsZheng RuiFeng2016-04-2213-16/+11
| | | | | | | | | | | | ## What changes were proposed in this pull request? del unused imports in ML/MLLIB ## How was this patch tested? unit tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12497 from zhengruifeng/del_unused_imports.
* [SPARK-14843][ML] Fix encoding error in LibSVMRelationLiang-Chi Hsieh2016-04-232-5/+13
| | | | | | | | | | | | | ## What changes were proposed in this pull request? We use `RowEncoder` in libsvm data source to serialize the label and features read from libsvm files. However, the schema passed in this encoder is not correct. As the result, we can't correctly select `features` column from the DataFrame. We should use full data schema instead of `requiredSchema` to serialize the data read in. Then do projection to select required columns later. ## How was this patch tested? `LibSVMRelationSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12611 from viirya/fix-libsvm.
* [MINOR][DOC] Fix doc style in ml.ann.Layer and MultilayerPerceptronClassifierZheng RuiFeng2016-04-222-40/+40
| | | | | | | | | | | | | ## What changes were proposed in this pull request? 1, fix the indentation 2, add a missing param desc ## How was this patch tested? unit tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12499 from zhengruifeng/fix_doc.
* [SPARK-6429] Implement hashCode and equals togetherJoan2016-04-223-6/+22
| | | | | | | | | | | ## What changes were proposed in this pull request? Implement some `hashCode` and `equals` together in order to enable the scalastyle. This is a first batch, I will continue to implement them but I wanted to know your thoughts. Author: Joan <joan@goyeau.com> Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
* [SPARK-14479][ML] GLM supports output link predictionYanbo Liang2016-04-212-34/+108
| | | | | | | | | | | ## What changes were proposed in this pull request? GLM supports output link prediction. ## How was this patch tested? unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #12287 from yanboliang/spark-14479.
* [SPARK-14734][ML][MLLIB] Added asML, fromML methods for all spark.mllib ↵Joseph K. Bradley2016-04-214-2/+135
| | | | | | | | | | | | | | | | | Vector, Matrix types ## What changes were proposed in this pull request? For maintaining wrappers around spark.mllib algorithms in spark.ml, it will be useful to have ```private[spark]``` methods for converting from one linear algebra representation to another. This PR adds toNew, fromNew methods for all spark.mllib Vector and Matrix types. ## How was this patch tested? Unit tests for all conversions Author: Joseph K. Bradley <joseph@databricks.com> Closes #12504 from jkbradley/linalg-conversions.
* [SPARK-14569][ML] Log instrumentation in KMeansXin Ren2016-04-213-6/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14569 Log instrumentation in KMeans: - featuresCol - predictionCol - k - initMode - initSteps - maxIter - seed - tol - summary ## How was this patch tested? Manually test on local machine, by running and checking output of org.apache.spark.examples.ml.KMeansExample Author: Xin Ren <iamshrek@126.com> Closes #12432 from keypointt/SPARK-14569.
* [SPARK-14478][ML][MLLIB][DOC] Doc that StandardScaler uses the corrected ↵Joseph K. Bradley2016-04-202-0/+10
| | | | | | | | | | | | | | | | | | sample std ## What changes were proposed in this pull request? Currently, MLlib's StandardScaler scales columns using the corrected standard deviation (sqrt of unbiased variance). This matches what R's scale package does. This PR documents this fact. ## How was this patch tested? doc only Author: Joseph K. Bradley <joseph@databricks.com> Closes #12519 from jkbradley/scaler-variance-doc.
* [SPARK-14687][CORE][SQL][MLLIB] Call path.getFileSystem(conf) instead of ↵Liwei Lin2016-04-202-3/+5
| | | | | | | | | | | | | | | | call FileSystem.get(conf) ## What changes were proposed in this pull request? - replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)` ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #12450 from lw-lin/fix-fs-get.
* [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into ↵Cheng Lian2016-04-191-1/+1
| | | | | | | | | | | | | | | | | | | execution/datasources package #12178 ## What changes were proposed in this pull request? This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package. Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
* [SPARK-14564][ML][MLLIB][PYSPARK] Python Word2Vec missing setWindowSize methodJason Lee2016-04-181-1/+4
| | | | | | | | | | | | ## What changes were proposed in this pull request? Added windowSize getter/setter to ML/MLlib ## How was this patch tested? Added test cases in tests.py under both ML and MLlib Author: Jason Lee <cjlee@us.ibm.com> Closes #12428 from jasoncl/SPARK-14564.
* [SPARK-14306][ML][PYSPARK] PySpark ml.classification OneVsRest support ↵Xusen Yin2016-04-181-0/+7
| | | | | | | | | | | | | | | | | | export/import ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14306 Add PySpark OneVsRest save/load supports. ## How was this patch tested? Test with Python unit test. Author: Xusen Yin <yinxusen@gmail.com> Closes #12439 from yinxusen/SPARK-14306-0415.
* [MINOR] Remove inappropriate type notation and extra anonymous closure ↵hyukjinkwon2016-04-162-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | within functional transformations ## What changes were proposed in this pull request? This PR removes - Inappropriate type notations For example, from ```scala words.foreachRDD { (rdd: RDD[String], time: Time) => ... ``` to ```scala words.foreachRDD { (rdd, time) => ... ``` - Extra anonymous closure within functional transformations. For example, ```scala .map(item => { ... }) ``` which can be just simply as below: ```scala .map { item => ... } ``` and corrects some obvious style nits. ## How was this patch tested? This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly. The rules applied were below: - For the first correction, ```xml <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true"> <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters> </check> ``` ```xml <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true"> <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters> </check> ``` - For the second correction ```xml <check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true"> <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters> </check> ``` **Those rules were not added** Author: hyukjinkwon <gurwls223@gmail.com> Closes #12413 from HyukjinKwon/SPARK-style.
* [SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm ↵Yanbo Liang2016-04-151-6/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for more family and link functions ## What changes were proposed in this pull request? Expose R-like summary statistics in SparkR::glm for more family and link functions. Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed, we only provide the most commonly used statistics in this PR. More statistics can be added in the followup work. ## How was this patch tested? Unit tests. SparkR Output: ``` Deviance Residuals: (Note: These are approximate quantiles with relative error <= 0.01) Min 1Q Median 3Q Max -0.95096 -0.16585 -0.00232 0.17410 0.72918 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.6765 0.23536 7.1231 4.4561e-11 Sepal_Length 0.34988 0.046301 7.5566 4.1873e-12 Species_versicolor -0.98339 0.072075 -13.644 0 Species_virginica -1.0075 0.093306 -10.798 0 (Dispersion parameter for gaussian family taken to be 0.08351462) Null deviance: 28.307 on 149 degrees of freedom Residual deviance: 12.193 on 146 degrees of freedom AIC: 59.22 Number of Fisher Scoring iterations: 1 ``` R output: ``` Deviance Residuals: Min 1Q Median 3Q Max -0.95096 -0.16522 0.00171 0.18416 0.72918 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.67650 0.23536 7.123 4.46e-11 *** Sepal.Length 0.34988 0.04630 7.557 4.19e-12 *** Speciesversicolor -0.98339 0.07207 -13.644 < 2e-16 *** Speciesvirginica -1.00751 0.09331 -10.798 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for gaussian family taken to be 0.08351462) Null deviance: 28.307 on 149 degrees of freedom Residual deviance: 12.193 on 146 degrees of freedom AIC: 59.217 Number of Fisher Scoring iterations: 2 ``` cc mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #12393 from yanboliang/spark-13925.
* [SPARK-14370][MLLIB] removed duplicate generation of ids in OnlineLDAOptimizerPravin Gadakh2016-04-152-11/+10
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Removed duplicated generation of `ids` in OnlineLDAOptimizer. ## How was this patch tested? tested with existing unit tests. Author: Pravin Gadakh <prgadakh@in.ibm.com> Closes #12176 from pravingadakh/SPARK-14370.
* [SPARK-14549][ML] Copy the Vector and Matrix classes from mllib to ml in ↵DB Tsai2016-04-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mllib-local ## What changes were proposed in this pull request? This task will copy the Vector and Matrix classes from mllib to ml package in mllib-local jar. The UDTs and `since` annotation in ml vector and matrix will be removed from now. UDTs will be achieved by #SPARK-14487, and `since` will be replaced by /* since 1.2.0 */ The BLAS implementation will be copied, and some of the test utilities will be copies as well. Summary of changes: 1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala - logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in ml version. 2. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala - `Since` was removed, and we'll use standard `/* Since /*` Java doc. Will be in another PR. - `UDT` related code was removed, and will use `SPARK-13944` https://github.com/apache/spark/pull/12259 to replace the annotation. 3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala - `Since` was removed. - `UDT` related code was removed. - In `def parseNumeric`, it was throwing `throw new SparkException(s"Cannot parse $other.")`, and now it's throwing `throw new IllegalArgumentException(s"Cannot parse $other.")` 4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala - For consistency with ML version of vector, `def parseNumeric` is now throwing `throw new IllegalArgumentException(s"Cannot parse $other.")` 5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala - All the `throw new SparkException` were replaced by `throw new IllegalArgumentException` ## How was this patch tested? unit tests Author: DB Tsai <dbt@netflix.com> Closes #12317 from dbtsai/dbtsai-ml-vector.
* [SPARK-12869] Implemented an improved version of the toIndexedRowMatrixFokko Driesprong2016-04-142-8/+57
| | | | | | | | | | | | | | Hi guys, I've implemented an improved version of the `toIndexedRowMatrix` function on the `BlockMatrix`. I needed this for a project, but would like to share it with the rest of the community. In the case of dense matrices, it can increase performance up to 19 times: https://github.com/Fokko/BlockMatrixToIndexedRowMatrix If there are any questions or suggestions, please let me know. Keep up the good work! Cheers. Author: Fokko Driesprong <f.driesprong@catawiki.nl> Author: Fokko Driesprong <fokko@driesprongen.nl> Closes #10839 from Fokko/master.
* [SPARK-14565][ML] RandomForest should use parseInt and parseDouble for ↵Yong Tang2016-04-144-13/+25
| | | | | | | | | | | | | | | | | feature subset size instead of regexes ## What changes were proposed in this pull request? This fix tries to change RandomForest's supported strategies from using regexes to using parseInt and parseDouble, for the purpose of robustness and maintainability. ## How was this patch tested? Existing tests passed. Author: Yong Tang <yong.tang.github@outlook.com> Closes #12360 from yongtang/SPARK-14565.
* [SPARK-14618][ML][DOC] Updated RegressionEvaluator.metricName param docJoseph K. Bradley2016-04-141-4/+5
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? In Spark 1.4, we negated some metrics from RegressionEvaluator since CrossValidator always maximized metrics. This was fixed in 1.5, but the docs were not updated. This PR updates the docs. ## How was this patch tested? no tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12377 from jkbradley/regeval-doc.
* [SPARK-14612][ML] Consolidate the version of dependencies in mllib and ↵Sean Owen2016-04-141-13/+0
| | | | | | | | | | | | | | | | mllib-local into one place ## What changes were proposed in this pull request? Move json4s, breeze dependency declaration into parent ## How was this patch tested? Should be no functional change, but Jenkins tests will test that. Author: Sean Owen <sowen@cloudera.com> Closes #12390 from srowen/SPARK-14612.
* [SPARK-14375][ML] Unit test for spark.ml KMeansSummaryYanbo Liang2016-04-133-8/+47
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? * Modify ```KMeansSummary.clusterSizes``` method to make it robust to empty clusters. * Add unit test for spark.ml ```KMeansSummary```. * Add Since tag. ## How was this patch tested? unit tests. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12254 from yanboliang/spark-14375.
* [SPARK-14461][ML] GLM training summaries should provide solverYanbo Liang2016-04-132-3/+11
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? GLM training summaries should provide solver. ## How was this patch tested? Unit tests. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12253 from yanboliang/spark-14461.