diff options
author | Xiangrui Meng <meng@databricks.com> | 2015-05-04 11:28:59 -0700 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2015-05-04 11:28:59 -0700 |
commit | e0833c5958bbd73ff27cfe6865648d7b6e5a99bc (patch) | |
tree | 373883fa46f206ffcd34c4d0b67ce246b61bbc93 /python/pyspark | |
parent | 5a1a1075a607be683f008ef92fa227803370c45f (diff) | |
download | spark-e0833c5958bbd73ff27cfe6865648d7b6e5a99bc.tar.gz spark-e0833c5958bbd73ff27cfe6865648d7b6e5a99bc.tar.bz2 spark-e0833c5958bbd73ff27cfe6865648d7b6e5a99bc.zip |
[SPARK-5956] [MLLIB] Pipeline components should be copyable.
This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:
~~~scala
def fit(dataset: DataFrame, extra: ParamMap): Model = {
copy(extra).fit(dataset)
}
~~~
Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:
~~~scala
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
~~~
Meta-algorithm like `Pipeline` implements its own `copy(extra)`. So the fitted pipeline model stored all copied stages (no matter whether it is a transformer or a model).
Other changes:
* `Params$.inheritValues` is moved to `Params!.copyValues` and returns the target instance.
* `fittingParamMap` was removed because the `parent` carries this information.
* `validate` was renamed to `validateParams` to be more precise.
TODOs:
* [x] add tests for newly added methods
* [ ] update documentation
jkbradley dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #5820 from mengxr/SPARK-5956 and squashes the following commits:
7bef88d [Xiangrui Meng] address comments
05229c3 [Xiangrui Meng] assert -> assertEquals
b2927b1 [Xiangrui Meng] organize imports
f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
93e7924 [Xiangrui Meng] add tests for hasParam & copy
463ecae [Xiangrui Meng] merge master
2b954c3 [Xiangrui Meng] update Binarizer
465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
282a1a8 [Xiangrui Meng] fix test
819dd2d [Xiangrui Meng] merge master
b642872 [Xiangrui Meng] example code runs
5a67779 [Xiangrui Meng] examples compile
c76b4d1 [Xiangrui Meng] fix all unit tests
0f4fd64 [Xiangrui Meng] fix some tests
9286a22 [Xiangrui Meng] copyValues to trained models
53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
d882afc [Xiangrui Meng] test compile
f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
Diffstat (limited to 'python/pyspark')
0 files changed, 0 insertions, 0 deletions