path: root/project/MimaExcludes.scala
author    DB Tsai <dbt@netflix.com>  2015-09-15 15:46:47 -0700
committer Xiangrui Meng <meng@databricks.com>  2015-09-15 15:46:47 -0700
commit    be52faa7c72fb4b95829f09a7dc5eb5dccd03524 (patch)
tree      1fd30de5fdcf31c013774dca0ae06b834992900e /project/MimaExcludes.scala
parent    31a229aa739b6d05ec6d91b820fcca79b6b7d6fe (diff)
[SPARK-7685] [ML] Apply weights to different samples in Logistic Regression
In a fraud detection dataset, almost all of the samples are negative while only a couple of them are positive. This kind of highly imbalanced data biases the model toward the negative class, resulting in poor performance. scikit-learn provides a correction that lets users over-/undersample the samples of each class according to given weights; in auto mode, it selects weights inversely proportional to the class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of actually over-/undersampling the training dataset, which is very expensive. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

On the other hand, some training data may be more important than the rest: for example, training samples from tenured users may matter more than samples from new users. We should be able to provide an additional "weight: Double" field in the LabeledPoint to weight samples differently in the learning algorithm.

Author: DB Tsai <dbt@netflix.com>
Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com>

Closes #7884 from dbtsai/SPARK-7685.
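The idea described above — folding per-sample weights into the loss and gradient rather than physically resampling the data — can be sketched in a few lines. This is an illustrative NumPy sketch, not Spark's internal implementation; the function names (`weighted_logistic_loss_grad`, `balanced_weights`) and the {-1, +1} label convention are assumptions made here for clarity.

```python
import numpy as np

def weighted_logistic_loss_grad(w, X, y, sample_weight):
    """Logistic loss and gradient with per-sample weights folded in.

    Multiplying each sample's contribution by its weight is equivalent in
    expectation to over-/undersampling, but costs no extra data copies.
    Labels y are assumed to be in {-1, +1}.
    """
    margins = -y * (X @ w)
    # Stable computation of sum_i w_i * log(1 + exp(-y_i * x_i . w))
    loss = np.sum(sample_weight * np.logaddexp(0.0, margins))
    # d/dw: sum_i w_i * sigmoid(margin_i) * (-y_i) * x_i
    sigma = 1.0 / (1.0 + np.exp(-margins))
    grad = X.T @ (sample_weight * (-y) * sigma)
    return loss, grad

def balanced_weights(y):
    """scikit-learn-style 'auto' weights: inversely proportional to the
    class frequencies in the training set, so the minority class gets
    proportionally larger weights."""
    classes, counts = np.unique(y, return_counts=True)
    freq = dict(zip(classes, counts))
    n = len(y)
    return np.array([n / (len(classes) * freq[label]) for label in y])
```

With these weights, a rare positive class contributes as much total loss as the abundant negative class, which is the same correction the over-/undersampling approach achieves at far higher cost.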
Diffstat (limited to 'project/MimaExcludes.scala')
-rw-r--r--  project/MimaExcludes.scala | 10
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index 87b141cd3b..46026c1e90 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -45,7 +45,15 @@ object MimaExcludes {
excludePackage("org.apache.spark.sql.execution")
) ++
MimaBuild.excludeSparkClass("streaming.flume.FlumeTestUtils") ++
- MimaBuild.excludeSparkClass("streaming.flume.PollingFlumeTestUtils")
+ MimaBuild.excludeSparkClass("streaming.flume.PollingFlumeTestUtils") ++
+ Seq(
+ ProblemFilters.exclude[MissingMethodProblem](
+ "org.apache.spark.ml.classification.LogisticCostFun.this"),
+ ProblemFilters.exclude[MissingMethodProblem](
+ "org.apache.spark.ml.classification.LogisticAggregator.add"),
+ ProblemFilters.exclude[MissingMethodProblem](
+ "org.apache.spark.ml.classification.LogisticAggregator.count")
+ )
case v if v.startsWith("1.5") =>
Seq(
MimaBuild.excludeSparkPackage("network"),