author | Sean Owen <sowen@cloudera.com> | 2014-12-31 13:37:04 -0800 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2014-12-31 13:37:04 -0800 |
commit | 3d194cc75761fceba77b2c91291b36479b8b556c (patch) | |
tree | 64f44f3965c52baf79b78944f498afc01028d697 /project/spark-style/src/main | |
parent | 8e14c5eb551ab06c94859c7f6d8c6b62b4d00d59 (diff) | |
SPARK-4547 [MLLIB] OOM when making bins in BinaryClassificationMetrics
Now that I've implemented the basics here, I'm less convinced that this change is needed. Callers can downsample before or after. The OOM is not really in the ROC curve code, but in code that might `collect()` the curve for local analysis. Still, downsampling might be useful, since the ROC curve probably never needs millions of points.
This is a first pass. Since the `(score,label)` pairs are already grouped and sorted, I think it's sufficient to take every Nth such pair in order to downsample by a factor of N. This is just like retaining every Nth point on the curve, which I think is the goal. All of the data is still used to build the curve, of course.
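To make the idea concrete, here is a minimal sketch of "build the curve from all of the data, then keep every Nth point". It is not the code in this commit: plain Scala collections stand in for RDDs, and the names `DownsampleSketch`, `cumulativeCounts`, and `everyNth` are hypothetical, chosen only for illustration. It also assumes a label greater than 0.5 counts as positive.

```scala
// Illustrative only: hypothetical helpers, plain collections instead of RDDs.
object DownsampleSketch {

  /** One (score, cumulative true positives, cumulative false positives)
    * triple per distinct score, ordered from highest score to lowest.
    * Assumption: a label above 0.5 is treated as positive.
    */
  def cumulativeCounts(
      scoreAndLabels: Seq[(Double, Double)]): Seq[(Double, Long, Long)] = {
    // Group by score and count positives and negatives at each score.
    val perScore = scoreAndLabels
      .groupBy(_._1)
      .toSeq
      .map { case (score, pairs) =>
        val pos = pairs.count(_._2 > 0.5).toLong
        (score, pos, pairs.size.toLong - pos)
      }
      .sortWith(_._1 > _._1) // descending by score

    // Running totals use every pair, so no data is discarded at this stage.
    perScore
      .scanLeft((Double.NaN, 0L, 0L)) { case ((_, tp, fp), (score, pos, neg)) =>
        (score, tp + pos, fp + neg)
      }
      .tail // drop the seed element
  }

  /** Keep the first element of every run of n consecutive points. */
  def everyNth[A](points: Seq[A], n: Int): Seq[A] = {
    require(n >= 1, "n must be at least 1")
    points.grouped(n).map(_.head).toList
  }
}
```

For example, six `(score, label)` pairs with five distinct scores produce five cumulative points, and `everyNth(points, 2)` keeps the first, third, and fifth of them. One design caveat of this sketch: the last point of the curve survives only when the number of points happens to line up with `n`, so a caller who needs the exact endpoint would have to append it separately.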
What do you think about the API, and usefulness?
Author: Sean Owen <sowen@cloudera.com>
Closes #3702 from srowen/SPARK-4547 and squashes the following commits:
1d34d05 [Sean Owen] Indent and reorganize numBins scaladoc
692d825 [Sean Owen] Change handling of large numBins, make 2nd constructor instead of optional param, style change
a03610e [Sean Owen] Add downsamplingFactor to BinaryClassificationMetrics
Diffstat (limited to 'project/spark-style/src/main')
0 files changed, 0 insertions, 0 deletions