author: Sean Owen <sowen@cloudera.com> 2014-12-31 13:37:04 -0800
committer: Xiangrui Meng <meng@databricks.com> 2014-12-31 13:37:04 -0800
commit: 3d194cc75761fceba77b2c91291b36479b8b556c
tree: 64f44f3965c52baf79b78944f498afc01028d697
parent: 8e14c5eb551ab06c94859c7f6d8c6b62b4d00d59
SPARK-4547 [MLLIB] OOM when making bins in BinaryClassificationMetrics
Now that I've implemented the basics here, I'm less convinced that this change is needed. Callers can downsample before or after. The OOM is really not in the ROC curve code, but in code that might `collect()` the curve for local analysis. Still, downsampling might be useful, since the ROC curve probably never needs millions of points.

This is a first pass. Since the `(score,label)` pairs are already grouped and sorted, I think it's sufficient to take every Nth such pair in order to downsample by a factor of N. This is just like retaining every Nth point on the curve, which I think is the goal. All of the data is still used to build the curve, of course.

What do you think about the API, and its usefulness?

Author: Sean Owen <sowen@cloudera.com>

Closes #3702 from srowen/SPARK-4547 and squashes the following commits:

1d34d05 [Sean Owen] Indent and reorganize numBins scaladoc
692d825 [Sean Owen] Change handling of large numBins, make 2nd constructor instead of optional param, style change
a03610e [Sean Owen] Add downsamplingFactor to BinaryClassificationMetrics
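The every-Nth-pair idea described in the message can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions, not the code in this commit: the names `roc_points` and `downsample` are hypothetical, labels are assumed to be 0 or 1 with both classes present, and re-appending the final point so the curve still ends at (1, 1) is an addition of this sketch.

```python
from itertools import accumulate


def roc_points(score_label_pairs):
    """Return (FPR, TPR) points, one per distinct score, highest score first.

    Hypothetical helper; assumes labels are 0 or 1 and both classes occur.
    """
    # Group by distinct score, counting positives and negatives per score.
    counts = {}
    for score, label in score_label_pairs:
        pos, neg = counts.get(score, (0, 0))
        counts[score] = (pos + 1, neg) if label == 1 else (pos, neg + 1)

    # Sort by score, descending, so each point is one threshold on the curve.
    ordered = sorted(counts.items(), key=lambda kv: kv[0], reverse=True)
    pos_counts = [pos for _, (pos, _) in ordered]
    neg_counts = [neg for _, (_, neg) in ordered]
    total_pos, total_neg = sum(pos_counts), sum(neg_counts)

    # Cumulative counts use every record, so all data shapes the curve.
    cum_pos = accumulate(pos_counts)
    cum_neg = accumulate(neg_counts)
    return [(fp / total_neg, tp / total_pos) for tp, fp in zip(cum_pos, cum_neg)]


def downsample(points, factor):
    """Keep every `factor`-th point of an already ordered curve."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    kept = points[::factor]
    # Assumption of this sketch: always keep the curve's final point.
    if points and kept[-1] != points[-1]:
        kept.append(points[-1])
    return kept
```

Downsampling happens only after the cumulative counts are computed, so the retained points lie exactly on the full curve; only the resolution drops. A caller would write something like `downsample(roc_points(pairs), 100)` before collecting the result locally.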
Diffstat (limited to 'project')
0 files changed, 0 insertions, 0 deletions