SPARK-4547 [MLLIB] OOM when making bins in BinaryClassificationMetrics - spark

author	Sean Owen <sowen@cloudera.com>	2014-12-31 13:37:04 -0800
committer	Xiangrui Meng <meng@databricks.com>	2014-12-31 13:37:04 -0800
commit	3d194cc75761fceba77b2c91291b36479b8b556c (patch)
tree	64f44f3965c52baf79b78944f498afc01028d697 /project
parent	8e14c5eb551ab06c94859c7f6d8c6b62b4d00d59 (diff)
download	spark-3d194cc75761fceba77b2c91291b36479b8b556c.tar.gz spark-3d194cc75761fceba77b2c91291b36479b8b556c.tar.bz2 spark-3d194cc75761fceba77b2c91291b36479b8b556c.zip

SPARK-4547 [MLLIB] OOM when making bins in BinaryClassificationMetrics

Now that I've implemented the basics here, I'm less convinced that this change is needed. Callers can downsample before or after. The OOM is really not in the ROC curve code, but in code that might `collect()` the curve for local analysis. Still, it might be useful to downsample, since the ROC curve probably never needs millions of points.

This is a first pass. Since the `(score,label)` pairs are already grouped and sorted, I think it's sufficient to just take every Nth such pair in order to downsample by a factor of N. This is just like retaining every Nth point on the curve, which I think is the goal. All of the data is still used to build the curve, of course.

What do you think about the API, and usefulness?

Author: Sean Owen <sowen@cloudera.com>

Closes #3702 from srowen/SPARK-4547 and squashes the following commits:

1d34d05 [Sean Owen] Indent and reorganize numBins scaladoc
692d825 [Sean Owen] Change handling of large numBins, make 2nd constructor instead of optional param, style change
a03610e [Sean Owen] Add downsamplingFactor to BinaryClassificationMetrics
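
The idea described in the message (build the curve from all of the data, then keep every Nth point) can be illustrated with a minimal plain-Python sketch. This is not the code from the patch and does not use Spark; the function names `roc_points` and `downsample` are hypothetical, and the sketch assumes binary labels encoded as 0/1 with at least one example of each class.

```python
def roc_points(scored_labels):
    """Build ROC points (FPR, TPR) from (score, label) pairs.

    Pairs are grouped by distinct score and sorted by descending score,
    so the curve gets one point per distinct score plus the origin.
    Assumes labels are 0 or 1 and that both classes are present.
    """
    counts = {}
    for score, label in scored_labels:
        pos, neg = counts.get(score, (0, 0))
        counts[score] = (pos + 1, neg) if label == 1 else (pos, neg + 1)

    ordered = sorted(counts.items(), key=lambda kv: kv[0], reverse=True)
    total_pos = sum(p for _, (p, _) in ordered)
    total_neg = sum(n for _, (_, n) in ordered)

    points = [(0.0, 0.0)]
    cum_pos = cum_neg = 0
    for _, (p, n) in ordered:
        cum_pos += p
        cum_neg += n
        points.append((cum_neg / total_neg, cum_pos / total_pos))
    return points


def downsample(points, factor):
    """Keep every `factor`-th point of an already ordered curve.

    The final point is always retained so the curve still ends at (1, 1).
    """
    if not points or factor <= 1:
        return list(points)
    kept = points[::factor]
    if (len(points) - 1) % factor != 0:
        kept.append(points[-1])
    return kept


data = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0), (0.4, 0)]
full = roc_points(data)          # 7 points, built from all of the data
reduced = downsample(full, 3)    # keeps the points at indices 0, 3 and 6
```

Every pair still contributes to the cumulative counts, so the retained points lie exactly on the full curve; only the resolution drops. Forcing the last point to be kept is a choice made for this sketch so that the curve's end point is not lost when the length is not a multiple of the factor.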

Diffstat (limited to 'project')

0 files changed, 0 insertions, 0 deletions
