[SPARK-7559] [MLLIB] Bucketizer should include the right most boundary in the last bucket. - spark


author	Xiangrui Meng <meng@databricks.com>	2015-05-12 14:24:26 -0700
committer	Joseph K. Bradley <joseph@databricks.com>	2015-05-12 14:24:33 -0700
commit	98ccd934f3402af944457d839fd2e316059367f5 (patch)
tree	671b2a79b628eb45bea36251a7ed2fee10dfcaad /sql
parent	c68485e7a77ac3225d563f1da2a94f9cc691ac61 (diff)
download	spark-98ccd934f3402af944457d839fd2e316059367f5.tar.gz spark-98ccd934f3402af944457d839fd2e316059367f5.tar.bz2 spark-98ccd934f3402af944457d839fd2e316059367f5.zip

[SPARK-7559] [MLLIB] Bucketizer should include the right most boundary in the last bucket.

We currently give +inf special treatment in `Bucketizer`. This could be simplified by always including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 into bad, neutral, and good with splits 0, 4, 6, 10. It would read oddly if users had to write 0, 4, 6, 10.1 (or 11) instead.

This also updates the implementation to use `Arrays.binarySearch` and the tests to use `withClue`.

yinxusen jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6075 from mengxr/SPARK-7559 and squashes the following commits:

e28f910 [Xiangrui Meng] update bucketizer impl

(cherry picked from commit 23b9863e2aa7ecd0c4fa3aa8a59fdae09b4fe1d7)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
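The bucketing rule described in the message can be sketched with `java.util.Arrays.binarySearch`. This is a minimal illustration, not the code from this commit: the class `BucketizerSketch`, the method `bucket`, and the choice to throw on out-of-range values are assumptions made for the example. It also assumes `splits` is strictly increasing and has at least two entries.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not the Spark Bucketizer source.
final class BucketizerSketch {
    private BucketizerSketch() {}

    /**
     * Returns the bucket index of `feature`.
     * Assumes `splits` is strictly increasing with at least two entries.
     * Buckets are [s0, s1), [s1, s2), ..., [s(n-2), s(n-1)]; only the last is closed.
     */
    static int bucket(double[] splits, double feature) {
        int last = splits.length - 1;
        if (feature == splits[last]) {
            // The right-most boundary belongs to the last bucket.
            return last - 1;
        }
        int idx = Arrays.binarySearch(splits, feature);
        if (idx >= 0) {
            // Exact hit on an inner split: it is the lower bound of bucket idx.
            return idx;
        }
        // For a miss, binarySearch returns -(insertionPoint) - 1.
        int insertPos = -idx - 1;
        if (insertPos == 0 || insertPos == splits.length) {
            // Below the first split or above the last one (assumed to be an error here).
            throw new IllegalArgumentException(
                "Feature value " + feature + " is outside the splits range");
        }
        return insertPos - 1;
    }
}
```

With splits 0, 4, 6, 10, a rating of 10 lands in bucket 2 (good) under this sketch, so no artificial upper bound such as 10.1 or 11 is needed.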

Diffstat (limited to 'sql')

0 files changed, 0 insertions, 0 deletions

