author | Yanbo Liang <ybliang8@gmail.com> | 2015-07-06 16:15:12 -0700 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2015-07-06 16:15:12 -0700 |
commit | 0effe180f4c2cf37af1012b33b43912bdecaf756 (patch) | |
tree | af6542ca78aac976f775aeffa8e8af082a93ea7e /python/pyspark/mllib/clustering.py | |
parent | 96c5eeec3970e8b1ebc6ddf5c97a7acc47f539dc (diff) | |
[SPARK-8765] [MLLIB] Fix PySpark PowerIterationClustering test issue
The PySpark PowerIterationClustering test fails because of bad demo data.
If the data set is small, PowerIterationClustering behaves nondeterministically.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #7177 from yanboliang/spark-8765 and squashes the following commits:
392ae54 [Yanbo Liang] fix model.assignments output
5ec3f1e [Yanbo Liang] fix PySpark PowerIterationClustering test issue
Diffstat (limited to 'python/pyspark/mllib/clustering.py')
-rw-r--r-- | python/pyspark/mllib/clustering.py | 16 |
1 files changed, 14 insertions, 2 deletions
diff --git a/python/pyspark/mllib/clustering.py b/python/pyspark/mllib/clustering.py
index a3eab63528..ed4d78a2c6 100644
--- a/python/pyspark/mllib/clustering.py
+++ b/python/pyspark/mllib/clustering.py
@@ -282,18 +282,30 @@ class PowerIterationClusteringModel(JavaModelWrapper, JavaSaveable, JavaLoader):
     Model produced by [[PowerIterationClustering]].
 
-    >>> data = [(0, 1, 1.0), (0, 2, 1.0), (1, 3, 1.0), (2, 3, 1.0),
-    ... (0, 3, 1.0), (1, 2, 1.0), (0, 4, 0.1)]
+    >>> data = [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0), (1, 2, 1.0), (1, 3, 1.0),
+    ... (2, 3, 1.0), (3, 4, 0.1), (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0),
+    ... (6, 7, 1.0), (7, 8, 1.0), (8, 9, 1.0), (9, 10, 1.0), (10, 11, 1.0),
+    ... (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0)]
     >>> rdd = sc.parallelize(data, 2)
     >>> model = PowerIterationClustering.train(rdd, 2, 100)
     >>> model.k
     2
+    >>> result = sorted(model.assignments().collect(), key=lambda x: x.id)
+    >>> result[0].cluster == result[1].cluster == result[2].cluster == result[3].cluster
+    True
+    >>> result[4].cluster == result[5].cluster == result[6].cluster == result[7].cluster
+    True
     >>> import os, tempfile
     >>> path = tempfile.mkdtemp()
     >>> model.save(sc, path)
     >>> sameModel = PowerIterationClusteringModel.load(sc, path)
     >>> sameModel.k
     2
+    >>> result = sorted(model.assignments().collect(), key=lambda x: x.id)
+    >>> result[0].cluster == result[1].cluster == result[2].cluster == result[3].cluster
+    True
+    >>> result[4].cluster == result[5].cluster == result[6].cluster == result[7].cluster
+    True
     >>> from shutil import rmtree
     >>> try:
     ... rmtree(path)
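The sketch below is not part of the commit. It is a plain-Python illustration, with no Spark dependency, of two things the new doctest seems to rely on: the replacement demo data splits into two tightly connected groups joined by a single weak edge, and the added assertions compare cluster membership rather than concrete label values. The helpers `strong_components` and `same_cluster`, the `Assignment` namedtuple, and the two hand-written labelings are hypothetical stand-ins introduced here for illustration; they only mimic the `id` and `cluster` fields that the doctest reads from `model.assignments()`.

```python
from collections import namedtuple

# New demo data copied from the diff: (src, dst, similarity) triples.
data = [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0), (1, 2, 1.0), (1, 3, 1.0),
        (2, 3, 1.0), (3, 4, 0.1), (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0),
        (6, 7, 1.0), (7, 8, 1.0), (8, 9, 1.0), (9, 10, 1.0), (10, 11, 1.0),
        (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0)]


def strong_components(edges, threshold=1.0):
    """Group nodes linked by edges whose similarity is >= threshold.

    Hypothetical helper (simple union-find), not part of PySpark.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst, weight in edges:
        root_src, root_dst = find(src), find(dst)
        if weight >= threshold:
            parent[root_src] = root_dst

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical stand-in for the rows returned by model.assignments().
Assignment = namedtuple("Assignment", ["id", "cluster"])


def same_cluster(result, ids):
    """True if every listed id carries the same cluster label.

    Assumes `result` is sorted by id and ids start at 0, as in the doctest.
    """
    return len({result[i].cluster for i in ids}) == 1


# Two hand-written labelings that differ only in which label each group got.
labeling_a = [Assignment(i, 0 if i < 4 else 1) for i in range(16)]
labeling_b = [Assignment(i, 1 if i < 4 else 0) for i in range(16)]

components = strong_components(data)
print(components)
for labeling in (labeling_a, labeling_b):
    print(same_cluster(labeling, range(0, 4)), same_cluster(labeling, range(4, 8)))
```

Under these assumptions, dropping the single 0.1-weight edge leaves nodes 0 to 3 and nodes 4 to 15 as separate groups, and both labelings pass the same membership checks even though their label values are swapped. Note that the doctest in the diff, like this sketch, only asserts equality within each group; it does not assert that the two groups receive different labels.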