[SPARK-15018][PYSPARK][ML] Improve handling of PySpark Pipeline when used without stages

## What changes were proposed in this pull request?

When fitting a PySpark Pipeline without the `stages` param set, a confusing NoneType error is raised as the fit attempts to iterate over the pipeline stages. A pipeline with no stages should act as an identity transform; however, the `stages` param still needs to be set to an empty list. This change improves the error output when the `stages` param is not set and adds a better description of what the API expects as input. It also includes minor cleanup of related code.

## How was this patch tested?

Added new unit tests to verify that an empty Pipeline acts as an identity transformer.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #12790 from BryanCutler/pipeline-identity-SPARK-15018.
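The behavior described above can be illustrated with a minimal, self-contained sketch. It does not use PySpark and is not the code changed by this patch: `ToyPipeline`, `ToyPipelineModel`, `get_stages`, and the `_param_map` dictionary are hypothetical names, chosen only to show the two cases (an empty `stages` list giving an identity transform, and an unset `stages` param failing with a `KeyError` instead of a NoneType error).

```python
class ToyPipelineModel:
    """Hypothetical fitted pipeline: applies each stage's transform in order."""

    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, dataset):
        # With no stages the loop body never runs, so the input is returned as-is.
        for stage in self.stages:
            dataset = stage.transform(dataset)
        return dataset


class ToyPipeline:
    """Hypothetical pipeline whose `stages` param may be left unset."""

    def __init__(self, stages=None):
        # Assumption for this sketch: params live in a plain dict, and an
        # unset param is simply absent from it.
        self._param_map = {}
        if stages is not None:
            self._param_map["stages"] = list(stages)

    def get_stages(self):
        # A missing param raises KeyError here, rather than returning None
        # and failing later when something tries to iterate over it.
        return self._param_map["stages"]

    def fit(self, dataset):
        # Simplification: every stage is treated as an already-fitted transformer.
        return ToyPipelineModel(self.get_stages())


if __name__ == "__main__":
    data = [1, 2, 3]

    # Empty stages list: fit and transform hand back the same dataset.
    assert ToyPipeline(stages=[]).fit(data).transform(data) is data

    # Unset stages: the failure names the missing param.
    try:
        ToyPipeline().fit(data)
    except KeyError as err:
        print("stages param not set:", err)
```

The sketch mirrors the shape of the new unit test: one assertion for the identity case and one for the missing-param case.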
author: Bryan Cutler <cutlerb@gmail.com> 2016-08-19 23:46:36 -0700
committer: Yanbo Liang <ybliang8@gmail.com> 2016-08-19 23:46:36 -0700
commit: 39f328ba3519b01940a7d1cdee851ba4e75ef31f (patch)
tree: 467a209b875a76164d11c28c86f84b547fa3215e /python/pyspark/ml/tests.py
parent: 45d40d9f66c666eec6df926db23937589d67225d (diff)
download: spark-39f328ba3519b01940a7d1cdee851ba4e75ef31f.tar.gz
spark-39f328ba3519b01940a7d1cdee851ba4e75ef31f.tar.bz2
spark-39f328ba3519b01940a7d1cdee851ba4e75ef31f.zip
1 files changed, 11 insertions, 0 deletions
diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py
index 4bcb2c400c..6886ed321e 100755
--- a/python/pyspark/ml/tests.py
+++ b/python/pyspark/ml/tests.py
@@ -230,6 +230,17 @@ class PipelineTests(PySparkTestCase):
         self.assertEqual(5, transformer3.dataset_index)
         self.assertEqual(6, dataset.index)
 
+    def test_identity_pipeline(self):
+        dataset = MockDataset()
+
+        def doTransform(pipeline):
+            pipeline_model = pipeline.fit(dataset)
+            return pipeline_model.transform(dataset)
+        # check that empty pipeline did not perform any transformation
+        self.assertEqual(dataset.index, doTransform(Pipeline(stages=[])).index)
+        # check that failure to set stages param will raise KeyError for missing param
+        self.assertRaises(KeyError, lambda: doTransform(Pipeline()))
+
 
 class TestParams(HasMaxIter, HasInputCol, HasSeed):
     """