[SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD

Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley. ~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~ marmbrus jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #3070 from mengxr/SPARK-3573 and squashes the following commits: 3a0b6e5 [Xiangrui Meng] organize imports 236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples
author: Xiangrui Meng <meng@databricks.com> 2014-11-03 22:29:48 -0800
committer: Xiangrui Meng <meng@databricks.com> 2014-11-03 22:29:48 -0800
commit: 1a9c6cddadebdc53d083ac3e0da276ce979b5d1f (patch)
tree: b485818ba52a9287ae7124e57ef55f1d974f3a1f /examples/src/main/python
parent: 04450d11548cfb25d4fb77d4a33e3a7cd4254183 (diff)
download: spark-1a9c6cddadebdc53d083ac3e0da276ce979b5d1f.tar.gz
spark-1a9c6cddadebdc53d083ac3e0da276ce979b5d1f.tar.bz2
spark-1a9c6cddadebdc53d083ac3e0da276ce979b5d1f.zip
1 files changed, 62 insertions, 0 deletions
diff --git a/examples/src/main/python/mllib/dataset_example.py b/examples/src/main/python/mllib/dataset_example.py
new file mode 100644
index 0000000000..540dae785f
--- /dev/null
+++ b/examples/src/main/python/mllib/dataset_example.py
@@ -0,0 +1,62 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+An example of how to use SchemaRDD as a dataset for ML. Run with::
+    bin/spark-submit examples/src/main/python/mllib/dataset_example.py
+"""
+
+import os
+import sys
+import tempfile
+import shutil
+
+from pyspark import SparkContext
+from pyspark.sql import SQLContext
+from pyspark.mllib.util import MLUtils
+from pyspark.mllib.stat import Statistics
+
+
+def summarize(dataset):
+    print "schema: %s" % dataset.schema().json()
+    labels = dataset.map(lambda r: r.label)
+    print "label average: %f" % labels.mean()
+    features = dataset.map(lambda r: r.features)
+    summary = Statistics.colStats(features)
+    print "features average: %r" % summary.mean()
+
+if __name__ == "__main__":
+    if len(sys.argv) > 2:
+        print >> sys.stderr, "Usage: dataset_example.py <libsvm file>"
+        exit(-1)
+    sc = SparkContext(appName="DatasetExample")
+    sqlCtx = SQLContext(sc)
+    if len(sys.argv) == 2:
+        input = sys.argv[1]
+    else:
+        input = "data/mllib/sample_libsvm_data.txt"
+    points = MLUtils.loadLibSVMFile(sc, input)
+    dataset0 = sqlCtx.inferSchema(points).setName("dataset0").cache()
+    summarize(dataset0)
+    tempdir = tempfile.NamedTemporaryFile(delete=False).name
+    os.unlink(tempdir)
+    print "Save dataset as a Parquet file to %s." % tempdir
+    dataset0.saveAsParquetFile(tempdir)
+    print "Load it back and summarize it again."
+    dataset1 = sqlCtx.parquetFile(tempdir).setName("dataset1").cache()
+    summarize(dataset1)
+    shutil.rmtree(tempdir)
author	Xiangrui Meng <meng@databricks.com>	2014-11-03 22:29:48 -0800
committer	Xiangrui Meng <meng@databricks.com>	2014-11-03 22:29:48 -0800
commit	1a9c6cddadebdc53d083ac3e0da276ce979b5d1f (patch)
tree	b485818ba52a9287ae7124e57ef55f1d974f3a1f /examples/src/main/python
parent	04450d11548cfb25d4fb77d4a33e3a7cd4254183 (diff)
download	spark-1a9c6cddadebdc53d083ac3e0da276ce979b5d1f.tar.gz spark-1a9c6cddadebdc53d083ac3e0da276ce979b5d1f.tar.bz2 spark-1a9c6cddadebdc53d083ac3e0da276ce979b5d1f.zip