author | hyukjinkwon <gurwls223@gmail.com> | 2017-01-10 02:18:07 -0800
---|---|---
committer | Yanbo Liang <ybliang8@gmail.com> | 2017-01-10 02:18:07 -0800
commit | b0e5840d4b37d7b73e300671795185bba37effb0 (patch) |
tree | ff941b0f960778dd592333b12c523e2722545843 /examples/src |
parent | 3ef6d98a803fdff182ab4556c3273ec5fa0ff002 (diff) |
[SPARK-19134][EXAMPLE] Fix several sql, mllib and status api examples not working
## What changes were proposed in this pull request?
**binary_classification_metrics_example.py**
The `libsvm` data source loads features as `ml.linalg.SparseVector`, whereas `LabeledPoint` in this example requires `mllib.linalg.SparseVector`. The equivalent Scala example, `BinaryClassificationMetricsExample.scala`, seems fine.
```bash
./bin/spark-submit examples/src/main/python/mllib/binary_classification_metrics_example.py
```
```
File ".../spark/examples/src/main/python/mllib/binary_classification_metrics_example.py", line 39, in <lambda>
.rdd.map(lambda row: LabeledPoint(row[0], row[1]))
File ".../spark/python/pyspark/mllib/regression.py", line 54, in __init__
self.features = _convert_to_vector(features)
File ".../spark/python/pyspark/mllib/linalg/__init__.py", line 80, in _convert_to_vector
raise TypeError("Cannot convert type %s into Vector" % type(l))
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
```
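A minimal sketch of the fix applied in this patch (see the diff below): load the data with `MLUtils.loadLibSVMFile`, which returns an RDD of `mllib` `LabeledPoint`s directly, so the `ml`/`mllib` vector mismatch never arises.
```python
from pyspark import SparkContext
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="BinaryClassificationMetricsExample")

# loadLibSVMFile yields an RDD[LabeledPoint] whose features are
# mllib.linalg vectors, so no manual row-to-LabeledPoint conversion is needed.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
```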
**status_api_demo.py** (this one does not work on Python 3; reproduced with 3.4.6)
The `Queue` module was renamed to `queue` in Python 3, so the unguarded `import Queue` fails.
```bash
PYSPARK_PYTHON=python3 ./bin/spark-submit examples/src/main/python/status_api_demo.py
```
```
Traceback (most recent call last):
File ".../spark/examples/src/main/python/status_api_demo.py", line 22, in <module>
import Queue
ImportError: No module named 'Queue'
```
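The fix, as in the patch below, guards the import by Python version so the example runs under both 2 and 3:
```python
import sys

# Python 3 renamed the Queue module to queue; alias it so the rest of
# the example can keep referring to Queue under either version.
if sys.version >= '3':
    import queue as Queue
else:
    import Queue
```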
**bisecting_k_means_example.py**
`BisectingKMeansModel` does not implement `save` and `load` in Python.
```bash
./bin/spark-submit examples/src/main/python/mllib/bisecting_k_means_example.py
```
```
Traceback (most recent call last):
File ".../spark/examples/src/main/python/mllib/bisecting_k_means_example.py", line 46, in <module>
model.save(sc, path)
AttributeError: 'BisectingKMeansModel' object has no attribute 'save'
```
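The patch simply drops the persistence round-trip; training and evaluation alone look like this (a sketch, with illustrative parameter values, assuming `parsedData` is the RDD of feature arrays built earlier in the example):
```python
from pyspark.mllib.clustering import BisectingKMeans

# Train and evaluate only; BisectingKMeansModel exposes no save()/load()
# in the Python API at this point, so the model is not persisted.
model = BisectingKMeans.train(parsedData, 2)
cost = model.computeCost(parsedData)
print("Bisecting K-means Cost = " + str(cost))
```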
**elementwise_product_example.py**
It calls `collect()` on the result of transforming a single vector; `ElementwiseProduct.transform` applied to one vector returns a vector, not an RDD.
```bash
./bin/spark-submit examples/src/main/python/mllib/elementwise_product_example.py
```
```
Traceback (most recent call last):
File ".../spark/examples/src/main/python/mllib/elementwise_product_example.py", line 48, in <module>
for each in transformedData2.collect():
File ".../spark/python/pyspark/mllib/linalg/__init__.py", line 478, in __getattr__
return getattr(self.array, item)
AttributeError: 'numpy.ndarray' object has no attribute 'collect'
```
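For context, a transformed single vector is iterated directly in the fixed example. A sketch with illustrative values:
```python
from pyspark.mllib.feature import ElementwiseProduct
from pyspark.mllib.linalg import Vectors

transformer = ElementwiseProduct(Vectors.dense([0.0, 1.0, 2.0]))

# transform() on a single vector returns a DenseVector, not an RDD,
# so there is no collect() to call; iterate the vector itself.
transformedData2 = transformer.transform(Vectors.dense([1.0, 2.0, 3.0]))
for each in transformedData2:
    print(each)
```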
**These three examples throw an exception because a relative path is set in `spark.sql.warehouse.dir` (the fix is sketched after the three tracebacks below).**
**hive.py**
```bash
./bin/spark-submit examples/src/main/python/sql/hive.py
```
```
Traceback (most recent call last):
File ".../spark/examples/src/main/python/sql/hive.py", line 47, in <module>
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
File ".../spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 541, in sql
File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File ".../spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse);'
```
**SparkHiveExample.scala**
```bash
./bin/run-example sql.hive.SparkHiveExample
```
```
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse
at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:498)
at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
```
**JavaSparkHiveExample.java**
```bash
./bin/run-example sql.hive.JavaSparkHiveExample
```
```
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse
at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:498)
at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
```
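All three share the same fix: make the warehouse location absolute before it reaches `spark.sql.warehouse.dir` (Scala and Java use `new File("spark-warehouse").getAbsolutePath`, as in the diff below). A minimal Python sketch of the fixed session setup; the app name is illustrative:
```python
from os.path import abspath

from pyspark.sql import SparkSession

# An absolute path yields a valid URI for spark.sql.warehouse.dir;
# the bare relative 'spark-warehouse' triggers the URISyntaxException above.
warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
```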
## How was this patch tested?
Manually via
```bash
./bin/spark-submit examples/src/main/python/mllib/binary_classification_metrics_example.py
PYSPARK_PYTHON=python3 ./bin/spark-submit examples/src/main/python/status_api_demo.py
./bin/spark-submit examples/src/main/python/mllib/bisecting_k_means_example.py
./bin/spark-submit examples/src/main/python/mllib/elementwise_product_example.py
./bin/spark-submit examples/src/main/python/sql/hive.py
./bin/run-example sql.hive.JavaSparkHiveExample
./bin/run-example sql.hive.SparkHiveExample
```
These were found via
```bash
find ./examples/src/main/python -name "*.py" -exec spark-submit {} \;
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #16515 from HyukjinKwon/minor-example-fix.
Diffstat (limited to 'examples/src')
7 files changed, 18 insertions, 21 deletions
```diff
diff --git a/examples/src/main/java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java b/examples/src/main/java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java
index 8d06d38cf2..2fe1307d8e 100644
--- a/examples/src/main/java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java
+++ b/examples/src/main/java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java
@@ -17,6 +17,7 @@
 package org.apache.spark.examples.sql.hive;
 
 // $example on:spark_hive$
+import java.io.File;
 import java.io.Serializable;
 import java.util.ArrayList;
 import java.util.List;
@@ -56,7 +57,7 @@ public class JavaSparkHiveExample {
   public static void main(String[] args) {
     // $example on:spark_hive$
     // warehouseLocation points to the default location for managed databases and tables
-    String warehouseLocation = "spark-warehouse";
+    String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
     SparkSession spark = SparkSession
       .builder()
       .appName("Java Spark Hive Example")
diff --git a/examples/src/main/python/mllib/binary_classification_metrics_example.py b/examples/src/main/python/mllib/binary_classification_metrics_example.py
index 91f8378f29..d14ce7982e 100644
--- a/examples/src/main/python/mllib/binary_classification_metrics_example.py
+++ b/examples/src/main/python/mllib/binary_classification_metrics_example.py
@@ -18,25 +18,20 @@
 Binary Classification Metrics Example.
 """
 from __future__ import print_function
-from pyspark.sql import SparkSession
+from pyspark import SparkContext
 # $example on$
 from pyspark.mllib.classification import LogisticRegressionWithLBFGS
 from pyspark.mllib.evaluation import BinaryClassificationMetrics
-from pyspark.mllib.regression import LabeledPoint
+from pyspark.mllib.util import MLUtils
 # $example off$
 
 if __name__ == "__main__":
-    spark = SparkSession\
-        .builder\
-        .appName("BinaryClassificationMetricsExample")\
-        .getOrCreate()
+    sc = SparkContext(appName="BinaryClassificationMetricsExample")
 
     # $example on$
     # Several of the methods available in scala are currently missing from pyspark
     # Load training data in LIBSVM format
-    data = spark\
-        .read.format("libsvm").load("data/mllib/sample_binary_classification_data.txt")\
-        .rdd.map(lambda row: LabeledPoint(row[0], row[1]))
+    data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
 
     # Split data into training (60%) and test (40%)
     training, test = data.randomSplit([0.6, 0.4], seed=11)
@@ -58,4 +53,4 @@ if __name__ == "__main__":
     print("Area under ROC = %s" % metrics.areaUnderROC)
     # $example off$
 
-    spark.stop()
+    sc.stop()
diff --git a/examples/src/main/python/mllib/bisecting_k_means_example.py b/examples/src/main/python/mllib/bisecting_k_means_example.py
index 7f4d0402d6..31f3e72d7f 100644
--- a/examples/src/main/python/mllib/bisecting_k_means_example.py
+++ b/examples/src/main/python/mllib/bisecting_k_means_example.py
@@ -40,11 +40,6 @@ if __name__ == "__main__":
     # Evaluate clustering
     cost = model.computeCost(parsedData)
     print("Bisecting K-means Cost = " + str(cost))
-
-    # Save and load model
-    path = "target/org/apache/spark/PythonBisectingKMeansExample/BisectingKMeansModel"
-    model.save(sc, path)
-    sameModel = BisectingKMeansModel.load(sc, path)
     # $example off$
 
     sc.stop()
diff --git a/examples/src/main/python/mllib/elementwise_product_example.py b/examples/src/main/python/mllib/elementwise_product_example.py
index 6d8bf6d42e..8ae9afb1dc 100644
--- a/examples/src/main/python/mllib/elementwise_product_example.py
+++ b/examples/src/main/python/mllib/elementwise_product_example.py
@@ -45,7 +45,7 @@ if __name__ == "__main__":
         print(each)
 
     print("transformedData2:")
-    for each in transformedData2.collect():
+    for each in transformedData2:
         print(each)
 
     sc.stop()
diff --git a/examples/src/main/python/sql/hive.py b/examples/src/main/python/sql/hive.py
index ba01544a5b..1f175d7258 100644
--- a/examples/src/main/python/sql/hive.py
+++ b/examples/src/main/python/sql/hive.py
@@ -18,7 +18,7 @@
 from __future__ import print_function
 
 # $example on:spark_hive$
-from os.path import expanduser, join
+from os.path import expanduser, join, abspath
 
 from pyspark.sql import SparkSession
 from pyspark.sql import Row
@@ -34,7 +34,7 @@ Run with:
 if __name__ == "__main__":
     # $example on:spark_hive$
     # warehouse_location points to the default location for managed databases and tables
-    warehouse_location = 'spark-warehouse'
+    warehouse_location = abspath('spark-warehouse')
 
     spark = SparkSession \
         .builder \
diff --git a/examples/src/main/python/status_api_demo.py b/examples/src/main/python/status_api_demo.py
index 49b7902185..8cc8cc820c 100644
--- a/examples/src/main/python/status_api_demo.py
+++ b/examples/src/main/python/status_api_demo.py
@@ -19,7 +19,11 @@
 from __future__ import print_function
 
 import time
 import threading
-import Queue
+import sys
+if sys.version >= '3':
+    import queue as Queue
+else:
+    import Queue
 
 from pyspark import SparkConf, SparkContext
diff --git a/examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala b/examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
index d29ed958fe..3de26364b5 100644
--- a/examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala
@@ -17,6 +17,8 @@
 package org.apache.spark.examples.sql.hive
 
 // $example on:spark_hive$
+import java.io.File
+
 import org.apache.spark.sql.Row
 import org.apache.spark.sql.SparkSession
 // $example off:spark_hive$
@@ -38,7 +40,7 @@ object SparkHiveExample {
 
     // $example on:spark_hive$
     // warehouseLocation points to the default location for managed databases and tables
-    val warehouseLocation = "spark-warehouse"
+    val warehouseLocation = new File("spark-warehouse").getAbsolutePath
 
     val spark = SparkSession
       .builder()
```