author    | gatorsmile <gatorsmile@gmail.com> | 2015-11-20 11:20:47 -0800
committer | Michael Armbrust <michael@databricks.com> | 2015-11-20 11:20:47 -0800
commit    | bef361c589c0a38740232fd8d0a45841e4fc969a (patch)
tree      | 04c35fdef5eea65c9a8d77cff00171de0b2f2344 /sql
parent    | e359d5dcf5bd300213054ebeae9fe75c4f7eb9e7 (diff)
[SPARK-11876][SQL] Support printSchema in DataSet API
The DataSet APIs look great! However, I get lost when doing multi-level joins. For example:
```
val ds1 = Seq(("a", 1), ("b", 2)).toDS().as("a")
val ds2 = Seq(("a", 1), ("b", 2)).toDS().as("b")
val ds3 = Seq(("a", 1), ("b", 2)).toDS().as("c")
ds1.joinWith(ds2, $"a._2" === $"b._2").as("ab").joinWith(ds3, $"ab._1._2" === $"c._2").printSchema()
```
The printed schema looks like this:
```
root
|-- _1: struct (nullable = true)
| |-- _1: struct (nullable = true)
| | |-- _1: string (nullable = true)
| | |-- _2: integer (nullable = true)
| |-- _2: struct (nullable = true)
| | |-- _1: string (nullable = true)
| | |-- _2: integer (nullable = true)
|-- _2: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: integer (nullable = true)
```
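For reference, the nesting in the printed schema mirrors plain Scala tuple nesting: each `joinWith` pairs its two inputs into a tuple. A minimal non-Spark sketch (the values here are illustrative, not taken from the join above):

```scala
// Each joinWith produces a pair, so chaining joins nests tuples.
// These plain tuples have the same shape as the schema printed above.
object TupleNesting extends App {
  val ab: ((String, Int), (String, Int)) = (("a", 1), ("b", 2))           // first joinWith
  val abc: (((String, Int), (String, Int)), (String, Int)) = (ab, ("c", 3)) // second joinWith
  // The access path follows the schema tree: _1 -> _2 -> _2
  println(abc._1._2._2) // prints 2
}
```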
Personally, I think we need the printSchema function. Sometimes, I do not know how to specify a column, especially when the data types are mixed. For example, to write the following select against the result of the multi-level join above (call it `newDS`), I have to know the schema:
```
newDS.select(expr("_1._2._2 + 1").as[Int]).collect()
```
marmbrus rxin cloud-fan Do you have the same feeling?
Author: gatorsmile <gatorsmile@gmail.com>
Closes #9855 from gatorsmile/printSchemaDataSet.
Diffstat (limited to 'sql')
-rw-r--r-- | sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala | 9
-rw-r--r-- | sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala | 9
2 files changed, 9 insertions, 9 deletions
```diff
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
index 98358127e2..7abcecaa28 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
@@ -300,15 +300,6 @@ class DataFrame private[sql](
   def columns: Array[String] = schema.fields.map(_.name)

   /**
-   * Prints the schema to the console in a nice tree format.
-   * @group basic
-   * @since 1.3.0
-   */
-  // scalastyle:off println
-  def printSchema(): Unit = println(schema.treeString)
-  // scalastyle:on println
-
-  /**
    * Returns true if the `collect` and `take` methods can be run locally
    * (without any Spark executors).
    * @group basic
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala
index e86a52c149..321e2c7835 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala
@@ -38,6 +38,15 @@ private[sql] trait Queryable {
   }

   /**
+   * Prints the schema to the console in a nice tree format.
+   * @group basic
+   * @since 1.3.0
+   */
+  // scalastyle:off println
+  def printSchema(): Unit = println(schema.treeString)
+  // scalastyle:on println
+
+  /**
    * Prints the plans (logical and physical) to the console for debugging purposes.
    * @since 1.3.0
    */
```
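The diff moves `printSchema` from the concrete `DataFrame` class into the shared `Queryable` trait, so both DataFrame and Dataset inherit it. A minimal sketch of that pattern, with `Schema`, `Table`, and `View` as hypothetical stand-ins for the real Spark types:

```scala
// Sketch of hoisting a concrete method into a shared trait.
// Schema, Table, and View are illustrative stand-ins, not Spark classes.
final case class Schema(treeString: String)

trait Queryable {
  def schema: Schema
  // Defined once here; every implementor inherits printSchema for free.
  def printSchema(): Unit = println(schema.treeString)
}

final class Table extends Queryable {
  def schema: Schema = Schema("root\n |-- _1: string (nullable = true)")
}

final class View extends Queryable {
  def schema: Schema = Schema("root\n |-- _2: integer (nullable = true)")
}

object Demo extends App {
  new Table().printSchema()
  new View().printSchema()
}
```

Because the trait method only depends on the abstract `schema` member, no implementor needs to duplicate the printing logic.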