about summary refs log tree commit diff
diff options
context:
space:
mode:
author gatorsmile <gatorsmile@gmail.com> 2015-11-20 11:20:47 -0800
committer Michael Armbrust <michael@databricks.com> 2015-11-20 11:20:47 -0800
commit bef361c589c0a38740232fd8d0a45841e4fc969a (patch)
tree 04c35fdef5eea65c9a8d77cff00171de0b2f2344
parent e359d5dcf5bd300213054ebeae9fe75c4f7eb9e7 (diff)
download spark-bef361c589c0a38740232fd8d0a45841e4fc969a.tar.gz
spark-bef361c589c0a38740232fd8d0a45841e4fc969a.tar.bz2
spark-bef361c589c0a38740232fd8d0a45841e4fc969a.zip
[SPARK-11876][SQL] Support printSchema in DataSet API
DataSet APIs look great! However, I am lost when doing multiple level joins. For example, ``` val ds1 = Seq(("a", 1), ("b", 2)).toDS().as("a") val ds2 = Seq(("a", 1), ("b", 2)).toDS().as("b") val ds3 = Seq(("a", 1), ("b", 2)).toDS().as("c") ds1.joinWith(ds2, $"a._2" === $"b._2").as("ab").joinWith(ds3, $"ab._1._2" === $"c._2").printSchema() ``` The printed schema is like ``` root |-- _1: struct (nullable = true) | |-- _1: struct (nullable = true) | | |-- _1: string (nullable = true) | | |-- _2: integer (nullable = true) | |-- _2: struct (nullable = true) | | |-- _1: string (nullable = true) | | |-- _2: integer (nullable = true) |-- _2: struct (nullable = true) | |-- _1: string (nullable = true) | |-- _2: integer (nullable = true) ``` Personally, I think we need the printSchema function. Sometimes, I do not know how to specify the column, especially when their data types are mixed. For example, if I want to write the following select for the above multi-level join, I have to know the schema: ``` newDS.select(expr("_1._2._2 + 1").as[Int]).collect() ``` marmbrus rxin cloud-fan Do you have the same feeling? Author: gatorsmile <gatorsmile@gmail.com> Closes #9855 from gatorsmile/printSchemaDataSet.
-rw-r--r--sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala9
-rw-r--r--sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala9
2 files changed, 9 insertions, 9 deletions
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
index 98358127e2..7abcecaa28 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
@@ -300,15 +300,6 @@ class DataFrame private[sql](
def columns: Array[String] = schema.fields.map(_.name)
/**
- * Prints the schema to the console in a nice tree format.
- * @group basic
- * @since 1.3.0
- */
- // scalastyle:off println
- def printSchema(): Unit = println(schema.treeString)
- // scalastyle:on println
-
- /**
* Returns true if the `collect` and `take` methods can be run locally
* (without any Spark executors).
* @group basic
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala
index e86a52c149..321e2c7835 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala
@@ -38,6 +38,15 @@ private[sql] trait Queryable {
}
/**
+ * Prints the schema to the console in a nice tree format.
+ * @group basic
+ * @since 1.3.0
+ */
+ // scalastyle:off println
+ def printSchema(): Unit = println(schema.treeString)
+ // scalastyle:on println
+
+ /**
* Prints the plans (logical and physical) to the console for debugging purposes.
* @since 1.3.0
*/