author     Reynold Xin <rxin@databricks.com>    2015-05-15 22:00:31 -0700
committer  Reynold Xin <rxin@databricks.com>    2015-05-15 22:00:31 -0700
commit     578bfeeff514228f6fd4b07a536815fbb3510f7e
tree       97964df2b0b7ada4f019f2cd9617ba6af1d59f52 /examples/src/main/scala
parent     deb411335a09b91eb1f75421d77e1c3686719621
[SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API
This patch introduces DataFrameWriter and DataFrameReader.
The DataFrameReader interface, accessible through SQLContext.read, contains methods that create DataFrames. These methods previously resided on SQLContext itself. Example usage:
```scala
sqlContext.read.json("...")
sqlContext.read.parquet("...")
```
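The reader also gains general load methods in this patch (commit 26abea2), so sources without a dedicated shorthand can be read through the same builder style. A minimal sketch, assuming the generic `format`/`load` pair mirrors the shorthands above; the path is a placeholder:
```scala
// Sketch: generic load path on DataFrameReader (assumed from commit 26abea2).
// "parquet" can be any registered data source name; the path is illustrative.
val df = sqlContext.read
  .format("parquet")
  .load("/path/to/input")
```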
The DataFrameWriter interface, accessible through DataFrame.write, implements a builder pattern to avoid a proliferation of overloaded methods when writing a DataFrame out. It currently supports:
- mode
- format (e.g. "parquet", "json")
- options (generic options passed down into data sources)
- partitionBy (partitioning columns)
Example usage:
```scala
df.write.mode("append").format("json").partitionBy("date").saveAsTable("myJsonTable")
```
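The same builder calls also compose for path-based output. A minimal sketch, assuming save() targets a filesystem path the way saveAsTable targets a table; the path and settings are placeholders, not from this patch:
```scala
// Sketch: write to a path instead of a table.
// mode("overwrite") replaces any existing output; partitionBy lays the data
// out in one directory per distinct "date" value. The path is illustrative.
df.write
  .mode("overwrite")
  .format("parquet")
  .partitionBy("date")
  .save("/path/to/output")
```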
TODO:
- [ ] Documentation update
- [ ] Move JDBC into reader / writer?
- [ ] Deprecate the old interfaces
- [ ] Move the generic load interface into reader.
- [ ] Update example code and documentation
Author: Reynold Xin <rxin@databricks.com>
Closes #6175 from rxin/reader-writer and squashes the following commits:
b146c95 [Reynold Xin] Deprecation of old APIs.
bd8abdf [Reynold Xin] Fixed merge conflict.
26abea2 [Reynold Xin] Added general load methods.
244fbec [Reynold Xin] Added equivalent to example.
4f15d92 [Reynold Xin] Added documentation for partitionBy.
7e91611 [Reynold Xin] [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API.
Diffstat (limited to 'examples/src/main/scala')
-rw-r--r--  examples/src/main/scala/org/apache/spark/examples/mllib/DatasetExample.scala  2
-rw-r--r--  examples/src/main/scala/org/apache/spark/examples/sql/RDDRelation.scala       2
2 files changed, 2 insertions, 2 deletions
```diff
diff --git a/examples/src/main/scala/org/apache/spark/examples/mllib/DatasetExample.scala b/examples/src/main/scala/org/apache/spark/examples/mllib/DatasetExample.scala
index e943d6c889..c95cca7d65 100644
--- a/examples/src/main/scala/org/apache/spark/examples/mllib/DatasetExample.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/mllib/DatasetExample.scala
@@ -106,7 +106,7 @@ object DatasetExample {
     df.saveAsParquetFile(outputDir)
 
     println(s"Loading Parquet file with UDT from $outputDir.")
-    val newDataset = sqlContext.parquetFile(outputDir)
+    val newDataset = sqlContext.read.parquet(outputDir)
     println(s"Schema from Parquet: ${newDataset.schema.prettyJson}")
     val newFeatures = newDataset.select("features").map { case Row(v: Vector) => v }
diff --git a/examples/src/main/scala/org/apache/spark/examples/sql/RDDRelation.scala b/examples/src/main/scala/org/apache/spark/examples/sql/RDDRelation.scala
index 6331d1c006..acc89199d5 100644
--- a/examples/src/main/scala/org/apache/spark/examples/sql/RDDRelation.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/sql/RDDRelation.scala
@@ -61,7 +61,7 @@ object RDDRelation {
     df.saveAsParquetFile("pair.parquet")
 
     // Read in parquet file. Parquet files are self-describing so the schmema is preserved.
-    val parquetFile = sqlContext.parquetFile("pair.parquet")
+    val parquetFile = sqlContext.read.parquet("pair.parquet")
 
     // Queries can be run using the DSL on parequet files just like the original RDD.
     parquetFile.where($"key" === 1).select($"value".as("a")).collect().foreach(println)
```