Diffstat (limited to 'docs/sql-programming-guide.md')
-rw-r--r-- | docs/sql-programming-guide.md | 48 |
1 files changed, 5 insertions, 43 deletions
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 2fdc97f8a0..2d9849d032 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1467,37 +1467,6 @@ Configuration of Parquet can be done using the `setConf` method on `SQLContext`
   </td>
 </tr>
 <tr>
-  <td><code>spark.sql.parquet.output.committer.class</code></td>
-  <td><code>org.apache.parquet.hadoop.<br />ParquetOutputCommitter</code></td>
-  <td>
-    <p>
-      The output committer class used by Parquet. The specified class needs to be a subclass of
-      <code>org.apache.hadoop.<br />mapreduce.OutputCommitter</code>. Typically, it's also a
-      subclass of <code>org.apache.parquet.hadoop.ParquetOutputCommitter</code>.
-    </p>
-    <p>
-      <b>Note:</b>
-      <ul>
-        <li>
-          This option is automatically ignored if <code>spark.speculation</code> is turned on.
-        </li>
-        <li>
-          This option must be set via Hadoop <code>Configuration</code> rather than Spark
-          <code>SQLConf</code>.
-        </li>
-        <li>
-          This option overrides <code>spark.sql.sources.<br />outputCommitterClass</code>.
-        </li>
-      </ul>
-    </p>
-    <p>
-      Spark SQL comes with a builtin
-      <code>org.apache.spark.sql.<br />parquet.DirectParquetOutputCommitter</code>, which can be more
-      efficient then the default Parquet output committer when writing data to S3.
-    </p>
-  </td>
-</tr>
-<tr>
   <td><code>spark.sql.parquet.mergeSchema</code></td>
   <td><code>false</code></td>
   <td>
@@ -1533,7 +1502,7 @@ val people = sqlContext.read.json(path)
 // The inferred schema can be visualized using the printSchema() method.
 people.printSchema()
 // root
-// |-- age: integer (nullable = true)
+// |-- age: long (nullable = true)
 //  |-- name: string (nullable = true)
 
 // Register this DataFrame as a table.
@@ -1571,7 +1540,7 @@ DataFrame people = sqlContext.read().json("examples/src/main/resources/people.js
 // The inferred schema can be visualized using the printSchema() method.
 people.printSchema();
 // root
-// |-- age: integer (nullable = true)
+// |-- age: long (nullable = true)
 //  |-- name: string (nullable = true)
 
 // Register this DataFrame as a table.
@@ -1609,7 +1578,7 @@ people = sqlContext.read.json("examples/src/main/resources/people.json")
 # The inferred schema can be visualized using the printSchema() method.
 people.printSchema()
 # root
-# |-- age: integer (nullable = true)
+# |-- age: long (nullable = true)
 #  |-- name: string (nullable = true)
 
 # Register this DataFrame as a table.
@@ -1648,7 +1617,7 @@ people <- jsonFile(sqlContext, path)
 # The inferred schema can be visualized using the printSchema() method.
 printSchema(people)
 # root
-# |-- age: integer (nullable = true)
+# |-- age: long (nullable = true)
 #  |-- name: string (nullable = true)
 
 # Register this DataFrame as a table.
@@ -1687,12 +1656,7 @@ on all of the worker nodes, as they will need access to the Hive serialization a
 (SerDes) in order to access data stored in Hive.
 
 Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` (for security configuration),
- `hdfs-site.xml` (for HDFS configuration) file in `conf/`. Please note when running
-the query on a YARN cluster (`cluster` mode), the `datanucleus` jars under the `lib` directory
-and `hive-site.xml` under `conf/` directory need to be available on the driver and all executors launched by the
-YARN cluster. The convenient way to do this is adding them through the `--jars` option and `--file` option of the
-`spark-submit` command.
-
+`hdfs-site.xml` (for HDFS configuration) file in `conf/`.
 
 <div class="codetabs">
@@ -2170,8 +2134,6 @@ options.
 - In the `sql` dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains unchanged.
 - The canonical name of SQL/DataFrame functions are now lower case (e.g. sum vs SUM).
-- It has been determined that using the DirectOutputCommitter when speculation is enabled is unsafe
-  and thus this output committer will not be used when speculation is on, independent of configuration.
 - JSON data source will not automatically load new files that are created by other applications
   (i.e. files that are not inserted to the dataset through Spark SQL).
   For a JSON persistent table (i.e. the metadata of the table is stored in Hive Metastore),
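
For reference, the schema-inference behavior described by the corrected doc comments (integral JSON numbers are inferred as long, not integer) can be reproduced with a short snippet like the sketch below. It follows the Scala example already shown in the hunks above and assumes a running SparkContext `sc` and the `people.json` example file shipped with Spark; it is an illustration against the 1.x `SQLContext` API, not part of this change.

    // Sketch: reproduce the inferred JSON schema shown in the updated docs.
    // Assumes an existing SparkContext `sc` and Spark's bundled example file.
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val people = sqlContext.read.json("examples/src/main/resources/people.json")

    // Integral JSON numbers are inferred as LongType, hence `age: long`.
    people.printSchema()
    // root
    //  |-- age: long (nullable = true)
    //  |-- name: string (nullable = true)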