[SPARK-13509][SPARK-13507][SQL] Support for writing CSV with a single function call

https://issues.apache.org/jira/browse/SPARK-13507 https://issues.apache.org/jira/browse/SPARK-13509 ## What changes were proposed in this pull request? This PR adds the support to write CSV data directly by a single call to the given path. Several unitests were added for each functionality. ## How was this patch tested? This was tested with unittests and with `dev/run_tests` for coding style Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes #11389 from HyukjinKwon/SPARK-13507-13509.
author: hyukjinkwon <gurwls223@gmail.com> 2016-02-29 09:44:29 -0800
committer: Reynold Xin <rxin@databricks.com> 2016-02-29 09:44:29 -0800
commit: 02aa499dfb71bc9571bebb79e6383842e4f48143 (patch)
tree: ae7b44767db0fb5b7c19a03c0de05aa4909c4b8e /python/pyspark
parent: 916fc34f98dd731f607d9b3ed657bad6cc30df2c (diff)
download: spark-02aa499dfb71bc9571bebb79e6383842e4f48143.tar.gz
spark-02aa499dfb71bc9571bebb79e6383842e4f48143.tar.bz2
spark-02aa499dfb71bc9571bebb79e6383842e4f48143.zip
1 files changed, 50 insertions, 0 deletions
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index b1453c637f..7f5368d8bd 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -233,6 +233,23 @@ class DataFrameReader(object):
             paths = [paths]
         return self._df(self._jreader.text(self._sqlContext._sc._jvm.PythonUtils.toSeq(paths)))
 
+    @since(2.0)
+    def csv(self, paths):
+        """Loads a CSV file and returns the result as a [[DataFrame]].
+
+        This function goes through the input once to determine the input schema. To avoid going
+        through the entire data once, specify the schema explicitly using [[schema]].
+
+        :param paths: string, or list of strings, for input path(s).
+
+        >>> df = sqlContext.read.csv('python/test_support/sql/ages.csv')
+        >>> df.dtypes
+        [('C0', 'string'), ('C1', 'string')]
+        """
+        if isinstance(paths, basestring):
+            paths = [paths]
+        return self._df(self._jreader.csv(self._sqlContext._sc._jvm.PythonUtils.toSeq(paths)))
+
     @since(1.5)
     def orc(self, path):
         """Loads an ORC file, returning the result as a :class:`DataFrame`.
@@ -448,6 +465,11 @@ class DataFrameWriter(object):
             * ``ignore``: Silently ignore this operation if data already exists.
             * ``error`` (default case): Throw an exception if data already exists.
 
+        You can set the following JSON-specific option(s) for writing JSON files:
+            * ``compression`` (default ``None``): compression codec to use when saving to file.
+            This can be one of the known case-insensitive shorten names
+            (``bzip2``, ``gzip``, ``lz4``, and ``snappy``).
+
         >>> df.write.json(os.path.join(tempfile.mkdtemp(), 'data'))
         """
         self.mode(mode)._jwrite.json(path)
@@ -476,11 +498,39 @@ class DataFrameWriter(object):
     def text(self, path):
         """Saves the content of the DataFrame in a text file at the specified path.
 
+        :param path: the path in any Hadoop supported file system
+
         The DataFrame must have only one column that is of string type.
         Each row becomes a new line in the output file.
+
+        You can set the following option(s) for writing text files:
+            * ``compression`` (default ``None``): compression codec to use when saving to file.
+            This can be one of the known case-insensitive shorten names
+            (``bzip2``, ``gzip``, ``lz4``, and ``snappy``).
         """
         self._jwrite.text(path)
 
+    @since(2.0)
+    def csv(self, path, mode=None):
+        """Saves the content of the [[DataFrame]] in CSV format at the specified path.
+
+        :param path: the path in any Hadoop supported file system
+        :param mode: specifies the behavior of the save operation when data already exists.
+
+            * ``append``: Append contents of this :class:`DataFrame` to existing data.
+            * ``overwrite``: Overwrite existing data.
+            * ``ignore``: Silently ignore this operation if data already exists.
+            * ``error`` (default case): Throw an exception if data already exists.
+
+        You can set the following CSV-specific option(s) for writing CSV files:
+            * ``compression`` (default ``None``): compression codec to use when saving to file.
+            This can be one of the known case-insensitive shorten names
+            (``bzip2``, ``gzip``, ``lz4``, and ``snappy``).
+
+        >>> df.write.csv(os.path.join(tempfile.mkdtemp(), 'data'))
+        """
+        self.mode(mode)._jwrite.csv(path)
+
     @since(1.5)
     def orc(self, path, mode=None, partitionBy=None):
         """Saves the content of the :class:`DataFrame` in ORC format at the specified path.
author	hyukjinkwon <gurwls223@gmail.com>	2016-02-29 09:44:29 -0800
committer	Reynold Xin <rxin@databricks.com>	2016-02-29 09:44:29 -0800
commit	02aa499dfb71bc9571bebb79e6383842e4f48143 (patch)
tree	ae7b44767db0fb5b7c19a03c0de05aa4909c4b8e /python/pyspark
parent	916fc34f98dd731f607d9b3ed657bad6cc30df2c (diff)
download	spark-02aa499dfb71bc9571bebb79e6383842e4f48143.tar.gz spark-02aa499dfb71bc9571bebb79e6383842e4f48143.tar.bz2 spark-02aa499dfb71bc9571bebb79e6383842e4f48143.zip