<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>pyspark.sql module — PySpark master documentation</title>
<link rel="stylesheet" href="_static/nature.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: './',
VERSION: 'master',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
};
</script>
<script type="text/javascript" src="_static/jquery.js"></script>
<script type="text/javascript" src="_static/underscore.js"></script>
<script type="text/javascript" src="_static/doctools.js"></script>
<link rel="top" title="PySpark master documentation" href="index.html" />
<link rel="next" title="pyspark.streaming module" href="pyspark.streaming.html" />
<link rel="prev" title="pyspark.mllib package" href="pyspark.mllib.html" />
</head>
<body role="document">
<div class="related" role="navigation" aria-label="related navigation">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="pyspark.streaming.html" title="pyspark.streaming module"
accesskey="N">next</a></li>
<li class="right" >
<a href="pyspark.mllib.html" title="pyspark.mllib package"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">PySpark master documentation</a> »</li>
</ul>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<div class="section" id="pyspark-sql-module">
<h1>pyspark.sql module<a class="headerlink" href="#pyspark-sql-module" title="Permalink to this headline">¶</a></h1>
<div class="section" id="module-pyspark.sql">
<span id="module-context"></span><h2>Module Context<a class="headerlink" href="#module-pyspark.sql" title="Permalink to this headline">¶</a></h2>
<p>Important classes of Spark SQL and DataFrames:</p>
<blockquote>
<div><ul class="simple">
<li><a class="reference internal" href="#pyspark.sql.SQLContext" title="pyspark.sql.SQLContext"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.SQLContext</span></code></a>
Main entry point for <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> and SQL functionality.</li>
<li><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.DataFrame</span></code></a>
A distributed collection of data grouped into named columns.</li>
<li><a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.Column</span></code></a>
A column expression in a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</li>
<li><a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.Row</span></code></a>
A row of data in a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</li>
<li><a class="reference internal" href="#pyspark.sql.HiveContext" title="pyspark.sql.HiveContext"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.HiveContext</span></code></a>
Main entry point for accessing data stored in Apache Hive.</li>
<li><a class="reference internal" href="#pyspark.sql.GroupedData" title="pyspark.sql.GroupedData"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.GroupedData</span></code></a>
Aggregation methods, returned by <a class="reference internal" href="#pyspark.sql.DataFrame.groupBy" title="pyspark.sql.DataFrame.groupBy"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.groupBy()</span></code></a>.</li>
<li><a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions" title="pyspark.sql.DataFrameNaFunctions"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.DataFrameNaFunctions</span></code></a>
Methods for handling missing data (null values).</li>
<li><a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions" title="pyspark.sql.DataFrameStatFunctions"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.DataFrameStatFunctions</span></code></a>
Methods for statistics functionality.</li>
<li><a class="reference internal" href="#module-pyspark.sql.functions" title="pyspark.sql.functions"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.functions</span></code></a>
List of built-in functions available for <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</li>
<li><a class="reference internal" href="#module-pyspark.sql.types" title="pyspark.sql.types"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.types</span></code></a>
List of data types available.</li>
<li><a class="reference internal" href="#pyspark.sql.Window" title="pyspark.sql.Window"><code class="xref py py-class docutils literal"><span class="pre">pyspark.sql.Window</span></code></a>
For working with window functions.</li>
</ul>
</div></blockquote>
<dl class="class">
<dt id="pyspark.sql.SQLContext">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">SQLContext</code><span class="sig-paren">(</span><em>sparkContext</em>, <em>sqlContext=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext" title="Permalink to this definition">¶</a></dt>
<dd><p>Main entry point for Spark SQL functionality.</p>
<p>A SQLContext can be used to create <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> objects, register a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as
a table, execute SQL over tables, cache tables, and read Parquet files.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>sparkContext</strong> – The <code class="xref py py-class docutils literal"><span class="pre">SparkContext</span></code> backing this SQLContext.</li>
<li><strong>sqlContext</strong> – An optional JVM Scala SQLContext. If set, a new SQLContext is not
instantiated in the JVM; instead, all calls are delegated to this object.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<dl class="method">
<dt id="pyspark.sql.SQLContext.applySchema">
<code class="descname">applySchema</code><span class="sig-paren">(</span><em>rdd</em>, <em>schema</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.applySchema"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.applySchema" title="Permalink to this definition">¶</a></dt>
<dd><div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 1.3, use <a class="reference internal" href="#pyspark.sql.SQLContext.createDataFrame" title="pyspark.sql.SQLContext.createDataFrame"><code class="xref py py-func docutils literal"><span class="pre">createDataFrame()</span></code></a> instead.</p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.cacheTable">
<code class="descname">cacheTable</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.cacheTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.cacheTable" title="Permalink to this definition">¶</a></dt>
<dd><p>Caches the specified table in memory.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.clearCache">
<code class="descname">clearCache</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.clearCache"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.clearCache" title="Permalink to this definition">¶</a></dt>
<dd><p>Removes all cached tables from the in-memory cache.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.createDataFrame">
<code class="descname">createDataFrame</code><span class="sig-paren">(</span><em>data</em>, <em>schema=None</em>, <em>samplingRatio=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.createDataFrame"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.createDataFrame" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> from an <code class="xref py py-class docutils literal"><span class="pre">RDD</span></code> of <code class="xref py py-class docutils literal"><span class="pre">tuple</span></code>/<code class="xref py py-class docutils literal"><span class="pre">list</span></code>,
a <code class="xref py py-class docutils literal"><span class="pre">list</span></code>, or a <code class="xref py py-class docutils literal"><span class="pre">pandas.DataFrame</span></code>.</p>
<p>When <code class="docutils literal"><span class="pre">schema</span></code> is a list of column names, the type of each column
will be inferred from <code class="docutils literal"><span class="pre">data</span></code>.</p>
<p>When <code class="docutils literal"><span class="pre">schema</span></code> is <code class="docutils literal"><span class="pre">None</span></code>, it will try to infer the schema (column names and types)
from <code class="docutils literal"><span class="pre">data</span></code>, which should be an RDD of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>,
or <code class="xref py py-class docutils literal"><span class="pre">namedtuple</span></code>, or <code class="xref py py-class docutils literal"><span class="pre">dict</span></code>.</p>
<p>If schema inference is needed, <code class="docutils literal"><span class="pre">samplingRatio</span></code> is used to determine the ratio of
rows used for schema inference. Only the first row will be used if <code class="docutils literal"><span class="pre">samplingRatio</span></code> is <code class="docutils literal"><span class="pre">None</span></code>.</p>
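The inference step can be pictured with a small pure-Python sketch. This is only an illustration of the idea, not PySpark's actual implementation: PySpark samples rows randomly and merges Spark SQL types, whereas the sketch below takes a prefix of the rows and uses a fixed Python-to-SQL type mapping.

```python
def infer_types(rows, sampling_ratio=None):
    """Illustrative sketch of sampling-based schema inference.

    rows: a list of dicts (column name -> value).
    sampling_ratio: fraction of rows to examine; None means first row only.
    """
    type_names = {int: "bigint", str: "string", float: "double"}
    if sampling_ratio is None:
        sample = rows[:1]  # only the first row is examined
    else:
        # Take a prefix for simplicity; PySpark samples randomly.
        sample = rows[:max(1, int(len(rows) * sampling_ratio))]
    inferred = {}
    for row in sample:
        for key, value in row.items():
            # First type seen for a column wins in this simplified sketch.
            inferred.setdefault(key, type_names[type(value)])
    return inferred

print(infer_types([{"name": "Alice", "age": 1}]))
# {'name': 'string', 'age': 'bigint'}
```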
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>data</strong> – an RDD of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>/<code class="xref py py-class docutils literal"><span class="pre">tuple</span></code>/<code class="xref py py-class docutils literal"><span class="pre">list</span></code>/<code class="xref py py-class docutils literal"><span class="pre">dict</span></code>,
<code class="xref py py-class docutils literal"><span class="pre">list</span></code>, or <code class="xref py py-class docutils literal"><span class="pre">pandas.DataFrame</span></code>.</li>
<li><strong>schema</strong> – a <code class="xref py py-class docutils literal"><span class="pre">StructType</span></code> or a list of column names (default: <code class="docutils literal"><span class="pre">None</span></code>).</li>
<li><strong>samplingRatio</strong> – the sampling ratio of rows used for inferring the schema</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></p>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">l</span> <span class="o">=</span> <span class="p">[(</span><span class="s">'Alice'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span>
<span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(_1=u'Alice', _2=1)]</span>
<span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="p">[</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'age'</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', age=1)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">d</span> <span class="o">=</span> <span class="p">[{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Alice'</span><span class="p">,</span> <span class="s">'age'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}]</span>
<span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">d</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=1, name=u'Alice')]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">rdd</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(_1=u'Alice', _2=1)]</span>
<span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="p">[</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'age'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', age=1)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">Row</span>
<span class="gp">>>> </span><span class="n">Person</span> <span class="o">=</span> <span class="n">Row</span><span class="p">(</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'age'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">person</span> <span class="o">=</span> <span class="n">rdd</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">r</span><span class="p">:</span> <span class="n">Person</span><span class="p">(</span><span class="o">*</span><span class="n">r</span><span class="p">))</span>
<span class="gp">>>> </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">person</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', age=1)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="kn">import</span> <span class="o">*</span>
<span class="gp">>>> </span><span class="n">schema</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span>
<span class="gp">... </span> <span class="n">StructField</span><span class="p">(</span><span class="s">"name"</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">),</span>
<span class="gp">... </span> <span class="n">StructField</span><span class="p">(</span><span class="s">"age"</span><span class="p">,</span> <span class="n">IntegerType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">)])</span>
<span class="gp">>>> </span><span class="n">df3</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">rdd</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df3</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', age=1)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">toPandas</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', age=1)]</span>
<span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">pandas</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]]))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(0=1, 1=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.createExternalTable">
<code class="descname">createExternalTable</code><span class="sig-paren">(</span><em>tableName</em>, <em>path=None</em>, <em>source=None</em>, <em>schema=None</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.createExternalTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.createExternalTable" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates an external table based on the dataset in a data source.</p>
<p>It returns the DataFrame associated with the external table.</p>
<p>The data source is specified by the <code class="docutils literal"><span class="pre">source</span></code> and a set of <code class="docutils literal"><span class="pre">options</span></code>.
If <code class="docutils literal"><span class="pre">source</span></code> is not specified, the default data source configured by
<code class="docutils literal"><span class="pre">spark.sql.sources.default</span></code> will be used.</p>
<p>Optionally, a schema can be provided as the schema of the returned <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> and
created external table.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.getConf">
<code class="descname">getConf</code><span class="sig-paren">(</span><em>key</em>, <em>defaultValue</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.getConf"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.getConf" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the value of the Spark SQL configuration property for the given key.</p>
<p>If the key is not set, returns <code class="docutils literal"><span class="pre">defaultValue</span></code>.</p>
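The lookup semantics are those of a plain dict lookup with a fallback. The snippet below is an analogy, not the implementation, and the key/value strings are only examples:

```python
# Analogy for getConf: a set key returns its stored value,
# an unset key falls back to the supplied defaultValue.
conf = {"spark.sql.shuffle.partitions": "200"}

print(conf.get("spark.sql.shuffle.partitions", "10"))   # key set   -> "200"
print(conf.get("spark.sql.sources.default", "parquet")) # key unset -> "parquet"
```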
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.inferSchema">
<code class="descname">inferSchema</code><span class="sig-paren">(</span><em>rdd</em>, <em>samplingRatio=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.inferSchema"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.inferSchema" title="Permalink to this definition">¶</a></dt>
<dd><div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 1.3, use <a class="reference internal" href="#pyspark.sql.SQLContext.createDataFrame" title="pyspark.sql.SQLContext.createDataFrame"><code class="xref py py-func docutils literal"><span class="pre">createDataFrame()</span></code></a> instead.</p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.jsonFile">
<code class="descname">jsonFile</code><span class="sig-paren">(</span><em>path</em>, <em>schema=None</em>, <em>samplingRatio=1.0</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.jsonFile"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.jsonFile" title="Permalink to this definition">¶</a></dt>
<dd><p>Loads a text file storing one JSON object per line as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 1.4, use <a class="reference internal" href="#pyspark.sql.DataFrameReader.json" title="pyspark.sql.DataFrameReader.json"><code class="xref py py-func docutils literal"><span class="pre">DataFrameReader.json()</span></code></a> instead.</p>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">jsonFile</span><span class="p">(</span><span class="s">'python/test_support/sql/people.json'</span><span class="p">)</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[('age', 'bigint'), ('name', 'string')]</span>
</pre></div>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.jsonRDD">
<code class="descname">jsonRDD</code><span class="sig-paren">(</span><em>rdd</em>, <em>schema=None</em>, <em>samplingRatio=1.0</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.jsonRDD"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.jsonRDD" title="Permalink to this definition">¶</a></dt>
<dd><p>Loads an RDD storing one JSON object per string as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p>If the schema is provided, applies the given schema to this JSON dataset.
Otherwise, it samples the dataset with ratio <code class="docutils literal"><span class="pre">samplingRatio</span></code> to determine the schema.</p>
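The expected input format — one complete JSON object per string element of the RDD — can be illustrated with the standard library. This shows only the shape of the RDD's elements (the field names mirror the doctest below); it is not how jsonRDD itself parses them:

```python
import json

# Each element of the input RDD is a string holding one JSON object.
elements = [
    '{"field1": 1, "field2": "row1", "field3": {"field4": 11}}',
    '{"field1": 2, "field2": "row2", "field3": {"field4": 22}}',
]
records = [json.loads(e) for e in elements]
print(records[0]["field3"]["field4"])
# 11
```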
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df1</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">jsonRDD</span><span class="p">(</span><span class="n">json</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df1</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="go">Row(field1=1, field2=u'row1', field3=Row(field4=11, field5=None), field6=None)</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">jsonRDD</span><span class="p">(</span><span class="n">json</span><span class="p">,</span> <span class="n">df1</span><span class="o">.</span><span class="n">schema</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df2</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="go">Row(field1=1, field2=u'row1', field3=Row(field4=11, field5=None), field6=None)</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="kn">import</span> <span class="o">*</span>
<span class="gp">>>> </span><span class="n">schema</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span>
<span class="gp">... </span> <span class="n">StructField</span><span class="p">(</span><span class="s">"field2"</span><span class="p">,</span> <span class="n">StringType</span><span class="p">()),</span>
<span class="gp">... </span> <span class="n">StructField</span><span class="p">(</span><span class="s">"field3"</span><span class="p">,</span>
<span class="gp">... </span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s">"field5"</span><span class="p">,</span> <span class="n">ArrayType</span><span class="p">(</span><span class="n">IntegerType</span><span class="p">()))]))</span>
<span class="gp">... </span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df3</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">jsonRDD</span><span class="p">(</span><span class="n">json</span><span class="p">,</span> <span class="n">schema</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df3</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="go">Row(field2=u'row1', field3=Row(field5=None))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.load">
<code class="descname">load</code><span class="sig-paren">(</span><em>path=None</em>, <em>source=None</em>, <em>schema=None</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.load"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.load" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the dataset in a data source as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 1.4, use <a class="reference internal" href="#pyspark.sql.DataFrameReader.load" title="pyspark.sql.DataFrameReader.load"><code class="xref py py-func docutils literal"><span class="pre">DataFrameReader.load()</span></code></a> instead.</p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.parquetFile">
<code class="descname">parquetFile</code><span class="sig-paren">(</span><em>*paths</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.parquetFile"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.parquetFile" title="Permalink to this definition">¶</a></dt>
<dd><p>Loads a Parquet file, returning the result as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 1.4, use <a class="reference internal" href="#pyspark.sql.DataFrameReader.parquet" title="pyspark.sql.DataFrameReader.parquet"><code class="xref py py-func docutils literal"><span class="pre">DataFrameReader.parquet()</span></code></a> instead.</p>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">parquetFile</span><span class="p">(</span><span class="s">'python/test_support/sql/parquet_partitioned'</span><span class="p">)</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]</span>
</pre></div>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.range">
<code class="descname">range</code><span class="sig-paren">(</span><em>start</em>, <em>end=None</em>, <em>step=1</em>, <em>numPartitions=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.range"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.range" title="Permalink to this definition">¶</a></dt>
<dd><p>Create a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with a single LongType column named <cite>id</cite>,
containing elements in a range from <cite>start</cite> to <cite>end</cite> (exclusive) with
step value <cite>step</cite>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>start</strong> – the start value</li>
<li><strong>end</strong> – the end value (exclusive)</li>
<li><strong>step</strong> – the incremental step (default: 1)</li>
<li><strong>numPartitions</strong> – the number of partitions of the DataFrame</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></p>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(id=1), Row(id=3), Row(id=5)]</span>
</pre></div>
</div>
<p>If only one argument is specified, it is used as the end value and <cite>start</cite> defaults to 0.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(id=0), Row(id=1), Row(id=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SQLContext.read">
<code class="descname">read</code><a class="headerlink" href="#pyspark.sql.SQLContext.read" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrameReader" title="pyspark.sql.DataFrameReader"><code class="xref py py-class docutils literal"><span class="pre">DataFrameReader</span></code></a> that can be used to read data
in as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrameReader" title="pyspark.sql.DataFrameReader"><code class="xref py py-class docutils literal"><span class="pre">DataFrameReader</span></code></a></td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.registerDataFrameAsTable">
<code class="descname">registerDataFrameAsTable</code><span class="sig-paren">(</span><em>df</em>, <em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.registerDataFrameAsTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.registerDataFrameAsTable" title="Permalink to this definition">¶</a></dt>
<dd><p>Registers the given <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as a temporary table in the catalog.</p>
<p>Temporary tables exist only during the lifetime of this instance of <a class="reference internal" href="#pyspark.sql.SQLContext" title="pyspark.sql.SQLContext"><code class="xref py py-class docutils literal"><span class="pre">SQLContext</span></code></a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerDataFrameAsTable</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s">"table1"</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.registerFunction">
<code class="descname">registerFunction</code><span class="sig-paren">(</span><em>name</em>, <em>f</em>, <em>returnType=StringType</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.registerFunction"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.registerFunction" title="Permalink to this definition">¶</a></dt>
<dd><p>Registers a Python function (including lambdas) as a UDF so it can be used in SQL statements.</p>
<p>In addition to a name and the function itself, the return type can be optionally specified.
When the return type is not given, it defaults to a string and conversion is done automatically.
For any other return type, the produced object must match the specified type.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>name</strong> – name of the UDF</li>
<li><strong>f</strong> – a Python function (often a lambda)</li>
<li><strong>returnType</strong> – a <code class="xref py py-class docutils literal"><span class="pre">DataType</span></code> object</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerFunction</span><span class="p">(</span><span class="s">"stringLengthString"</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT stringLengthString('test')"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(_c0=u'4')]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="kn">import</span> <span class="n">IntegerType</span>
<span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerFunction</span><span class="p">(</span><span class="s">"stringLengthInt"</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">IntegerType</span><span class="p">())</span>
<span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT stringLengthInt('test')"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(_c0=4)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="kn">import</span> <span class="n">IntegerType</span>
<span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">udf</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="s">"stringLengthInt"</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">IntegerType</span><span class="p">())</span>
<span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT stringLengthInt('test')"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(_c0=4)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.2.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.setConf">
<code class="descname">setConf</code><span class="sig-paren">(</span><em>key</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.setConf"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.setConf" title="Permalink to this definition">¶</a></dt>
<dd><p>Sets the given Spark SQL configuration property.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.sql">
<code class="descname">sql</code><span class="sig-paren">(</span><em>sqlQuery</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.sql"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.sql" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> representing the result of the given query.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerDataFrameAsTable</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s">"table1"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT field1 AS f1, field2 as f2 from table1"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.table">
<code class="descname">table</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.table"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.table" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the specified table as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerDataFrameAsTable</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s">"table1"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">table</span><span class="p">(</span><span class="s">"table1"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.0.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.tableNames">
<code class="descname">tableNames</code><span class="sig-paren">(</span><em>dbName=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.tableNames"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.tableNames" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a list of names of tables in the database <code class="docutils literal"><span class="pre">dbName</span></code>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>dbName</strong> – string, name of the database to use. Defaults to the current database.</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body">list of table names, as strings</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerDataFrameAsTable</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s">"table1"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="s">"table1"</span> <span class="ow">in</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">tableNames</span><span class="p">()</span>
<span class="go">True</span>
<span class="gp">>>> </span><span class="s">"table1"</span> <span class="ow">in</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">tableNames</span><span class="p">(</span><span class="s">"db"</span><span class="p">)</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.tables">
<code class="descname">tables</code><span class="sig-paren">(</span><em>dbName=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.tables"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.tables" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> containing names of tables in the given database.</p>
<p>If <code class="docutils literal"><span class="pre">dbName</span></code> is not specified, the current database will be used.</p>
<p>The returned DataFrame has two columns: <code class="docutils literal"><span class="pre">tableName</span></code> and <code class="docutils literal"><span class="pre">isTemporary</span></code>
(a column with <code class="xref py py-class docutils literal"><span class="pre">BooleanType</span></code> indicating if a table is a temporary one or not).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>dbName</strong> – string, name of the database to use.</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a></td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">registerDataFrameAsTable</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s">"table1"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">tables</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">df2</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="s">"tableName = 'table1'"</span><span class="p">)</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="go">Row(tableName=u'table1', isTemporary=True)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.SQLContext.udf">
<code class="descname">udf</code><a class="headerlink" href="#pyspark.sql.SQLContext.udf" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a <code class="xref py py-class docutils literal"><span class="pre">UDFRegistration</span></code> for registering user-defined functions (UDFs).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><code class="xref py py-class docutils literal"><span class="pre">UDFRegistration</span></code></td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.SQLContext.uncacheTable">
<code class="descname">uncacheTable</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#SQLContext.uncacheTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.SQLContext.uncacheTable" title="Permalink to this definition">¶</a></dt>
<dd><p>Removes the specified table from the in-memory cache.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.0.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.HiveContext">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">HiveContext</code><span class="sig-paren">(</span><em>sparkContext</em>, <em>hiveContext=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#HiveContext"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.HiveContext" title="Permalink to this definition">¶</a></dt>
<dd><p>A variant of Spark SQL that integrates with data stored in Hive.</p>
<p>Configuration for Hive is read from <code class="docutils literal"><span class="pre">hive-site.xml</span></code> on the classpath.
It supports running both SQL and HiveQL commands.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>sparkContext</strong> – The SparkContext to wrap.</li>
<li><strong>hiveContext</strong> – An optional JVM Scala HiveContext. If set, we do not instantiate a new
<a class="reference internal" href="#pyspark.sql.HiveContext" title="pyspark.sql.HiveContext"><code class="xref py py-class docutils literal"><span class="pre">HiveContext</span></code></a> in the JVM; instead, we make all calls to this object.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<dl class="method">
<dt id="pyspark.sql.HiveContext.refreshTable">
<code class="descname">refreshTable</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/context.html#HiveContext.refreshTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.HiveContext.refreshTable" title="Permalink to this definition">¶</a></dt>
<dd><p>Invalidates and refreshes all the cached metadata of the given
table. For performance reasons, Spark SQL or the external data source
library it uses might cache certain metadata about a table, such as the
location of blocks. When that metadata changes outside of Spark SQL, users should
call this function to invalidate the cache.</p>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.DataFrame">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">DataFrame</code><span class="sig-paren">(</span><em>jdf</em>, <em>sql_ctx</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame" title="Permalink to this definition">¶</a></dt>
<dd><p>A distributed collection of data grouped into named columns.</p>
<p>A <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> is equivalent to a relational table in Spark SQL,
and can be created using various functions in <a class="reference internal" href="#pyspark.sql.SQLContext" title="pyspark.sql.SQLContext"><code class="xref py py-class docutils literal"><span class="pre">SQLContext</span></code></a>:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">people</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s">"..."</span><span class="p">)</span>
</pre></div>
</div>
<p>Once created, it can be manipulated using the various domain-specific language
(DSL) functions defined in <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> and <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a>.</p>
<p>To select a column from the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>, use the apply method:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">ageCol</span> <span class="o">=</span> <span class="n">people</span><span class="o">.</span><span class="n">age</span>
</pre></div>
</div>
<p>A more concrete example:</p>
<div class="highlight-python"><div class="highlight"><pre># To create DataFrames using SQLContext
people = sqlContext.read.parquet("...")
department = sqlContext.read.parquet("...")
people.filter(people.age > 30).join(department, people.deptId == department.id) \
  .groupBy(department.name, "gender").agg({"salary": "avg", "age": "max"})
</pre></div>
</div>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.DataFrame.agg">
<code class="descname">agg</code><span class="sig-paren">(</span><em>*exprs</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.agg"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.agg" title="Permalink to this definition">¶</a></dt>
<dd><p>Aggregate on the entire <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> without groups
(shorthand for <code class="docutils literal"><span class="pre">df.groupBy.agg()</span></code>).</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s">"age"</span><span class="p">:</span> <span class="s">"max"</span><span class="p">})</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(max(age)=5)]</span>
<span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(min(age)=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.alias">
<code class="descname">alias</code><span class="sig-paren">(</span><em>alias</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.alias"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.alias" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with an alias set.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="o">*</span>
<span class="gp">>>> </span><span class="n">df_as1</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"df_as1"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df_as2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"df_as2"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">joined_df</span> <span class="o">=</span> <span class="n">df_as1</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df_as2</span><span class="p">,</span> <span class="n">col</span><span class="p">(</span><span class="s">"df_as1.name"</span><span class="p">)</span> <span class="o">==</span> <span class="n">col</span><span class="p">(</span><span class="s">"df_as2.name"</span><span class="p">),</span> <span class="s">'inner'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">joined_df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"df_as1.name"</span><span class="p">),</span> <span class="n">col</span><span class="p">(</span><span class="s">"df_as2.name"</span><span class="p">),</span> <span class="n">col</span><span class="p">(</span><span class="s">"df_as2.age"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', name=u'Alice', age=2), Row(name=u'Bob', name=u'Bob', age=5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.cache">
<code class="descname">cache</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.cache"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.cache" title="Permalink to this definition">¶</a></dt>
<dd><p>Persists the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with the default storage level (<code class="xref py py-class docutils literal"><span class="pre">MEMORY_ONLY_SER</span></code>).</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.coalesce">
<code class="descname">coalesce</code><span class="sig-paren">(</span><em>numPartitions</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.coalesce"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.coalesce" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> that has exactly <cite>numPartitions</cite> partitions.</p>
<p>Similar to <code class="docutils literal"><span class="pre">coalesce</span></code> defined on an <code class="xref py py-class docutils literal"><span class="pre">RDD</span></code>, this operation results in a
narrow dependency: for example, if you go from 1000 partitions to 100 partitions,
there will not be a shuffle; instead, each of the 100 new partitions will
claim 10 of the current partitions.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">coalesce</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">getNumPartitions</span><span class="p">()</span>
<span class="go">1</span>
</pre></div>
</div>
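<p>The narrow dependency described above can be sketched in plain Python, independent of Spark: each new partition simply claims a fixed group of existing partitions, so no data moves between groups. (This is an illustration of the semantics with made-up data, not Spark's implementation.)</p>

```python
# Going from 6 partitions to 2: each new partition claims 3 old ones,
# with no shuffle of individual records between groups.
old_partitions = [[1], [2], [3], [4], [5], [6]]
num_new = 2
group_size = len(old_partitions) // num_new

new_partitions = [
    # concatenate one contiguous group of old partitions per new partition
    sum(old_partitions[i * group_size:(i + 1) * group_size], [])
    for i in range(num_new)
]
print(new_partitions)  # [[1, 2, 3], [4, 5, 6]]
```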
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.collect">
<code class="descname">collect</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.collect"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.collect" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns all the records as a list of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.columns">
<code class="descname">columns</code><a class="headerlink" href="#pyspark.sql.DataFrame.columns" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns all column names as a list.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">columns</span>
<span class="go">['age', 'name']</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.corr">
<code class="descname">corr</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em>, <em>method=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.corr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.corr" title="Permalink to this definition">¶</a></dt>
<dd><p>Calculates the correlation of two columns of a DataFrame as a double value.
Currently only supports the Pearson Correlation Coefficient.
<a class="reference internal" href="#pyspark.sql.DataFrame.corr" title="pyspark.sql.DataFrame.corr"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.corr()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.corr" title="pyspark.sql.DataFrameStatFunctions.corr"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.corr()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column</li>
<li><strong>col2</strong> – The name of the second column</li>
<li><strong>method</strong> – The correlation method. Currently only supports “pearson”</li>
</ul>
</td>
</tr>
</tbody>
</table>
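<p>As a point of reference for what the method computes, here is a plain-Python sketch of the Pearson correlation coefficient over two hypothetical value sequences. It illustrates the formula only, not Spark's distributed implementation.</p>

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance term and the two standard-deviation terms
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# perfectly linear data gives a coefficient of (approximately) 1.0
print(pearson_corr([1, 2, 3, 4], [2, 4, 6, 8]))
```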
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.count">
<code class="descname">count</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.count"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.count" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the number of rows in this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">2</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.cov">
<code class="descname">cov</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.cov"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.cov" title="Permalink to this definition">¶</a></dt>
<dd><p>Calculates the sample covariance for the given columns, specified by their names, as a
double value. <a class="reference internal" href="#pyspark.sql.DataFrame.cov" title="pyspark.sql.DataFrame.cov"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.cov()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.cov" title="pyspark.sql.DataFrameStatFunctions.cov"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.cov()</span></code></a> are aliases.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column</li>
<li><strong>col2</strong> – The name of the second column</li>
</ul>
</td>
</tr>
</tbody>
</table>
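<p>For reference, the sample covariance (normalized by <cite>n - 1</cite>) can be sketched in plain Python on hypothetical data; this shows the statistic the method returns, not Spark's implementation.</p>

```python
def sample_cov(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

print(sample_cov([1, 2, 3, 4], [2, 4, 6, 8]))  # 10/3, about 3.3333
```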
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.crosstab">
<code class="descname">crosstab</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.crosstab"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.crosstab" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes a pair-wise frequency table of the given columns. Also known as a contingency
table. The number of distinct values for each column should be less than 1e4. At most 1e6
non-zero pair frequencies will be returned.
The first column of each row will be the distinct values of <cite>col1</cite> and the column names
will be the distinct values of <cite>col2</cite>. The name of the first column will be <cite>$col1_$col2</cite>.
Pairs that have no occurrences will have zero as their counts.
<a class="reference internal" href="#pyspark.sql.DataFrame.crosstab" title="pyspark.sql.DataFrame.crosstab"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.crosstab()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.crosstab" title="pyspark.sql.DataFrameStatFunctions.crosstab"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.crosstab()</span></code></a> are aliases.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column. Distinct items will make the first item of
each row.</li>
<li><strong>col2</strong> – The name of the second column. Distinct items will make the column names
of the DataFrame.</li>
</ul>
</td>
</tr>
</tbody>
</table>
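<p>The contingency-table semantics can be sketched in plain Python with hypothetical rows: distinct <cite>col1</cite> values become row labels, distinct <cite>col2</cite> values become column names, and pairs that never occur get a count of zero. (This illustrates the semantics only, not Spark's implementation.)</p>

```python
from collections import Counter

# hypothetical (col1, col2) value pairs
rows = [("Alice", 2), ("Bob", 5), ("Bob", 5), ("Alice", 3)]

pair_counts = Counter(rows)
col1_values = sorted({r[0] for r in rows})  # distinct col1 values -> row labels
col2_values = sorted({r[1] for r in rows})  # distinct col2 values -> column names

# absent pairs default to zero, as in crosstab()
table = {
    v1: {v2: pair_counts.get((v1, v2), 0) for v2 in col2_values}
    for v1 in col1_values
}
print(table)
# {'Alice': {2: 1, 3: 1, 5: 0}, 'Bob': {2: 0, 3: 0, 5: 2}}
```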
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.cube">
<code class="descname">cube</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.cube"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.cube" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates a multi-dimensional cube for the current <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> using
the specified columns, so that aggregations can be run on them.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">cube</span><span class="p">(</span><span class="s">'name'</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+----+-----+</span>
<span class="go">| name| age|count|</span>
<span class="go">+-----+----+-----+</span>
<span class="go">| null| 2| 1|</span>
<span class="go">|Alice|null| 1|</span>
<span class="go">| Bob| 5| 1|</span>
<span class="go">| Bob|null| 1|</span>
<span class="go">| null| 5| 1|</span>
<span class="go">| null|null| 2|</span>
<span class="go">|Alice| 2| 1|</span>
<span class="go">+-----+----+-----+</span>
</pre></div>
</div>
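<p>The <cite>null</cite> rows in the output above come from aggregating over every subset of the grouping columns: a dimension left out of a subset is rolled up and shown as null. A plain-Python sketch of that expansion over the same two example rows (an illustration of the semantics, not Spark's implementation):</p>

```python
from itertools import combinations
from collections import Counter

rows = [("Alice", 2), ("Bob", 5)]
num_cols = 2  # grouping columns: (name, age)

# count each row once per subset of grouping columns; rolled-up
# dimensions become None (rendered as null in the DataFrame output)
counts = Counter()
for size in range(num_cols + 1):
    for kept in combinations(range(num_cols), size):
        for row in rows:
            key = tuple(row[i] if i in kept else None for i in range(num_cols))
            counts[key] += 1

print(counts[(None, None)])    # 2 -> the grand total row
print(counts[("Alice", None)])  # 1 -> rolled up over age
print(counts[("Alice", 2)])     # 1 -> fully grouped
```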
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.describe">
<code class="descname">describe</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.describe"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.describe" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes statistics for numeric columns.</p>
<p>These include the count, mean, stddev, min, and max. If no columns are
given, this function computes statistics for all numerical columns.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.</p>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-------+---+</span>
<span class="go">|summary|age|</span>
<span class="go">+-------+---+</span>
<span class="go">| count| 2|</span>
<span class="go">| mean|3.5|</span>
<span class="go">| stddev|1.5|</span>
<span class="go">| min| 2|</span>
<span class="go">| max| 5|</span>
<span class="go">+-------+---+</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">([</span><span class="s">'age'</span><span class="p">,</span> <span class="s">'name'</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-------+---+-----+</span>
<span class="go">|summary|age| name|</span>
<span class="go">+-------+---+-----+</span>
<span class="go">| count| 2| 2|</span>
<span class="go">| mean|3.5| null|</span>
<span class="go">| stddev|1.5| null|</span>
<span class="go">| min| 2|Alice|</span>
<span class="go">| max| 5| Bob|</span>
<span class="go">+-------+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.distinct">
<code class="descname">distinct</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.distinct"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.distinct" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> containing the distinct rows in this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">distinct</span><span class="p">()</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">2</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.drop">
<code class="descname">drop</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.drop"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.drop" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> that drops the specified column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>col</strong> – a string name of the column to drop, or a
<a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> to drop.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'age'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice'), Row(name=u'Bob')]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice'), Row(name=u'Bob')]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s">'inner'</span><span class="p">)</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, height=85, name=u'Bob')]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s">'inner'</span><span class="p">)</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob', height=85)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.dropDuplicates">
<code class="descname">dropDuplicates</code><span class="sig-paren">(</span><em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.dropDuplicates"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.dropDuplicates" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with duplicate rows removed,
optionally only considering certain columns.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">Row</span>
<span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'Alice'</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">),</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'Alice'</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">),</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'Alice'</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">)])</span><span class="o">.</span><span class="n">toDF</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 5| 80|Alice|</span>
<span class="go">| 10| 80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">([</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'height'</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 5| 80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.drop_duplicates">
<code class="descname">drop_duplicates</code><span class="sig-paren">(</span><em>subset=None</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.drop_duplicates" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with duplicate rows removed,
optionally only considering certain columns.</p>
<p><a class="reference internal" href="#pyspark.sql.DataFrame.drop_duplicates" title="pyspark.sql.DataFrame.drop_duplicates"><code class="xref py py-func docutils literal"><span class="pre">drop_duplicates()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.dropDuplicates" title="pyspark.sql.DataFrame.dropDuplicates"><code class="xref py py-func docutils literal"><span class="pre">dropDuplicates()</span></code></a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">Row</span>
<span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'Alice'</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">),</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'Alice'</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">),</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'Alice'</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">)])</span><span class="o">.</span><span class="n">toDF</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 5| 80|Alice|</span>
<span class="go">| 10| 80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">([</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'height'</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 5| 80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.dropna">
<code class="descname">dropna</code><span class="sig-paren">(</span><em>how='any'</em>, <em>thresh=None</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.dropna"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.dropna" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> omitting rows with null values.
<a class="reference internal" href="#pyspark.sql.DataFrame.dropna" title="pyspark.sql.DataFrame.dropna"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.dropna()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.drop" title="pyspark.sql.DataFrameNaFunctions.drop"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.drop()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>how</strong> – ‘any’ or ‘all’.
If ‘any’, drop a row if it contains any nulls.
If ‘all’, drop a row only if all its values are null.</li>
<li><strong>thresh</strong> – int, default None.
If specified, drop rows that have fewer than <cite>thresh</cite> non-null values.
This overrides the <cite>how</cite> parameter.</li>
<li><strong>subset</strong> – optional list of column names to consider.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">drop</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 10| 80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
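<p>The <cite>thresh</cite> rule can be sketched in plain Python over hypothetical rows: a row is kept only if it has at least <cite>thresh</cite> non-null values, which is why <cite>thresh</cite> overrides <cite>how</cite>. (An illustration of the semantics, not Spark's implementation.)</p>

```python
rows = [
    {"age": 10,   "height": 80,   "name": "Alice"},
    {"age": 5,    "height": None, "name": "Bob"},
    {"age": None, "height": None, "name": "Tom"},
]

thresh = 2  # keep rows with at least 2 non-null values
kept = [
    row for row in rows
    if sum(v is not None for v in row.values()) >= thresh
]
print([row["name"] for row in kept])  # ['Alice', 'Bob']
```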
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.dtypes">
<code class="descname">dtypes</code><a class="headerlink" href="#pyspark.sql.DataFrame.dtypes" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns all column names and their data types as a list.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[('age', 'int'), ('name', 'string')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.explain">
<code class="descname">explain</code><span class="sig-paren">(</span><em>extended=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.explain"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.explain" title="Permalink to this definition">¶</a></dt>
<dd><p>Prints the (logical and physical) plans to the console for debugging purposes.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>extended</strong> – boolean, default <code class="docutils literal"><span class="pre">False</span></code>. If <code class="docutils literal"><span class="pre">False</span></code>, prints only the physical plan.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">explain</span><span class="p">()</span>
<span class="go">Scan PhysicalRDD[age#0,name#1]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">explain</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="go">== Parsed Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Analyzed Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Optimized Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Physical Plan ==</span>
<span class="gp">...</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.fillna">
<code class="descname">fillna</code><span class="sig-paren">(</span><em>value</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.fillna"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.fillna" title="Permalink to this definition">¶</a></dt>
<dd><p>Replaces null values; this is an alias for <code class="docutils literal"><span class="pre">na.fill()</span></code>.
<a class="reference internal" href="#pyspark.sql.DataFrame.fillna" title="pyspark.sql.DataFrame.fillna"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.fillna()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.fill" title="pyspark.sql.DataFrameNaFunctions.fill"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.fill()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>value</strong> – int, long, float, string, or dict.
Value to replace null values with.
If the value is a dict, then <cite>subset</cite> is ignored and <cite>value</cite> must be a mapping
from column name (string) to replacement value. The replacement value must be
an int, long, float, or string.</li>
<li><strong>subset</strong> – optional list of column names to consider.
Columns specified in <cite>subset</cite> that do not have a matching data type are ignored.
For example, if <cite>value</cite> is a string and <cite>subset</cite> contains a non-string column,
then the non-string column is simply ignored.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">fill</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 10| 80|Alice|</span>
<span class="go">| 5| 50| Bob|</span>
<span class="go">| 50| 50| Tom|</span>
<span class="go">| 50| 50| null|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">fill</span><span class="p">({</span><span class="s">'age'</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span> <span class="s">'name'</span><span class="p">:</span> <span class="s">'unknown'</span><span class="p">})</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-------+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-------+</span>
<span class="go">| 10| 80| Alice|</span>
<span class="go">| 5| null| Bob|</span>
<span class="go">| 50| null| Tom|</span>
<span class="go">| 50| null|unknown|</span>
<span class="go">+---+------+-------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.filter">
<code class="descname">filter</code><span class="sig-paren">(</span><em>condition</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.filter"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.filter" title="Permalink to this definition">¶</a></dt>
<dd><p>Filters rows using the given condition.</p>
<p><a class="reference internal" href="#pyspark.sql.DataFrame.where" title="pyspark.sql.DataFrame.where"><code class="xref py py-func docutils literal"><span class="pre">where()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.filter" title="pyspark.sql.DataFrame.filter"><code class="xref py py-func docutils literal"><span class="pre">filter()</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>condition</strong> – a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> of <a class="reference internal" href="#pyspark.sql.types.BooleanType" title="pyspark.sql.types.BooleanType"><code class="xref py py-class docutils literal"><span class="pre">types.BooleanType</span></code></a>
or a string containing a SQL expression.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">></span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice')]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="s">"age > 3"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s">"age = 2"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.first">
<code class="descname">first</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.first"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.first" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the first row as a <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="go">Row(age=2, name=u'Alice')</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.flatMap">
<code class="descname">flatMap</code><span class="sig-paren">(</span><em>f</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.flatMap"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.flatMap" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <code class="xref py py-class docutils literal"><span class="pre">RDD</span></code> by first applying the <code class="docutils literal"><span class="pre">f</span></code> function to each <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>,
and then flattening the results.</p>
<p>This is a shorthand for <code class="docutils literal"><span class="pre">df.rdd.flatMap()</span></code>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">flatMap</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="n">p</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[u'A', u'l', u'i', u'c', u'e', u'B', u'o', u'b']</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.foreach">
<code class="descname">foreach</code><span class="sig-paren">(</span><em>f</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.foreach"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.foreach" title="Permalink to this definition">¶</a></dt>
<dd><p>Applies the <code class="docutils literal"><span class="pre">f</span></code> function to all <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a> of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p>This is a shorthand for <code class="docutils literal"><span class="pre">df.rdd.foreach()</span></code>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">person</span><span class="p">):</span>
<span class="gp">... </span> <span class="k">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">foreach</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.foreachPartition">
<code class="descname">foreachPartition</code><span class="sig-paren">(</span><em>f</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.foreachPartition"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.foreachPartition" title="Permalink to this definition">¶</a></dt>
<dd><p>Applies the <code class="docutils literal"><span class="pre">f</span></code> function to each partition of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p>This is a shorthand for <code class="docutils literal"><span class="pre">df.rdd.foreachPartition()</span></code>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">people</span><span class="p">):</span>
<span class="gp">... </span> <span class="k">for</span> <span class="n">person</span> <span class="ow">in</span> <span class="n">people</span><span class="p">:</span>
<span class="gp">... </span> <span class="k">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">foreachPartition</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.freqItems">
<code class="descname">freqItems</code><span class="sig-paren">(</span><em>cols</em>, <em>support=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.freqItems"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.freqItems" title="Permalink to this definition">¶</a></dt>
<dd><p>Finds frequent items for columns, possibly with false positives, using the
frequent element count algorithm described in
<a class="reference external" href="http://dx.doi.org/10.1145/762471.762473">http://dx.doi.org/10.1145/762471.762473</a>, proposed by Karp, Schenker, and Papadimitriou.
<a class="reference internal" href="#pyspark.sql.DataFrame.freqItems" title="pyspark.sql.DataFrame.freqItems"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.freqItems()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.freqItems" title="pyspark.sql.DataFrameStatFunctions.freqItems"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.freqItems()</span></code></a> are aliases.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>cols</strong> – Names of the columns to calculate frequent items for, as a list or
tuple of strings.</li>
<li><strong>support</strong> – The frequency with which to consider an item ‘frequent’. Default is 1%.
The support must be greater than 1e-4.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
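<p>The one-pass counting scheme this method is based on can be sketched in plain Python. This is an illustrative sketch of the Karp, Schenker, and Papadimitriou algorithm, not PySpark's implementation; all names here are hypothetical.</p>

```python
def freq_items(values, support=0.01):
    # Keep at most (1/support) - 1 counters; any item occurring in more
    # than a `support` fraction of the input is guaranteed to survive,
    # but some infrequent items may survive too (false positives).
    counters = {}
    k = int(1.0 / support)
    for v in values:
        if v in counters:
            counters[v] += 1
        elif len(counters) < k - 1:
            counters[v] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

data = [1] * 60 + [2] * 30 + list(range(3, 13))
result = freq_items(data, support=0.25)
print(result)
```

<p>Note the superset guarantee: the result contains every item above the support threshold, which is why the method is documented as "possibly with false positives".</p>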
<dl class="method">
<dt id="pyspark.sql.DataFrame.groupBy">
<code class="descname">groupBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.groupBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.groupBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Groups the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> using the specified columns,
so we can run aggregation on them. See <a class="reference internal" href="#pyspark.sql.GroupedData" title="pyspark.sql.GroupedData"><code class="xref py py-class docutils literal"><span class="pre">GroupedData</span></code></a>
for all the available aggregate functions.</p>
<p><a class="reference internal" href="#pyspark.sql.DataFrame.groupby" title="pyspark.sql.DataFrame.groupby"><code class="xref py py-func docutils literal"><span class="pre">groupby()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.groupBy" title="pyspark.sql.DataFrame.groupBy"><code class="xref py py-func docutils literal"><span class="pre">groupBy()</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of columns to group by.
Each element should be a column name (string) or an expression (<a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a>).</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">avg</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5)]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">'name'</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s">'age'</span><span class="p">:</span> <span class="s">'mean'</span><span class="p">})</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">avg</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">([</span><span class="s">'name'</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">])</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Bob', age=5, count=1), Row(name=u'Alice', age=2, count=1)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.groupby">
<code class="descname">groupby</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.groupby" title="Permalink to this definition">¶</a></dt>
<dd><p>Groups the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> using the specified columns,
so we can run aggregation on them. See <a class="reference internal" href="#pyspark.sql.GroupedData" title="pyspark.sql.GroupedData"><code class="xref py py-class docutils literal"><span class="pre">GroupedData</span></code></a>
for all the available aggregate functions.</p>
<p><a class="reference internal" href="#pyspark.sql.DataFrame.groupby" title="pyspark.sql.DataFrame.groupby"><code class="xref py py-func docutils literal"><span class="pre">groupby()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.groupBy" title="pyspark.sql.DataFrame.groupBy"><code class="xref py py-func docutils literal"><span class="pre">groupBy()</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of columns to group by.
Each element should be a column name (string) or an expression (<a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a>).</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">avg</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5)]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">'name'</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s">'age'</span><span class="p">:</span> <span class="s">'mean'</span><span class="p">})</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">avg</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">([</span><span class="s">'name'</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">])</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Bob', age=5, count=1), Row(name=u'Alice', age=2, count=1)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.head">
<code class="descname">head</code><span class="sig-paren">(</span><em>n=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.head"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.head" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the first <code class="docutils literal"><span class="pre">n</span></code> rows.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>n</strong> – int, default 1. Number of rows to return.</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body">If n is greater than 1, return a list of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>.
If n is 1, return a single Row.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="go">Row(age=2, name=u'Alice')</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="go">[Row(age=2, name=u'Alice')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
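<p>The return-type rule above can be sketched with a hypothetical plain-Python helper (not PySpark's source): omitting <code>n</code> yields a single row, while any explicit <code>n</code>, even 1, yields a list.</p>

```python
def head(rows, n=None):
    # Mirrors the documented behavior: no argument -> a single row
    # (or None if empty); explicit n -> a list of up to n rows.
    if n is None:
        return rows[0] if rows else None
    return rows[:n]

rows = [("Alice", 2), ("Bob", 5)]
print(head(rows))      # a single row
print(head(rows, 1))   # a one-element list
```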
<dl class="method">
<dt id="pyspark.sql.DataFrame.insertInto">
<code class="descname">insertInto</code><span class="sig-paren">(</span><em>tableName</em>, <em>overwrite=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.insertInto"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.insertInto" title="Permalink to this definition">¶</a></dt>
<dd><p>Inserts the contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> into the specified table.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 1.4, use <a class="reference internal" href="#pyspark.sql.DataFrameWriter.insertInto" title="pyspark.sql.DataFrameWriter.insertInto"><code class="xref py py-func docutils literal"><span class="pre">DataFrameWriter.insertInto()</span></code></a> instead.</p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.intersect">
<code class="descname">intersect</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.intersect"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.intersect" title="Permalink to this definition">¶</a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> containing only the rows present in
both this frame and the other frame.</p>
<p>This is equivalent to <cite>INTERSECT</cite> in SQL.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
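<p>The INTERSECT semantics can be illustrated in miniature with plain Python (assuming hashable rows; this is a sketch, not Spark's distributed implementation): only rows appearing in both inputs survive, and, as with SQL <cite>INTERSECT</cite>, duplicates collapse.</p>

```python
def intersect(left, right):
    # Keep each row of `left` that also occurs in `right`,
    # emitting every distinct row at most once.
    right_set = set(right)
    seen = set()
    out = []
    for row in left:
        if row in right_set and row not in seen:
            seen.add(row)
            out.append(row)
    return out

a = [("Alice", 2), ("Bob", 5), ("Bob", 5)]
b = [("Bob", 5), ("Tom", 7)]
print(intersect(a, b))
```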
<dl class="method">
<dt id="pyspark.sql.DataFrame.isLocal">
<code class="descname">isLocal</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.isLocal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.isLocal" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns <code class="docutils literal"><span class="pre">True</span></code> if the <a class="reference internal" href="#pyspark.sql.DataFrame.collect" title="pyspark.sql.DataFrame.collect"><code class="xref py py-func docutils literal"><span class="pre">collect()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrame.take" title="pyspark.sql.DataFrame.take"><code class="xref py py-func docutils literal"><span class="pre">take()</span></code></a> methods can be run locally
(without any Spark executors).</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.join">
<code class="descname">join</code><span class="sig-paren">(</span><em>other</em>, <em>on=None</em>, <em>how=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.join"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.join" title="Permalink to this definition">¶</a></dt>
<dd><p>Joins with another <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>, using the given join expression.</p>
<p>The following performs a full outer join between <code class="docutils literal"><span class="pre">df</span></code> and <code class="docutils literal"><span class="pre">df2</span></code>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>other</strong> – Right side of the join</li>
<li><strong>on</strong> – a string for the join column name, a list of column names,
a join expression (Column), or a list of Columns.
If <cite>on</cite> is a string or a list of strings indicating the name of the join column(s),
the column(s) must exist on both sides, and this performs an inner equi-join.</li>
<li><strong>how</strong> – str, default ‘inner’.
One of <cite>inner</cite>, <cite>outer</cite>, <cite>left_outer</cite>, <cite>right_outer</cite>, <cite>semijoin</cite>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s">'outer'</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df2</span><span class="o">.</span><span class="n">height</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">cond</span> <span class="o">=</span> <span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df3</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">==</span> <span class="n">df3</span><span class="o">.</span><span class="n">age</span><span class="p">]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df3</span><span class="p">,</span> <span class="n">cond</span><span class="p">,</span> <span class="s">'outer'</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df3</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="s">'name'</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df2</span><span class="o">.</span><span class="n">height</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Bob', height=85)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df4</span><span class="p">,</span> <span class="p">[</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'age'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Bob', age=5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.limit">
<code class="descname">limit</code><span class="sig-paren">(</span><em>num</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.limit"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.limit" title="Permalink to this definition">¶</a></dt>
<dd><p>Limits the result count to the number specified.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">limit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">limit</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.map">
<code class="descname">map</code><span class="sig-paren">(</span><em>f</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.map"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.map" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <code class="xref py py-class docutils literal"><span class="pre">RDD</span></code> by applying the <code class="docutils literal"><span class="pre">f</span></code> function to each <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>.</p>
<p>This is a shorthand for <code class="docutils literal"><span class="pre">df.rdd.map()</span></code>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">p</span><span class="p">:</span> <span class="n">p</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[u'Alice', u'Bob']</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.mapPartitions">
<code class="descname">mapPartitions</code><span class="sig-paren">(</span><em>f</em>, <em>preservesPartitioning=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.mapPartitions"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.mapPartitions" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <code class="xref py py-class docutils literal"><span class="pre">RDD</span></code> by applying the <code class="docutils literal"><span class="pre">f</span></code> function to each partition.</p>
<p>This is a shorthand for <code class="docutils literal"><span class="pre">df.rdd.mapPartitions()</span></code>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">rdd</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span> <span class="mi">4</span><span class="p">)</span>
<span class="gp">>>> </span><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">iterator</span><span class="p">):</span> <span class="k">yield</span> <span class="mi">1</span>
<span class="gp">>>> </span><span class="n">rdd</span><span class="o">.</span><span class="n">mapPartitions</span><span class="p">(</span><span class="n">f</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="go">4</span>
</pre></div>
</div>
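<p>As a pure-Python sketch (illustrative only, not PySpark code; the helper name <cite>map_partitions</cite> is hypothetical), the semantics are: the function receives one iterator per partition, and the results from all partitions are concatenated.</p>

```python
from itertools import chain

def map_partitions(partitions, f):
    # Apply f to each partition's iterator and concatenate the results,
    # mirroring rdd.mapPartitions(f) on already-materialized partitions.
    return list(chain.from_iterable(f(iter(p)) for p in partitions))

def f(iterator):
    yield 1  # one value per partition, regardless of its contents

# Four partitions, as in sc.parallelize([1, 2, 3, 4], 4) above:
partitions = [[1], [2], [3], [4]]
print(sum(map_partitions(partitions, f)))  # 4
```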
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.na">
<code class="descname">na</code><a class="headerlink" href="#pyspark.sql.DataFrame.na" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions" title="pyspark.sql.DataFrameNaFunctions"><code class="xref py py-class docutils literal"><span class="pre">DataFrameNaFunctions</span></code></a> for handling missing values.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.orderBy">
<code class="descname">orderBy</code><span class="sig-paren">(</span><em>*cols</em>, <em>**kwargs</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.orderBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> sorted by the specified column(s).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>cols</strong> – list of <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> or column names to sort by.</li>
<li><strong>ascending</strong> – boolean or list of boolean (default True).
Sort ascending vs. descending. Specify list for multiple sort orders.
If a list is specified, length of the list must equal length of the <cite>cols</cite>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s">"age"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]</span>
<span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="o">*</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">asc</span><span class="p">(</span><span class="s">"age"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="s">"age"</span><span class="p">),</span> <span class="s">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">([</span><span class="s">"age"</span><span class="p">,</span> <span class="s">"name"</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]</span>
</pre></div>
</div>
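<p>The list form of <cite>ascending</cite> can be sketched in plain Python (illustrative only; <cite>multi_sort</cite> is a hypothetical helper, not part of the API): each entry sets the direction of the matching column, and a stable sort applied from the last column to the first reproduces the multi-key ordering.</p>

```python
def multi_sort(rows, cols, ascending):
    # Sort by each column in reverse priority order; Python's sort is
    # stable, so later passes preserve the ordering of earlier ones.
    # A falsy ascending entry (e.g. 0) means descending, as in orderBy.
    for col, asc in reversed(list(zip(cols, ascending))):
        rows = sorted(rows, key=lambda r: r[col], reverse=not asc)
    return rows

rows = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]
print(multi_sort(rows, ["age", "name"], [0, 1]))
# [{'age': 5, 'name': 'Bob'}, {'age': 2, 'name': 'Alice'}]
```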
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.persist">
<code class="descname">persist</code><span class="sig-paren">(</span><em>storageLevel=StorageLevel(False</em>, <em>True</em>, <em>False</em>, <em>False</em>, <em>1)</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.persist"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.persist" title="Permalink to this definition">¶</a></dt>
<dd><p>Sets the storage level to persist the contents of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> across operations
after the first time it is computed. A new storage level can only be
assigned if the underlying RDD does not have one set yet.
If no storage level is specified, it defaults to <code class="xref py py-class docutils literal"><span class="pre">MEMORY_ONLY_SER</span></code>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.printSchema">
<code class="descname">printSchema</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.printSchema"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.printSchema" title="Permalink to this definition">¶</a></dt>
<dd><p>Prints out the schema in tree format.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
<span class="go">root</span>
<span class="go"> |-- age: integer (nullable = true)</span>
<span class="go"> |-- name: string (nullable = true)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.randomSplit">
<code class="descname">randomSplit</code><span class="sig-paren">(</span><em>weights</em>, <em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.randomSplit"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.randomSplit" title="Permalink to this definition">¶</a></dt>
<dd><p>Randomly splits this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> with the provided weights.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>weights</strong> – list of doubles as weights with which to split the DataFrame. Weights will
be normalized if they don’t sum up to 1.0.</li>
<li><strong>seed</strong> – The seed for sampling.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">splits</span> <span class="o">=</span> <span class="n">df4</span><span class="o">.</span><span class="n">randomSplit</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span> <span class="mi">24</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">splits</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">1</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">splits</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">3</span>
</pre></div>
</div>
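<p>The weight normalization described above can be sketched in plain Python (an illustration, not PySpark code; <cite>normalize_weights</cite> is a hypothetical helper): weights that do not sum to 1.0 are divided by their total before the split, so <cite>[1.0, 2.0]</cite> yields roughly a 1:2 split.</p>

```python
def normalize_weights(weights):
    # randomSplit normalizes weights that do not sum to 1.0;
    # [1.0, 2.0] becomes [1/3, 2/3].
    total = float(sum(weights))
    return [w / total for w in weights]

print(normalize_weights([1.0, 2.0]))  # [0.3333..., 0.6666...]
```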
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.rdd">
<code class="descname">rdd</code><a class="headerlink" href="#pyspark.sql.DataFrame.rdd" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the content as a <a class="reference internal" href="pyspark.html#pyspark.RDD" title="pyspark.RDD"><code class="xref py py-class docutils literal"><span class="pre">pyspark.RDD</span></code></a> of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.registerAsTable">
<code class="descname">registerAsTable</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.registerAsTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.registerAsTable" title="Permalink to this definition">¶</a></dt>
<dd><div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 1.4, use <a class="reference internal" href="#pyspark.sql.DataFrame.registerTempTable" title="pyspark.sql.DataFrame.registerTempTable"><code class="xref py py-func docutils literal"><span class="pre">registerTempTable()</span></code></a> instead.</p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.registerTempTable">
<code class="descname">registerTempTable</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.registerTempTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.registerTempTable" title="Permalink to this definition">¶</a></dt>
<dd><p>Registers this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as a temporary table using the given name.</p>
<p>The lifetime of this temporary table is tied to the <a class="reference internal" href="#pyspark.sql.SQLContext" title="pyspark.sql.SQLContext"><code class="xref py py-class docutils literal"><span class="pre">SQLContext</span></code></a>
that was used to create this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">registerTempTable</span><span class="p">(</span><span class="s">"people"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df2</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"select * from people"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.repartition">
<code class="descname">repartition</code><span class="sig-paren">(</span><em>numPartitions</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.repartition"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.repartition" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> that has exactly <code class="docutils literal"><span class="pre">numPartitions</span></code> partitions.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">repartition</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">getNumPartitions</span><span class="p">()</span>
<span class="go">10</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.replace">
<code class="descname">replace</code><span class="sig-paren">(</span><em>to_replace</em>, <em>value</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.replace"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.replace" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> replacing a value with another value.
<a class="reference internal" href="#pyspark.sql.DataFrame.replace" title="pyspark.sql.DataFrame.replace"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.replace()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.replace" title="pyspark.sql.DataFrameNaFunctions.replace"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.replace()</span></code></a> are
aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>to_replace</strong> – int, long, float, string, or list.
Value to be replaced.
If the value is a dict, then <cite>value</cite> is ignored and <cite>to_replace</cite> must be a
mapping from column name (string) to replacement value. The value to be
replaced must be an int, long, float, or string.</li>
<li><strong>value</strong> – int, long, float, string, or list.
The value to use as the replacement.
The replacement value must be an int, long, float, or string. If <cite>value</cite> is a
list or tuple, it must have the same length as <cite>to_replace</cite>.</li>
<li><strong>subset</strong> – optional list of column names to consider.
Columns specified in <cite>subset</cite> that do not have a matching data type are ignored.
For example, if <cite>value</cite> is a string and <cite>subset</cite> contains a non-string column,
the non-string column is simply ignored.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+-----+</span>
<span class="go">| age|height| name|</span>
<span class="go">+----+------+-----+</span>
<span class="go">| 20| 80|Alice|</span>
<span class="go">| 5| null| Bob|</span>
<span class="go">|null| null| Tom|</span>
<span class="go">|null| null| null|</span>
<span class="go">+----+------+-----+</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">([</span><span class="s">'Alice'</span><span class="p">,</span> <span class="s">'Bob'</span><span class="p">],</span> <span class="p">[</span><span class="s">'A'</span><span class="p">,</span> <span class="s">'B'</span><span class="p">],</span> <span class="s">'name'</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+----+</span>
<span class="go">| age|height|name|</span>
<span class="go">+----+------+----+</span>
<span class="go">| 10| 80| A|</span>
<span class="go">| 5| null| B|</span>
<span class="go">|null| null| Tom|</span>
<span class="go">|null| null|null|</span>
<span class="go">+----+------+----+</span>
</pre></div>
</div>
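<p>The replacement semantics can be sketched in plain Python (illustrative only; <cite>replace_rows</cite> is a hypothetical helper, not the PySpark implementation): each cell equal to a key in the mapping is swapped for its replacement, restricted to the columns named by <cite>subset</cite>.</p>

```python
def replace_rows(rows, mapping, subset=None):
    # rows: list of dicts (column -> value); mapping: old value -> new value.
    # Mirrors df.na.replace: only cells whose value is a key in `mapping`
    # change, and only in the columns named by `subset` (all if None).
    out = []
    for row in rows:
        new_row = dict(row)
        for col, val in row.items():
            if (subset is None or col in subset) and val in mapping:
                new_row[col] = mapping[val]
        out.append(new_row)
    return out

rows = [{"age": 10, "name": "Alice"}, {"age": 5, "name": "Bob"}]
print(replace_rows(rows, {"Alice": "A", "Bob": "B"}, subset=["name"]))
# [{'age': 10, 'name': 'A'}, {'age': 5, 'name': 'B'}]
```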
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.rollup">
<code class="descname">rollup</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.rollup"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.rollup" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates a multi-dimensional rollup for the current <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> using
the specified columns, so that aggregations can be run on them.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">rollup</span><span class="p">(</span><span class="s">'name'</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+----+-----+</span>
<span class="go">| name| age|count|</span>
<span class="go">+-----+----+-----+</span>
<span class="go">|Alice|null| 1|</span>
<span class="go">| Bob| 5| 1|</span>
<span class="go">| Bob|null| 1|</span>
<span class="go">| null|null| 2|</span>
<span class="go">|Alice| 2| 1|</span>
<span class="go">+-----+----+-----+</span>
</pre></div>
</div>
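<p>The grouping produced by a rollup can be sketched in plain Python (illustrative only; <cite>rollup_counts</cite> is a hypothetical helper): <cite>rollup('name', 'age')</cite> aggregates over every column prefix — <cite>(name, age)</cite>, <cite>(name,)</cite>, and the grand total <cite>()</cite> — which is why the table above contains rows with <cite>null</cite> in the rolled-up positions.</p>

```python
from collections import Counter

def rollup_counts(rows, cols):
    # For cols = [c1, c2, ...], a rollup aggregates over every prefix:
    # (c1, c2, ...), then (c1, ...), down to the grand total ().
    counts = Counter()
    for row in rows:
        for k in range(len(cols), -1, -1):
            key = tuple(row[c] for c in cols[:k])
            counts[key] += 1
    return counts

rows = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 5}]
c = rollup_counts(rows, ["name", "age"])
print(c[("Alice", 2)], c[("Alice",)], c[()])  # 1 1 2
```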
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.sample">
<code class="descname">sample</code><span class="sig-paren">(</span><em>withReplacement</em>, <em>fraction</em>, <em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.sample"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.sample" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a sampled subset of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="bp">False</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mi">42</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">1</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.sampleBy">
<code class="descname">sampleBy</code><span class="sig-paren">(</span><em>col</em>, <em>fractions</em>, <em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.sampleBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.sampleBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a stratified sample without replacement, based on the
fraction given for each stratum.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>col</strong> – column that defines strata</li>
<li><strong>fractions</strong> – sampling fraction for each stratum. If a stratum is not
specified, we treat its fraction as zero.</li>
<li><strong>seed</strong> – random seed</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">a new DataFrame that represents the stratified sample</p>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">col</span>
<span class="gp">>>> </span><span class="n">dataset</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">((</span><span class="n">col</span><span class="p">(</span><span class="s">"id"</span><span class="p">)</span> <span class="o">%</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"key"</span><span class="p">))</span>
<span class="gp">>>> </span><span class="n">sampled</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">sampleBy</span><span class="p">(</span><span class="s">"key"</span><span class="p">,</span> <span class="n">fractions</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">1</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">},</span> <span class="n">seed</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">sampled</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">"key"</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s">"key"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|key|count|</span>
<span class="go">+---+-----+</span>
<span class="go">| 0| 3|</span>
<span class="go">| 1| 8|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
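<p>The rule that a missing stratum gets fraction zero can be sketched in plain Python (illustrative only; <cite>sample_by</cite> is a hypothetical helper, not the PySpark implementation): each row is kept with the probability assigned to its stratum, and strata absent from <cite>fractions</cite> are never sampled.</p>

```python
import random

def sample_by(rows, key, fractions, seed=None):
    # Keep each row with probability fractions.get(row[key], 0.0);
    # strata absent from `fractions` default to 0.0 and never appear.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(r[key], 0.0)]

rows = [{"key": i % 3} for i in range(100)]
sampled = sample_by(rows, "key", {0: 0.1, 1: 0.2}, seed=0)
# No row from stratum 2 can appear, since its fraction defaults to 0.0:
print(all(r["key"] != 2 for r in sampled))  # True
```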
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.save">
<code class="descname">save</code><span class="sig-paren">(</span><em>path=None</em>, <em>source=None</em>, <em>mode='error'</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.save"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.save" title="Permalink to this definition">¶</a></dt>
<dd><p>Saves the contents of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to a data source.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 1.4, use <a class="reference internal" href="#pyspark.sql.DataFrameWriter.save" title="pyspark.sql.DataFrameWriter.save"><code class="xref py py-func docutils literal"><span class="pre">DataFrameWriter.save()</span></code></a> instead.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.saveAsParquetFile">
<code class="descname">saveAsParquetFile</code><span class="sig-paren">(</span><em>path</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.saveAsParquetFile"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.saveAsParquetFile" title="Permalink to this definition">¶</a></dt>
<dd><p>Saves the contents as a Parquet file, preserving the schema.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 1.4, use <a class="reference internal" href="#pyspark.sql.DataFrameWriter.parquet" title="pyspark.sql.DataFrameWriter.parquet"><code class="xref py py-func docutils literal"><span class="pre">DataFrameWriter.parquet()</span></code></a> instead.</p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.saveAsTable">
<code class="descname">saveAsTable</code><span class="sig-paren">(</span><em>tableName</em>, <em>source=None</em>, <em>mode='error'</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.saveAsTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.saveAsTable" title="Permalink to this definition">¶</a></dt>
<dd><p>Saves the contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to a data source as a table.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 1.4, use <a class="reference internal" href="#pyspark.sql.DataFrameWriter.saveAsTable" title="pyspark.sql.DataFrameWriter.saveAsTable"><code class="xref py py-func docutils literal"><span class="pre">DataFrameWriter.saveAsTable()</span></code></a> instead.</p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.schema">
<code class="descname">schema</code><a class="headerlink" href="#pyspark.sql.DataFrame.schema" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the schema of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as a <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">types.StructType</span></code></a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">schema</span>
<span class="go">StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.select">
<code class="descname">select</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.select"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.select" title="Permalink to this definition">¶</a></dt>
<dd><p>Projects a set of expressions and returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string) or expressions (<a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a>).
If one of the column names is ‘*’, that column is expanded to include all columns
in the current DataFrame.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'*'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'age'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">+</span> <span class="mi">10</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'age'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.selectExpr">
<code class="descname">selectExpr</code><span class="sig-paren">(</span><em>*expr</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.selectExpr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.selectExpr" title="Permalink to this definition">¶</a></dt>
<dd><p>Projects a set of SQL expressions and returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p>This is a variant of <a class="reference internal" href="#pyspark.sql.DataFrame.select" title="pyspark.sql.DataFrame.select"><code class="xref py py-func docutils literal"><span class="pre">select()</span></code></a> that accepts SQL expressions.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s">"age * 2"</span><span class="p">,</span> <span class="s">"abs(age)"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row((age * 2)=4, 'abs(age)=2), Row((age * 2)=10, 'abs(age)=5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.show">
<code class="descname">show</code><span class="sig-paren">(</span><em>n=20</em>, <em>truncate=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.show"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.show" title="Permalink to this definition">¶</a></dt>
<dd><p>Prints the first <code class="docutils literal"><span class="pre">n</span></code> rows to the console.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>n</strong> – Number of rows to show.</li>
<li><strong>truncate</strong> – Whether to truncate long strings and align cells right.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span>
<span class="go">DataFrame[age: int, name: string]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">| 2|Alice|</span>
<span class="go">| 5| Bob|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.sort">
<code class="descname">sort</code><span class="sig-paren">(</span><em>*cols</em>, <em>**kwargs</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.sort"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.sort" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> sorted by the specified column(s).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>cols</strong> – list of <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> or column names to sort by.</li>
<li><strong>ascending</strong> – boolean or list of boolean (default True).
Sort ascending vs. descending. Specify list for multiple sort orders.
If a list is specified, the length of the list must equal the length of <cite>cols</cite>.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s">"age"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]</span>
<span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="o">*</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">asc</span><span class="p">(</span><span class="s">"age"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="s">"age"</span><span class="p">),</span> <span class="s">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">([</span><span class="s">"age"</span><span class="p">,</span> <span class="s">"name"</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.stat">
<code class="descname">stat</code><a class="headerlink" href="#pyspark.sql.DataFrame.stat" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions" title="pyspark.sql.DataFrameStatFunctions"><code class="xref py py-class docutils literal"><span class="pre">DataFrameStatFunctions</span></code></a> for statistic functions.</p>
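<p>For example (an illustrative sketch, not part of the original doctests; <code class="docutils literal"><span class="pre">cov()</span></code> is one of the statistic functions available on the returned object in 1.4):</p>
<div class="highlight-python"><div class="highlight"><pre>>>> df.stat.cov('age', 'age')  # sample covariance of a column with itself
</pre></div>
</div>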
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.subtract">
<code class="descname">subtract</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.subtract"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.subtract" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> containing the rows in this frame
that are not in the other frame.</p>
<p>This is equivalent to <cite>EXCEPT</cite> in SQL.</p>
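<p>An illustrative sketch (not part of the original doctests; it assumes a hypothetical second DataFrame <code class="docutils literal"><span class="pre">df2</span></code> with the same schema, built here with <code class="docutils literal"><span class="pre">createDataFrame</span></code>):</p>
<div class="highlight-python"><div class="highlight"><pre>>>> df2 = sqlContext.createDataFrame([(5, 'Bob')], ['age', 'name'])
>>> df.subtract(df2).collect()
[Row(age=2, name=u'Alice')]
</pre></div>
</div>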
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.take">
<code class="descname">take</code><span class="sig-paren">(</span><em>num</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.take"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.take" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the first <code class="docutils literal"><span class="pre">num</span></code> rows as a <code class="xref py py-class docutils literal"><span class="pre">list</span></code> of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal"><span class="pre">Row</span></code></a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="go">[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.toJSON">
<code class="descname">toJSON</code><span class="sig-paren">(</span><em>use_unicode=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.toJSON"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.toJSON" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> into an <code class="xref py py-class docutils literal"><span class="pre">RDD</span></code> of strings.</p>
<p>Each row is turned into a JSON document as one element in the returned RDD.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">toJSON</span><span class="p">()</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="go">u'{"age":2,"name":"Alice"}'</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.toPandas">
<code class="descname">toPandas</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.toPandas"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.toPandas" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as a <code class="docutils literal"><span class="pre">pandas.DataFrame</span></code>.</p>
<p>This is only available if Pandas is installed.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">toPandas</span><span class="p">()</span>
<span class="go"> age name</span>
<span class="go">0 2 Alice</span>
<span class="go">1 5 Bob</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.unionAll">
<code class="descname">unionAll</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.unionAll"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.unionAll" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> containing the union of the rows in this
frame and the other frame.</p>
<p>This is equivalent to <cite>UNION ALL</cite> in SQL.</p>
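<p>A short sketch (not part of the original doctests) showing that, unlike SQL's <cite>UNION</cite>, duplicate rows are kept:</p>
<div class="highlight-python"><div class="highlight"><pre>>>> df.unionAll(df).count()
4
</pre></div>
</div>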
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.unpersist">
<code class="descname">unpersist</code><span class="sig-paren">(</span><em>blocking=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.unpersist"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.unpersist" title="Permalink to this definition">¶</a></dt>
<dd><p>Marks the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as non-persistent, and removes all of its blocks from
memory and disk.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.where">
<code class="descname">where</code><span class="sig-paren">(</span><em>condition</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.where" title="Permalink to this definition">¶</a></dt>
<dd><p>Filters rows using the given condition.</p>
<p><a class="reference internal" href="#pyspark.sql.DataFrame.where" title="pyspark.sql.DataFrame.where"><code class="xref py py-func docutils literal"><span class="pre">where()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.filter" title="pyspark.sql.DataFrame.filter"><code class="xref py py-func docutils literal"><span class="pre">filter()</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>condition</strong> – a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> of <a class="reference internal" href="#pyspark.sql.types.BooleanType" title="pyspark.sql.types.BooleanType"><code class="xref py py-class docutils literal"><span class="pre">types.BooleanType</span></code></a>
or a string containing a SQL expression.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">></span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice')]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="s">"age > 3"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s">"age = 2"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.withColumn">
<code class="descname">withColumn</code><span class="sig-paren">(</span><em>colName</em>, <em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.withColumn"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.withColumn" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> by adding a column or replacing the
existing column that has the same name.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>colName</strong> – string, name of the new column.</li>
<li><strong>col</strong> – a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> expression for the new column.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">withColumn</span><span class="p">(</span><span class="s">'age2'</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrame.withColumnRenamed">
<code class="descname">withColumnRenamed</code><span class="sig-paren">(</span><em>existing</em>, <em>new</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrame.withColumnRenamed"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.withColumnRenamed" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> by renaming an existing column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>existing</strong> – string, name of the existing column to rename.</li>
<li><strong>new</strong> – string, new name of the column.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">withColumnRenamed</span><span class="p">(</span><span class="s">'age'</span><span class="p">,</span> <span class="s">'age2'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age2=2, name=u'Alice'), Row(age2=5, name=u'Bob')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="attribute">
<dt id="pyspark.sql.DataFrame.write">
<code class="descname">write</code><a class="headerlink" href="#pyspark.sql.DataFrame.write" title="Permalink to this definition">¶</a></dt>
<dd><p>Interface for saving the contents of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to external storage.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><a class="reference internal" href="#pyspark.sql.DataFrameWriter" title="pyspark.sql.DataFrameWriter"><code class="xref py py-class docutils literal"><span class="pre">DataFrameWriter</span></code></a></td>
</tr>
</tbody>
</table>
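<p>For example (an illustrative sketch; the output path is a placeholder):</p>
<div class="highlight-python"><div class="highlight"><pre>>>> df.write.format('parquet').mode('overwrite').save('/tmp/example_output')
</pre></div>
</div>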
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.GroupedData">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">GroupedData</code><span class="sig-paren">(</span><em>jdf</em>, <em>sql_ctx</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData" title="Permalink to this definition">¶</a></dt>
<dd><p>A set of methods for aggregations on a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>,
created by <a class="reference internal" href="#pyspark.sql.DataFrame.groupBy" title="pyspark.sql.DataFrame.groupBy"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.groupBy()</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.GroupedData.agg">
<code class="descname">agg</code><span class="sig-paren">(</span><em>*exprs</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.agg"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.agg" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes aggregates and returns the result as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p>The available aggregate functions are <cite>avg</cite>, <cite>max</cite>, <cite>min</cite>, <cite>sum</cite>, <cite>count</cite>.</p>
<p>If <code class="docutils literal"><span class="pre">exprs</span></code> is a single <code class="xref py py-class docutils literal"><span class="pre">dict</span></code> mapping from string to string, then the key
is the column to perform aggregation on, and the value is the aggregate function.</p>
<p>Alternatively, <code class="docutils literal"><span class="pre">exprs</span></code> can also be a list of aggregate <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> expressions.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>exprs</strong> – a dict mapping from column name (string) to aggregate functions (string),
or a list of <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a>.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">gdf</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">gdf</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s">"*"</span><span class="p">:</span> <span class="s">"count"</span><span class="p">})</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', count(1)=1), Row(name=u'Bob', count(1)=1)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
<span class="gp">>>> </span><span class="n">gdf</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=u'Alice', min(age)=2), Row(name=u'Bob', min(age)=5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.avg">
<code class="descname">avg</code><span class="sig-paren">(</span><em>*args</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.avg"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.avg" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the average value for each numeric column of each group.</p>
<p><a class="reference internal" href="#pyspark.sql.GroupedData.mean" title="pyspark.sql.GroupedData.mean"><code class="xref py py-func docutils literal"><span class="pre">mean()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.GroupedData.avg" title="pyspark.sql.GroupedData.avg"><code class="xref py py-func docutils literal"><span class="pre">avg()</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string). Non-numeric columns are ignored.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">avg</span><span class="p">(</span><span class="s">'age'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5)]</span>
<span class="gp">>>> </span><span class="n">df3</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">avg</span><span class="p">(</span><span class="s">'age'</span><span class="p">,</span> <span class="s">'height'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5, avg(height)=82.5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.count">
<code class="descname">count</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.count"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.count" title="Permalink to this definition">¶</a></dt>
<dd><p>Counts the number of records for each group.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, count=1), Row(age=5, count=1)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.max">
<code class="descname">max</code><span class="sig-paren">(</span><em>*args</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.max"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.max" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the maximum value of each numeric column for each group.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="s">'age'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(max(age)=5)]</span>
<span class="gp">>>> </span><span class="n">df3</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="s">'age'</span><span class="p">,</span> <span class="s">'height'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(max(age)=5, max(height)=85)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.mean">
<code class="descname">mean</code><span class="sig-paren">(</span><em>*args</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.mean"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.mean" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the average value of each numeric column for each group.</p>
<p><a class="reference internal" href="#pyspark.sql.GroupedData.mean" title="pyspark.sql.GroupedData.mean"><code class="xref py py-func docutils literal"><span class="pre">mean()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.GroupedData.avg" title="pyspark.sql.GroupedData.avg"><code class="xref py py-func docutils literal"><span class="pre">avg()</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string). Non-numeric columns are ignored.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s">'age'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5)]</span>
<span class="gp">>>> </span><span class="n">df3</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="s">'age'</span><span class="p">,</span> <span class="s">'height'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5, avg(height)=82.5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.min">
<code class="descname">min</code><span class="sig-paren">(</span><em>*args</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.min"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.min" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the minimum value of each numeric column for each group.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string). Non-numeric columns are ignored.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="s">'age'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(min(age)=2)]</span>
<span class="gp">>>> </span><span class="n">df3</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="s">'age'</span><span class="p">,</span> <span class="s">'height'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(min(age)=2, min(height)=80)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.GroupedData.sum">
<code class="descname">sum</code><span class="sig-paren">(</span><em>*args</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/group.html#GroupedData.sum"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.GroupedData.sum" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the sum of each numeric column for each group.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string). Non-numeric columns are ignored.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="s">'age'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(sum(age)=7)]</span>
<span class="gp">>>> </span><span class="n">df3</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="s">'age'</span><span class="p">,</span> <span class="s">'height'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(sum(age)=7, sum(height)=165)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.Column">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">Column</code><span class="sig-paren">(</span><em>jc</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column" title="Permalink to this definition">¶</a></dt>
<dd><p>A column in a DataFrame.</p>
<p><a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> instances can be created by:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="c"># 1. Select a column out of a DataFrame</span>
<span class="n">df</span><span class="o">.</span><span class="n">colName</span>
<span class="n">df</span><span class="p">[</span><span class="s">"colName"</span><span class="p">]</span>
<span class="c"># 2. Create from an expression</span>
<span class="n">df</span><span class="o">.</span><span class="n">colName</span> <span class="o">+</span> <span class="mi">1</span>
<span class="mi">1</span> <span class="o">/</span> <span class="n">df</span><span class="o">.</span><span class="n">colName</span>
</pre></div>
</div>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.Column.alias">
<code class="descname">alias</code><span class="sig-paren">(</span><em>*alias</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.alias"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.alias" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns this column aliased with a new name or names (in the case of expressions that
return more than one column, such as explode).</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"age2"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age2=2), Row(age2=5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.asc">
<code class="descname">asc</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.asc" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a sort expression based on the ascending order of the given column name.</p>
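<p>A usage sketch, assuming the same two-row <code class="docutils literal"><span class="pre">df</span></code> fixture used by the other examples on this page:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span>df.sort(df.age.asc()).collect()
<span class="go">[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]</span>
</pre></div>
</div>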
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.astype">
<code class="descname">astype</code><span class="sig-paren">(</span><em>dataType</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.astype" title="Permalink to this definition">¶</a></dt>
<dd><p>Convert the column into type <code class="docutils literal"><span class="pre">dataType</span></code>. <code class="docutils literal"><span class="pre">astype()</span></code> is an alias for <code class="docutils literal"><span class="pre">cast()</span></code>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="s">"string"</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'ages'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(ages=u'2'), Row(ages=u'5')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">StringType</span><span class="p">())</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'ages'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(ages=u'2'), Row(ages=u'5')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.between">
<code class="descname">between</code><span class="sig-paren">(</span><em>lowerBound</em>, <em>upperBound</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.between"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.between" title="Permalink to this definition">¶</a></dt>
<dd><p>A boolean expression that is evaluated to true if the value of this
expression is between the given lower and upper bounds (inclusive).</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">between</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+--------------------------+</span>
<span class="go">| name|((age >= 2) && (age <= 4))|</span>
<span class="go">+-----+--------------------------+</span>
<span class="go">|Alice| true|</span>
<span class="go">| Bob| false|</span>
<span class="go">+-----+--------------------------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.bitwiseAND">
<code class="descname">bitwiseAND</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.bitwiseAND" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the bitwise AND of this expression and another expression.</p>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.bitwiseOR">
<code class="descname">bitwiseOR</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.bitwiseOR" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the bitwise OR of this expression and another expression.</p>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.bitwiseXOR">
<code class="descname">bitwiseXOR</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.bitwiseXOR" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the bitwise XOR of this expression and another expression.</p>
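<p>All three bitwise methods behave like Python&#8217;s <code class="docutils literal"><span class="pre">&amp;</span></code>, <code class="docutils literal"><span class="pre">|</span></code> and <code class="docutils literal"><span class="pre">^</span></code> applied per row. A sketch, assuming a hypothetical DataFrame <code class="docutils literal"><span class="pre">df4</span></code> with integer columns <code class="docutils literal"><span class="pre">a</span></code> and <code class="docutils literal"><span class="pre">b</span></code>:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="c"># for a=170, b=75: 170 & 75 == 10, 170 | 75 == 235, 170 ^ 75 == 225</span>
<span class="gp">>>> </span>df4.select(df4.a.bitwiseAND(df4.b), df4.a.bitwiseOR(df4.b), df4.a.bitwiseXOR(df4.b)).collect()
</pre></div>
</div>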
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.cast">
<code class="descname">cast</code><span class="sig-paren">(</span><em>dataType</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.cast"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.cast" title="Permalink to this definition">¶</a></dt>
<dd><p>Convert the column into type <code class="docutils literal"><span class="pre">dataType</span></code>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="s">"string"</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'ages'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(ages=u'2'), Row(ages=u'5')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">StringType</span><span class="p">())</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'ages'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(ages=u'2'), Row(ages=u'5')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.desc">
<code class="descname">desc</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.desc" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a sort expression based on the descending order of the given column name.</p>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.endswith">
<code class="descname">endswith</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.endswith" title="Permalink to this definition">¶</a></dt>
<dd><p>A boolean expression that is true if the string value of this column ends with <em>other</em>.</p>
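<p>A usage sketch, assuming the same <code class="docutils literal"><span class="pre">df</span></code> fixture as the other examples on this page (the comparison is case-sensitive):</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span>df.filter(df.name.endswith('ob')).collect()
<span class="go">[Row(age=5, name=u'Bob')]</span>
</pre></div>
</div>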
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.getField">
<code class="descname">getField</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.getField"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.getField" title="Permalink to this definition">¶</a></dt>
<dd><p>An expression that gets a field by name out of a struct column.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">Row</span>
<span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="n">Row</span><span class="p">(</span><span class="n">r</span><span class="o">=</span><span class="n">Row</span><span class="p">(</span><span class="n">a</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="s">"b"</span><span class="p">))])</span><span class="o">.</span><span class="n">toDF</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">r</span><span class="o">.</span><span class="n">getField</span><span class="p">(</span><span class="s">"b"</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+</span>
<span class="go">|r[b]|</span>
<span class="go">+----+</span>
<span class="go">| b|</span>
<span class="go">+----+</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">r</span><span class="o">.</span><span class="n">a</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+</span>
<span class="go">|r[a]|</span>
<span class="go">+----+</span>
<span class="go">| 1|</span>
<span class="go">+----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.getItem">
<code class="descname">getItem</code><span class="sig-paren">(</span><em>key</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.getItem"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.getItem" title="Permalink to this definition">¶</a></dt>
<dd><p>An expression that gets an item at position <code class="docutils literal"><span class="pre">key</span></code> out of a list,
or gets an item by key out of a dict.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="p">{</span><span class="s">"key"</span><span class="p">:</span> <span class="s">"value"</span><span class="p">})])</span><span class="o">.</span><span class="n">toDF</span><span class="p">([</span><span class="s">"l"</span><span class="p">,</span> <span class="s">"d"</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">l</span><span class="o">.</span><span class="n">getItem</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="o">.</span><span class="n">getItem</span><span class="p">(</span><span class="s">"key"</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+</span>
<span class="go">|l[0]|d[key]|</span>
<span class="go">+----+------+</span>
<span class="go">| 1| value|</span>
<span class="go">+----+------+</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">l</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">[</span><span class="s">"key"</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+</span>
<span class="go">|l[0]|d[key]|</span>
<span class="go">+----+------+</span>
<span class="go">| 1| value|</span>
<span class="go">+----+------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.inSet">
<code class="descname">inSet</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.inSet"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.inSet" title="Permalink to this definition">¶</a></dt>
<dd><p>A boolean expression that is evaluated to true if the value of this
expression is contained in the evaluated values of the arguments.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">inSet</span><span class="p">(</span><span class="s">"Bob"</span><span class="p">,</span> <span class="s">"Mike"</span><span class="p">)]</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">inSet</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])]</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice')]</span>
</pre></div>
</div>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Deprecated in 1.5, use <a class="reference internal" href="#pyspark.sql.Column.isin" title="pyspark.sql.Column.isin"><code class="xref py py-func docutils literal"><span class="pre">Column.isin()</span></code></a> instead.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.isNotNull">
<code class="descname">isNotNull</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.isNotNull" title="Permalink to this definition">¶</a></dt>
<dd><p>True if the current expression is not null.</p>
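<p>A usage sketch; <code class="docutils literal"><span class="pre">df2</span></code> is a hypothetical DataFrame in which some rows have a null <code class="docutils literal"><span class="pre">name</span></code>:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="c"># keeps only the rows whose name is not null</span>
<span class="gp">>>> </span>df2.filter(df2.name.isNotNull()).collect()
</pre></div>
</div>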
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.isNull">
<code class="descname">isNull</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.isNull" title="Permalink to this definition">¶</a></dt>
<dd><p>True if the current expression is null.</p>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.isin">
<code class="descname">isin</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.isin"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.isin" title="Permalink to this definition">¶</a></dt>
<dd><p>A boolean expression that is evaluated to true if the value of this
expression is contained in the evaluated values of the arguments.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="s">"Bob"</span><span class="p">,</span> <span class="s">"Mike"</span><span class="p">)]</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name=u'Bob')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">isin</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])]</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name=u'Alice')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.like">
<code class="descname">like</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.like" title="Permalink to this definition">¶</a></dt>
<dd><p>A SQL LIKE expression. Returns a boolean expression that is true if this column matches <em>other</em> as a SQL LIKE pattern.</p>
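<p>A usage sketch, assuming the same <code class="docutils literal"><span class="pre">df</span></code> fixture as the other examples; as in SQL, <code class="docutils literal"><span class="pre">%</span></code> matches any sequence of characters and <code class="docutils literal"><span class="pre">_</span></code> matches exactly one:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span>df.filter(df.name.like('Al%')).collect()
<span class="go">[Row(age=2, name=u'Alice')]</span>
</pre></div>
</div>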
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.otherwise">
<code class="descname">otherwise</code><span class="sig-paren">(</span><em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.otherwise"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.otherwise" title="Permalink to this definition">¶</a></dt>
<dd><p>Evaluates a list of conditions and returns one of multiple possible result expressions.
If <a class="reference internal" href="#pyspark.sql.Column.otherwise" title="pyspark.sql.Column.otherwise"><code class="xref py py-func docutils literal"><span class="pre">Column.otherwise()</span></code></a> is not invoked, None is returned for unmatched conditions.</p>
<p>See <a class="reference internal" href="#pyspark.sql.functions.when" title="pyspark.sql.functions.when"><code class="xref py py-func docutils literal"><span class="pre">pyspark.sql.functions.when()</span></code></a> for example usage.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>value</strong> – a literal value, or a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> expression.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">when</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">></span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">otherwise</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+---------------------------------+</span>
<span class="go">| name|CASE WHEN (age > 3) THEN 1 ELSE 0|</span>
<span class="go">+-----+---------------------------------+</span>
<span class="go">|Alice| 0|</span>
<span class="go">| Bob| 1|</span>
<span class="go">+-----+---------------------------------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.over">
<code class="descname">over</code><span class="sig-paren">(</span><em>window</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.over"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.over" title="Permalink to this definition">¶</a></dt>
<dd><p>Define a windowing column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>window</strong> – a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a></td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body">a Column</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">Window</span>
<span class="gp">>>> </span><span class="n">window</span> <span class="o">=</span> <span class="n">Window</span><span class="o">.</span><span class="n">partitionBy</span><span class="p">(</span><span class="s">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s">"age"</span><span class="p">)</span><span class="o">.</span><span class="n">rowsBetween</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">rank</span><span class="p">,</span> <span class="nb">min</span>
<span class="gp">>>> </span><span class="c"># df.select(rank().over(window), min('age').over(window))</span>
</pre></div>
</div>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Window functions are supported only with HiveContext in 1.4</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.rlike">
<code class="descname">rlike</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.rlike" title="Permalink to this definition">¶</a></dt>
<dd><p>Binary operator. Returns a boolean column based on a regex match of the column&#8217;s string value against <em>other</em>.</p>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.Column.startswith">
<code class="descname">startswith</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.Column.startswith" title="Permalink to this definition">¶</a></dt>
<dd><p>Binary operator. Returns a boolean column that is true when the column&#8217;s string value starts with <em>other</em>.</p>
</dd></dl>
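<p>Per value, these operators reduce to an anywhere-in-the-string regex match and a literal prefix test. A minimal pure-Python sketch of those semantics (the helper names are hypothetical; the real operators run as Spark expressions):</p>

```python
import re

def rlike(value, pattern):
    # RLIKE matches if the regex is found anywhere in the string
    return value is not None and re.search(pattern, value) is not None

def startswith(value, prefix):
    # prefix test; null values never match
    return value is not None and value.startswith(prefix)

print(rlike("Alice", "ice$"))     # regex matches the suffix "ice"
print(startswith("Alice", "Al"))
```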
<dl class="method">
<dt id="pyspark.sql.Column.substr">
<code class="descname">substr</code><span class="sig-paren">(</span><em>startPos</em>, <em>length</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.substr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.substr" title="Permalink to this definition">¶</a></dt>
<dd><p>Return a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> which is a substring of the column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>startPos</strong> – start position (int or Column)</li>
<li><strong>length</strong> – length of the substring (int or Column)</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">substr</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"col"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(col=u'Ali'), Row(col=u'Bob')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
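<p>Note that <cite>startPos</cite> is 1-based, as in SQL&#8217;s <code class="docutils literal"><span class="pre">SUBSTRING</span></code>. A minimal pure-Python sketch of the indexing (the helper name is hypothetical):</p>

```python
def substr(s, start_pos, length):
    # 1-based start position, as in SQL SUBSTRING
    return s[start_pos - 1:start_pos - 1 + length]

print(substr("Alice", 1, 3))  # 'Ali'
```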
<dl class="method">
<dt id="pyspark.sql.Column.when">
<code class="descname">when</code><span class="sig-paren">(</span><em>condition</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/column.html#Column.when"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Column.when" title="Permalink to this definition">¶</a></dt>
<dd><p>Evaluates a list of conditions and returns one of multiple possible result expressions.
If <a class="reference internal" href="#pyspark.sql.Column.otherwise" title="pyspark.sql.Column.otherwise"><code class="xref py py-func docutils literal"><span class="pre">Column.otherwise()</span></code></a> is not invoked, None is returned for unmatched conditions.</p>
<p>See <a class="reference internal" href="#pyspark.sql.functions.when" title="pyspark.sql.functions.when"><code class="xref py py-func docutils literal"><span class="pre">pyspark.sql.functions.when()</span></code></a> for example usage.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>condition</strong> – a boolean <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> expression.</li>
<li><strong>value</strong> – a literal value, or a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal"><span class="pre">Column</span></code></a> expression.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">F</span><span class="o">.</span><span class="n">when</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">></span> <span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">when</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o"><</span> <span class="mi">3</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">otherwise</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+--------------------------------------------------------+</span>
<span class="go">| name|CASE WHEN (age > 4) THEN 1 WHEN (age < 3) THEN -1 ELSE 0|</span>
<span class="go">+-----+--------------------------------------------------------+</span>
<span class="go">|Alice| -1|</span>
<span class="go">| Bob| 1|</span>
<span class="go">+-----+--------------------------------------------------------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
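<p>The chained <code class="docutils literal"><span class="pre">when(...).when(...).otherwise(...)</span></code> calls build a single CASE expression that is evaluated top to bottom per row: the first matching condition wins and <code class="docutils literal"><span class="pre">otherwise</span></code> supplies the default. A pure-Python sketch of that evaluation order (the helper <code class="docutils literal"><span class="pre">case_when</span></code> is hypothetical, not part of the Spark API):</p>

```python
def case_when(age, branches, default):
    # branches: list of (predicate, result); first match wins, as in SQL CASE
    for predicate, result in branches:
        if predicate(age):
            return result
    return default

branches = [(lambda a: a > 4, 1), (lambda a: a < 3, -1)]
print(case_when(2, branches, 0))  # -1, like Alice (age 2) above
print(case_when(5, branches, 0))  # 1, like Bob (age 5) above
```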
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.Row">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">Row</code><a class="reference internal" href="_modules/pyspark/sql/types.html#Row"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Row" title="Permalink to this definition">¶</a></dt>
<dd><p>A row in <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>. The fields in it can be accessed like attributes.</p>
<p>Row can be used to create a row object by using named arguments;
the fields will be sorted by name.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">row</span> <span class="o">=</span> <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"Alice"</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">row</span>
<span class="go">Row(age=11, name='Alice')</span>
<span class="gp">>>> </span><span class="n">row</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">row</span><span class="o">.</span><span class="n">age</span>
<span class="go">('Alice', 11)</span>
</pre></div>
</div>
<p>Row can also be used to create another Row-like class, which can
then be used to create Row objects:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">Person</span> <span class="o">=</span> <span class="n">Row</span><span class="p">(</span><span class="s">"name"</span><span class="p">,</span> <span class="s">"age"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">Person</span>
<span class="go"><Row(name, age)></span>
<span class="gp">>>> </span><span class="n">Person</span><span class="p">(</span><span class="s">"Alice"</span><span class="p">,</span> <span class="mi">11</span><span class="p">)</span>
<span class="go">Row(name='Alice', age=11)</span>
</pre></div>
</div>
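<p>A Row built from keyword arguments behaves much like a namedtuple whose fields are sorted alphabetically. A rough stdlib analogue of that behaviour (a sketch, not the actual Row implementation):</p>

```python
from collections import namedtuple

def make_row(**kwargs):
    # fields sorted by name, mirroring Row's keyword-argument constructor
    fields = sorted(kwargs)
    RowLike = namedtuple("Row", fields)
    return RowLike(**kwargs)

row = make_row(name="Alice", age=11)
print(row)                # Row(age=11, name='Alice')
print(row.name, row.age)  # Alice 11
```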
<dl class="method">
<dt id="pyspark.sql.Row.asDict">
<code class="descname">asDict</code><span class="sig-paren">(</span><em>recursive=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#Row.asDict"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Row.asDict" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the row as a dict.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>recursive</strong> – turns nested Rows into dicts as well (default: False).</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"Alice"</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span><span class="o">.</span><span class="n">asDict</span><span class="p">()</span> <span class="o">==</span> <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Alice'</span><span class="p">,</span> <span class="s">'age'</span><span class="p">:</span> <span class="mi">11</span><span class="p">}</span>
<span class="go">True</span>
<span class="gp">>>> </span><span class="n">row</span> <span class="o">=</span> <span class="n">Row</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'a'</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
<span class="gp">>>> </span><span class="n">row</span><span class="o">.</span><span class="n">asDict</span><span class="p">()</span> <span class="o">==</span> <span class="p">{</span><span class="s">'key'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'value'</span><span class="p">:</span> <span class="n">Row</span><span class="p">(</span><span class="n">age</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'a'</span><span class="p">)}</span>
<span class="go">True</span>
<span class="gp">>>> </span><span class="n">row</span><span class="o">.</span><span class="n">asDict</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="s">'key'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'value'</span><span class="p">:</span> <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'age'</span><span class="p">:</span> <span class="mi">2</span><span class="p">}}</span>
<span class="go">True</span>
</pre></div>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.DataFrameNaFunctions">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">DataFrameNaFunctions</code><span class="sig-paren">(</span><em>df</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameNaFunctions"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameNaFunctions" title="Permalink to this definition">¶</a></dt>
<dd><p>Functionality for working with missing data in <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.DataFrameNaFunctions.drop">
<code class="descname">drop</code><span class="sig-paren">(</span><em>how='any'</em>, <em>thresh=None</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameNaFunctions.drop"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameNaFunctions.drop" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> omitting rows with null values.
<a class="reference internal" href="#pyspark.sql.DataFrame.dropna" title="pyspark.sql.DataFrame.dropna"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.dropna()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.drop" title="pyspark.sql.DataFrameNaFunctions.drop"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.drop()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>how</strong> – ‘any’ or ‘all’.
If ‘any’, drop a row if it contains any nulls.
If ‘all’, drop a row only if all its values are null.</li>
<li><strong>thresh</strong> – int, default None.
If specified, drop rows that have fewer than <cite>thresh</cite> non-null values.
This overrides the <cite>how</cite> parameter.</li>
<li><strong>subset</strong> – optional list of column names to consider.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">drop</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 10| 80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
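<p>The <cite>how</cite> and <cite>thresh</cite> parameters reduce to one rule: a row survives when its number of non-null values is at least a threshold (<cite>thresh</cite> if given, else the column count for &#8216;any&#8217;, else 1 for &#8216;all&#8217;). A pure-Python sketch of that rule over lists of dicts (the helper <code class="docutils literal"><span class="pre">drop_na</span></code> is hypothetical, not the Spark implementation):</p>

```python
def drop_na(rows, columns, how="any", thresh=None):
    # thresh overrides how: keep rows with at least `thresh` non-null values
    if thresh is None:
        thresh = len(columns) if how == "any" else 1
    return [r for r in rows
            if sum(r[c] is not None for c in columns) >= thresh]

rows = [{"age": 10, "height": 80, "name": "Alice"},
        {"age": 5, "height": None, "name": "Bob"},
        {"age": None, "height": None, "name": None}]
cols = ["age", "height", "name"]
print(drop_na(rows, cols))                  # only Alice has no nulls
print(len(drop_na(rows, cols, how="all")))  # 2: only the all-null row is dropped
```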
<dl class="method">
<dt id="pyspark.sql.DataFrameNaFunctions.fill">
<code class="descname">fill</code><span class="sig-paren">(</span><em>value</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameNaFunctions.fill"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameNaFunctions.fill" title="Permalink to this definition">¶</a></dt>
<dd><p>Replaces null values.
<a class="reference internal" href="#pyspark.sql.DataFrame.fillna" title="pyspark.sql.DataFrame.fillna"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.fillna()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.fill" title="pyspark.sql.DataFrameNaFunctions.fill"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.fill()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>value</strong> – int, long, float, string, or dict.
Value to replace null values with.
If the value is a dict, then <cite>subset</cite> is ignored and <cite>value</cite> must be a mapping
from column name (string) to replacement value. The replacement value must be
an int, long, float, or string.</li>
<li><strong>subset</strong> – optional list of column names to consider.
Columns specified in subset that do not have matching data type are ignored.
For example, if <cite>value</cite> is a string, and subset contains a non-string column,
then the non-string column is simply ignored.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">fill</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 10| 80|Alice|</span>
<span class="go">| 5| 50| Bob|</span>
<span class="go">| 50| 50| Tom|</span>
<span class="go">| 50| 50| null|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">fill</span><span class="p">({</span><span class="s">'age'</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span> <span class="s">'name'</span><span class="p">:</span> <span class="s">'unknown'</span><span class="p">})</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-------+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-------+</span>
<span class="go">| 10| 80| Alice|</span>
<span class="go">| 5| null| Bob|</span>
<span class="go">| 50| null| Tom|</span>
<span class="go">| 50| null|unknown|</span>
<span class="go">+---+------+-------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.1.</span></p>
</div>
</dd></dl>
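<p>Conceptually, a scalar <cite>value</cite> fills every null in type-compatible columns, while a dict fills each listed column with its own replacement. A simplified pure-Python sketch that ignores the type-compatibility check (the helper name is hypothetical):</p>

```python
def fill_na(rows, value):
    # dict: per-column replacement; scalar: same replacement everywhere
    filled = []
    for r in rows:
        if isinstance(value, dict):
            filled.append({c: (value[c] if v is None and c in value else v)
                           for c, v in r.items()})
        else:
            filled.append({c: (value if v is None else v)
                           for c, v in r.items()})
    return filled

rows = [{"age": None, "name": None}]
print(fill_na(rows, {"age": 50, "name": "unknown"}))
# [{'age': 50, 'name': 'unknown'}]
```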
<dl class="method">
<dt id="pyspark.sql.DataFrameNaFunctions.replace">
<code class="descname">replace</code><span class="sig-paren">(</span><em>to_replace</em>, <em>value</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameNaFunctions.replace"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameNaFunctions.replace" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> replacing a value with another value.
<a class="reference internal" href="#pyspark.sql.DataFrame.replace" title="pyspark.sql.DataFrame.replace"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.replace()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.replace" title="pyspark.sql.DataFrameNaFunctions.replace"><code class="xref py py-func docutils literal"><span class="pre">DataFrameNaFunctions.replace()</span></code></a> are
aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>to_replace</strong> – int, long, float, string, or list.
Value to be replaced.
If the value is a dict, then <cite>value</cite> is ignored and <cite>to_replace</cite> must be a
mapping from column name (string) to replacement value. The value to be
replaced must be an int, long, float, or string.</li>
<li><strong>value</strong> – int, long, float, string, or list.
Value to use to replace holes.
The replacement value must be an int, long, float, or string. If <cite>value</cite> is a
list or tuple, <cite>value</cite> should be of the same length as <cite>to_replace</cite>.</li>
<li><strong>subset</strong> – optional list of column names to consider.
Columns specified in subset that do not have matching data type are ignored.
For example, if <cite>value</cite> is a string, and subset contains a non-string column,
then the non-string column is simply ignored.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+-----+</span>
<span class="go">| age|height| name|</span>
<span class="go">+----+------+-----+</span>
<span class="go">| 20| 80|Alice|</span>
<span class="go">| 5| null| Bob|</span>
<span class="go">|null| null| Tom|</span>
<span class="go">|null| null| null|</span>
<span class="go">+----+------+-----+</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">([</span><span class="s">'Alice'</span><span class="p">,</span> <span class="s">'Bob'</span><span class="p">],</span> <span class="p">[</span><span class="s">'A'</span><span class="p">,</span> <span class="s">'B'</span><span class="p">],</span> <span class="s">'name'</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+----+</span>
<span class="go">| age|height|name|</span>
<span class="go">+----+------+----+</span>
<span class="go">| 10| 80| A|</span>
<span class="go">| 5| null| B|</span>
<span class="go">|null| null| Tom|</span>
<span class="go">|null| null|null|</span>
<span class="go">+----+------+----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
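<p>With list arguments, <cite>to_replace</cite> and <cite>value</cite> are paired element-wise into a mapping; a scalar pair is just the one-entry case. A pure-Python sketch of that matching over a single column of values (the helper name is hypothetical):</p>

```python
def replace_values(values, to_replace, value):
    # lists are zipped element-wise; scalars become a one-entry mapping
    if not isinstance(to_replace, (list, tuple)):
        to_replace, value = [to_replace], [value]
    mapping = dict(zip(to_replace, value))
    return [mapping.get(v, v) for v in values]

print(replace_values(["Alice", "Bob", "Tom"], ["Alice", "Bob"], ["A", "B"]))
# ['A', 'B', 'Tom']
```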
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.DataFrameStatFunctions">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">DataFrameStatFunctions</code><span class="sig-paren">(</span><em>df</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions" title="Permalink to this definition">¶</a></dt>
<dd><p>Functionality for statistic functions with <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.DataFrameStatFunctions.corr">
<code class="descname">corr</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em>, <em>method=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions.corr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions.corr" title="Permalink to this definition">¶</a></dt>
<dd><p>Calculates the correlation of two columns of a DataFrame as a double value.
Currently only supports the Pearson Correlation Coefficient.
<a class="reference internal" href="#pyspark.sql.DataFrame.corr" title="pyspark.sql.DataFrame.corr"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.corr()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.corr" title="pyspark.sql.DataFrameStatFunctions.corr"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.corr()</span></code></a> are aliases of each other.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column</li>
<li><strong>col2</strong> – The name of the second column</li>
<li><strong>method</strong> – The correlation method. Currently only supports “pearson”</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
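<p>Pearson correlation is the covariance of the two columns divided by the product of their standard deviations, giving a value in [-1, 1]. A pure-Python sketch over two lists of numbers:</p>

```python
import math

def pearson_corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # numerator: sum of co-deviations; denominator: product of deviations' norms
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_corr([1, 2, 3], [2, 4, 6]))  # 1.0: perfectly linear relation
```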
<dl class="method">
<dt id="pyspark.sql.DataFrameStatFunctions.cov">
<code class="descname">cov</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions.cov"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions.cov" title="Permalink to this definition">¶</a></dt>
<dd><p>Calculates the sample covariance for the given columns, specified by their names, as a
double value. <a class="reference internal" href="#pyspark.sql.DataFrame.cov" title="pyspark.sql.DataFrame.cov"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.cov()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.cov" title="pyspark.sql.DataFrameStatFunctions.cov"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.cov()</span></code></a> are aliases.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column</li>
<li><strong>col2</strong> – The name of the second column</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
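<p>Sample covariance averages the co-deviations from the means, dividing by n &#8722; 1 rather than n. A pure-Python sketch:</p>

```python
def sample_cov(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # sample (not population) covariance: divide by n - 1
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

print(sample_cov([1, 2, 3], [2, 4, 6]))  # 2.0
```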
<dl class="method">
<dt id="pyspark.sql.DataFrameStatFunctions.crosstab">
<code class="descname">crosstab</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions.crosstab"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions.crosstab" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes a pair-wise frequency table of the given columns. Also known as a contingency
table. The number of distinct values for each column should be less than 1e4. At most 1e6
non-zero pair frequencies will be returned.
The first column of each row will be the distinct values of <cite>col1</cite> and the column names
will be the distinct values of <cite>col2</cite>. The name of the first column will be <cite>$col1_$col2</cite>.
Pairs that have no occurrences will have zero as their counts.
<a class="reference internal" href="#pyspark.sql.DataFrame.crosstab" title="pyspark.sql.DataFrame.crosstab"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.crosstab()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.crosstab" title="pyspark.sql.DataFrameStatFunctions.crosstab"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.crosstab()</span></code></a> are aliases.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col1</strong> – The name of the first column. Distinct items will make the first item of
each row.</li>
<li><strong>col2</strong> – The name of the second column. Distinct items will make the column names
of the DataFrame.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
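<p>The result is a contingency table: one row per distinct value of <cite>col1</cite>, one count column per distinct value of <cite>col2</cite>, and zero for pairs that never occur. A pure-Python sketch over (col1, col2) pairs (the helper name is hypothetical; the real method returns a DataFrame):</p>

```python
from collections import Counter

def crosstab(pairs):
    # pairs: one (col1_value, col2_value) per row -> {c1: {c2: count}}
    counts = Counter(pairs)
    col1_vals = sorted({a for a, _ in pairs})
    col2_vals = sorted({b for _, b in pairs})
    return {a: {b: counts[(a, b)] for b in col2_vals} for a in col1_vals}

table = crosstab([("x", 1), ("x", 1), ("y", 2)])
print(table)  # {'x': {1: 2, 2: 0}, 'y': {1: 0, 2: 1}}
```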
<dl class="method">
<dt id="pyspark.sql.DataFrameStatFunctions.freqItems">
<code class="descname">freqItems</code><span class="sig-paren">(</span><em>cols</em>, <em>support=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions.freqItems"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions.freqItems" title="Permalink to this definition">¶</a></dt>
<dd><p>Finds frequent items for columns, possibly with false positives, using the
frequent element count algorithm described in
“<a class="reference external" href="http://dx.doi.org/10.1145/762471.762473">http://dx.doi.org/10.1145/762471.762473</a>”, proposed by Karp, Schenker, and Papadimitriou.
<a class="reference internal" href="#pyspark.sql.DataFrame.freqItems" title="pyspark.sql.DataFrame.freqItems"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.freqItems()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.freqItems" title="pyspark.sql.DataFrameStatFunctions.freqItems"><code class="xref py py-func docutils literal"><span class="pre">DataFrameStatFunctions.freqItems()</span></code></a> are aliases.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>cols</strong> – Names of the columns to calculate frequent items for as a list or tuple of
strings.</li>
<li><strong>support</strong> – The frequency with which to consider an item ‘frequent’. Default is 1%.
The support must be greater than 1e-4.</li>
</ul>
</td>
</tr>
</tbody>
</table>
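The underlying Karp/Schenker/Papadimitriou counting scheme can be sketched in a few lines of plain Python; this illustrates why false positives are possible, and is not the PySpark implementation:

```python
def freq_candidates(stream, support=0.01):
    """Keep at most ceil(1/support) - 1 counters. Every item whose true
    frequency exceeds `support` is guaranteed to survive, but some
    surviving items may be false positives."""
    k = int(1.0 / support)
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # decrement every counter; evict those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)
```

With `support=0.5` only a majority item can survive; with smaller supports, rarer items may slip through as false positives, which is why the result should be post-filtered if exact counts matter.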
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameStatFunctions.sampleBy">
<code class="descname">sampleBy</code><span class="sig-paren">(</span><em>col</em>, <em>fractions</em>, <em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/dataframe.html#DataFrameStatFunctions.sampleBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameStatFunctions.sampleBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a stratified sample without replacement based on the
fraction given on each stratum.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>col</strong> – column that defines strata</li>
<li><strong>fractions</strong> – sampling fraction for each stratum. If a stratum is not
specified, we treat its fraction as zero.</li>
<li><strong>seed</strong> – random seed</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">a new DataFrame that represents the stratified sample</p>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">col</span>
<span class="gp">>>> </span><span class="n">dataset</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">((</span><span class="n">col</span><span class="p">(</span><span class="s">"id"</span><span class="p">)</span> <span class="o">%</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"key"</span><span class="p">))</span>
<span class="gp">>>> </span><span class="n">sampled</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">sampleBy</span><span class="p">(</span><span class="s">"key"</span><span class="p">,</span> <span class="n">fractions</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">1</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">},</span> <span class="n">seed</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">sampled</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">"key"</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s">"key"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|key|count|</span>
<span class="go">+---+-----+</span>
<span class="go">| 0| 3|</span>
<span class="go">| 1| 8|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
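The per-stratum semantics shown above can be mimicked with plain Python (an illustrative sketch; names and the dict-of-rows layout are assumptions, and Spark applies the same logic distributed, per partition):

```python
import random

def sample_by(rows, key, fractions, seed=None):
    """Stratified sample without replacement: each row is kept
    independently with the fraction assigned to its stratum; strata
    missing from `fractions` are treated as 0.0."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(r[key], 0.0)]
```

Because rows are kept independently, the per-stratum counts are binomially distributed around `fraction * stratum_size` rather than exact, which matches the varying counts in the example output above.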
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.Window">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">Window</code><a class="reference internal" href="_modules/pyspark/sql/window.html#Window"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Window" title="Permalink to this definition">¶</a></dt>
<dd><p>Utility functions for defining windows in DataFrames.</p>
<p>For example:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="c"># PARTITION BY country ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</span>
<span class="gp">>>> </span><span class="n">window</span> <span class="o">=</span> <span class="n">Window</span><span class="o">.</span><span class="n">partitionBy</span><span class="p">(</span><span class="s">"country"</span><span class="p">)</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s">"date"</span><span class="p">)</span><span class="o">.</span><span class="n">rowsBetween</span><span class="p">(</span><span class="o">-</span><span class="n">sys</span><span class="o">.</span><span class="n">maxsize</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="c"># PARTITION BY country ORDER BY date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING</span>
<span class="gp">>>> </span><span class="n">window</span> <span class="o">=</span> <span class="n">Window</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s">"date"</span><span class="p">)</span><span class="o">.</span><span class="n">partitionBy</span><span class="p">(</span><span class="s">"country"</span><span class="p">)</span><span class="o">.</span><span class="n">rangeBetween</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
</pre></div>
</div>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="staticmethod">
<dt id="pyspark.sql.Window.orderBy">
<em class="property">static </em><code class="descname">orderBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#Window.orderBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Window.orderBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a> with the ordering defined.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="staticmethod">
<dt id="pyspark.sql.Window.partitionBy">
<em class="property">static </em><code class="descname">partitionBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#Window.partitionBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.Window.partitionBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a> with the partitioning defined.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.WindowSpec">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">WindowSpec</code><span class="sig-paren">(</span><em>jspec</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#WindowSpec"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.WindowSpec" title="Permalink to this definition">¶</a></dt>
<dd><p>A window specification that defines the partitioning, ordering,
and frame boundaries.</p>
<p>Use the static methods in <a class="reference internal" href="#pyspark.sql.Window" title="pyspark.sql.Window"><code class="xref py py-class docutils literal"><span class="pre">Window</span></code></a> to create a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.WindowSpec.orderBy">
<code class="descname">orderBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#WindowSpec.orderBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.WindowSpec.orderBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Defines the ordering columns in a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – names of columns or expressions</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.WindowSpec.partitionBy">
<code class="descname">partitionBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#WindowSpec.partitionBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.WindowSpec.partitionBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Defines the partitioning columns in a <a class="reference internal" href="#pyspark.sql.WindowSpec" title="pyspark.sql.WindowSpec"><code class="xref py py-class docutils literal"><span class="pre">WindowSpec</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – names of columns or expressions</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.WindowSpec.rangeBetween">
<code class="descname">rangeBetween</code><span class="sig-paren">(</span><em>start</em>, <em>end</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#WindowSpec.rangeBetween"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.WindowSpec.rangeBetween" title="Permalink to this definition">¶</a></dt>
<dd><p>Defines the frame boundaries, from <cite>start</cite> (inclusive) to <cite>end</cite> (inclusive).</p>
<p>Both <cite>start</cite> and <cite>end</cite> are relative to the current row. For example,
“0” means “current row”, while “-1” means one off before the current row,
and “5” means five off after the current row.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>start</strong> – boundary start, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">-sys.maxsize</span></code> (or lower).</li>
<li><strong>end</strong> – boundary end, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">sys.maxsize</span></code> (or higher).</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.WindowSpec.rowsBetween">
<code class="descname">rowsBetween</code><span class="sig-paren">(</span><em>start</em>, <em>end</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/window.html#WindowSpec.rowsBetween"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.WindowSpec.rowsBetween" title="Permalink to this definition">¶</a></dt>
<dd><p>Defines the frame boundaries, from <cite>start</cite> (inclusive) to <cite>end</cite> (inclusive).</p>
<p>Both <cite>start</cite> and <cite>end</cite> are relative positions from the current row.
For example, “0” means “current row”, while “-1” means the row before
the current row, and “5” means the fifth row after the current row.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>start</strong> – boundary start, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">-sys.maxsize</span></code> (or lower).</li>
<li><strong>end</strong> – boundary end, inclusive.
The frame is unbounded if this is <code class="docutils literal"><span class="pre">sys.maxsize</span></code> (or higher).</li>
</ul>
</td>
</tr>
</tbody>
</table>
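As a concrete illustration of ROWS-frame boundaries, a running aggregate over the frame <cite>[start, end]</cite> can be computed like this in plain Python (a sketch of the frame semantics only, not how Spark evaluates window functions):

```python
def rows_frame_sum(values, start, end):
    """Sum over the ROWS frame [i + start, i + end] for each row i,
    clamped to the partition bounds; negative offsets mean preceding
    rows, positive offsets mean following rows."""
    n = len(values)
    out = []
    for i in range(n):
        lo = max(0, i + start)
        hi = min(n - 1, i + end)
        out.append(sum(values[lo:hi + 1]))
    return out
```

For example, `start=-1, end=0` gives each row the sum of itself and its predecessor, matching `rowsBetween(-1, 0)` in SQL's `ROWS BETWEEN 1 PRECEDING AND CURRENT ROW`.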
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.DataFrameReader">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">DataFrameReader</code><span class="sig-paren">(</span><em>sqlContext</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader" title="Permalink to this definition">¶</a></dt>
<dd><p>Interface used to load a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> from external storage systems
(e.g. file systems, key-value stores, etc). Use <a class="reference internal" href="#pyspark.sql.SQLContext.read" title="pyspark.sql.SQLContext.read"><code class="xref py py-func docutils literal"><span class="pre">SQLContext.read()</span></code></a>
to access this.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.format">
<code class="descname">format</code><span class="sig-paren">(</span><em>source</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.format"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.format" title="Permalink to this definition">¶</a></dt>
<dd><p>Specifies the input data source format.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>source</strong> – string, name of the data source, e.g. ‘json’, ‘parquet’.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'json'</span><span class="p">)</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">'python/test_support/sql/people.json'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[('age', 'bigint'), ('name', 'string')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.jdbc">
<code class="descname">jdbc</code><span class="sig-paren">(</span><em>url</em>, <em>table</em>, <em>column=None</em>, <em>lowerBound=None</em>, <em>upperBound=None</em>, <em>numPartitions=None</em>, <em>predicates=None</em>, <em>properties=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.jdbc"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.jdbc" title="Permalink to this definition">¶</a></dt>
<dd><p>Construct a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> representing the database table accessible
via JDBC URL <cite>url</cite> named <cite>table</cite> and connection <cite>properties</cite>.</p>
<p>If the <cite>column</cite> parameter is specified, the table is partitioned on that
column and read in parallel, based on the <cite>lowerBound</cite>, <cite>upperBound</cite>,
and <cite>numPartitions</cite> parameters.</p>
<p>The <cite>predicates</cite> parameter gives a list of expressions suitable for inclusion
in WHERE clauses; each one defines one partition of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Don’t create too many partitions in parallel on a large cluster;
otherwise Spark might crash your external database systems.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>url</strong> – a JDBC URL</li>
<li><strong>table</strong> – name of table</li>
<li><strong>column</strong> – the column used to partition</li>
<li><strong>lowerBound</strong> – the lower bound of partition column</li>
<li><strong>upperBound</strong> – the upper bound of the partition column</li>
<li><strong>numPartitions</strong> – the number of partitions</li>
<li><strong>predicates</strong> – a list of expressions</li>
<li><strong>properties</strong> – JDBC database connection arguments: arbitrary string
tag/value pairs. Normally at least a “user” and “password” property
should be included.</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">a DataFrame</p>
</td>
</tr>
</tbody>
</table>
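To see how <cite>column</cite>, <cite>lowerBound</cite>, <cite>upperBound</cite>, and <cite>numPartitions</cite> interact, here is a hedged sketch of a stride-based split into per-partition WHERE clauses (illustrative only; the exact clauses Spark generates may differ):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Illustrative stride-based split of [lower, upper) on the
    partition column into one WHERE clause per partition; the first
    and last clauses are left unbounded so no rows are dropped."""
    if num_partitions <= 1:
        return ["1=1"]  # single partition: read everything
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            preds.append("%s < %d" % (column, lo + stride))
        elif i == num_partitions - 1:
            preds.append("%s >= %d" % (column, lo))
        else:
            preds.append("%s >= %d AND %s < %d" % (column, lo, column, lo + stride))
    return preds
```

Each predicate becomes a separate query issued in parallel, which is why the note above warns against too many partitions against a single database.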
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.json">
<code class="descname">json</code><span class="sig-paren">(</span><em>path</em>, <em>schema=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.json"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.json" title="Permalink to this definition">¶</a></dt>
<dd><p>Loads a JSON file (one object per line) and returns the result as
a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<p>If the <code class="docutils literal"><span class="pre">schema</span></code> parameter is not specified, this function goes
through the input once to determine the input schema.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – string, path to the JSON dataset.</li>
<li><strong>schema</strong> – an optional <code class="xref py py-class docutils literal"><span class="pre">StructType</span></code> for the input schema.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">json</span><span class="p">(</span><span class="s">'python/test_support/sql/people.json'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[('age', 'bigint'), ('name', 'string')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.load">
<code class="descname">load</code><span class="sig-paren">(</span><em>path=None</em>, <em>format=None</em>, <em>schema=None</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.load"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.load" title="Permalink to this definition">¶</a></dt>
<dd><p>Loads data from a data source and returns it as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – optional string for file-system backed data sources.</li>
<li><strong>format</strong> – optional string for the format of the data source. Defaults to ‘parquet’.</li>
<li><strong>schema</strong> – optional <code class="xref py py-class docutils literal"><span class="pre">StructType</span></code> for the input schema.</li>
<li><strong>options</strong> – all other string options</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">'python/test_support/sql/parquet_partitioned'</span><span class="p">,</span> <span class="n">opt1</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="gp">... </span> <span class="n">opt2</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">opt3</span><span class="o">=</span><span class="s">'str'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.option">
<code class="descname">option</code><span class="sig-paren">(</span><em>key</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.option"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.option" title="Permalink to this definition">¶</a></dt>
<dd><p>Adds an input option for the underlying data source.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.options">
<code class="descname">options</code><span class="sig-paren">(</span><em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.options"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.options" title="Permalink to this definition">¶</a></dt>
<dd><p>Adds input options for the underlying data source.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.orc">
<code class="descname">orc</code><span class="sig-paren">(</span><em>path</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.orc"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.orc" title="Permalink to this definition">¶</a></dt>
<dd><p>Loads an ORC file, returning the result as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Currently ORC support is only available together with
<a class="reference internal" href="#pyspark.sql.HiveContext" title="pyspark.sql.HiveContext"><code class="xref py py-class docutils literal"><span class="pre">HiveContext</span></code></a>.</p>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">hiveContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">orc</span><span class="p">(</span><span class="s">'python/test_support/sql/orc_partitioned'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[('a', 'bigint'), ('b', 'int'), ('c', 'int')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.parquet">
<code class="descname">parquet</code><span class="sig-paren">(</span><em>*paths</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.parquet"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.parquet" title="Permalink to this definition">¶</a></dt>
<dd><p>Loads a Parquet file, returning the result as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s">'python/test_support/sql/parquet_partitioned'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.schema">
<code class="descname">schema</code><span class="sig-paren">(</span><em>schema</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.schema"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.schema" title="Permalink to this definition">¶</a></dt>
<dd><p>Specifies the input schema.</p>
<p>Some data sources (e.g. JSON) can infer the input schema automatically from data.
By specifying the schema here, the underlying data source can skip the schema
inference step, and thus speed up data loading.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>schema</strong> – a StructType object</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameReader.table">
<code class="descname">table</code><span class="sig-paren">(</span><em>tableName</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameReader.table"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameReader.table" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the specified table as a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>tableName</strong> – string, name of the table.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s">'python/test_support/sql/parquet_partitioned'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">registerTempTable</span><span class="p">(</span><span class="s">'tmpTable'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">table</span><span class="p">(</span><span class="s">'tmpTable'</span><span class="p">)</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.DataFrameWriter">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">DataFrameWriter</code><span class="sig-paren">(</span><em>df</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter" title="Permalink to this definition">¶</a></dt>
<dd><p>Interface used to write a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to external storage systems
(e.g. file systems, key-value stores). Use <a class="reference internal" href="#pyspark.sql.DataFrame.write" title="pyspark.sql.DataFrame.write"><code class="xref py py-func docutils literal"><span class="pre">DataFrame.write()</span></code></a>
to access this.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Experimental.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.format">
<code class="descname">format</code><span class="sig-paren">(</span><em>source</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.format"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.format" title="Permalink to this definition">¶</a></dt>
<dd><p>Specifies the underlying output data source.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>source</strong> – string, name of the data source, e.g. ‘json’, ‘parquet’.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'json'</span><span class="p">)</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s">'data'</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.insertInto">
<code class="descname">insertInto</code><span class="sig-paren">(</span><em>tableName</em>, <em>overwrite=False</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.insertInto"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.insertInto" title="Permalink to this definition">¶</a></dt>
<dd><p>Inserts the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> into the specified table.</p>
<p>It requires that the schema of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> is the same as the
schema of the table.</p>
<p>If <cite>overwrite</cite> is true, any existing data is overwritten.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.jdbc">
<code class="descname">jdbc</code><span class="sig-paren">(</span><em>url</em>, <em>table</em>, <em>mode=None</em>, <em>properties=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.jdbc"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.jdbc" title="Permalink to this definition">¶</a></dt>
<dd><p>Saves the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to an external database table via JDBC.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Don’t create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>url</strong> – a JDBC URL of the form <code class="docutils literal"><span class="pre">jdbc:subprotocol:subname</span></code></li>
<li><strong>table</strong> – Name of the table in the external database.</li>
<li><strong>mode</strong> – <p>specifies the behavior of the save operation when data already exists.</p>
<ul>
<li><code class="docutils literal"><span class="pre">append</span></code>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><code class="docutils literal"><span class="pre">overwrite</span></code>: Overwrite existing data.</li>
<li><code class="docutils literal"><span class="pre">ignore</span></code>: Silently ignore this operation if data already exists.</li>
<li><code class="docutils literal"><span class="pre">error</span></code> (default case): Throw an exception if data already exists.</li>
</ul>
</li>
<li><strong>properties</strong> – JDBC database connection arguments, a dictionary of
arbitrary string tag/value pairs. Normally at least a
“user” and “password” property should be included.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
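<p>As a minimal sketch of the argument shapes this method expects (the host, database name, and credentials below are hypothetical, and the commented-out write would need a live <code>DataFrame</code>):</p>

```python
# Hypothetical connection settings, for illustration only.
url = "jdbc:postgresql://localhost:5432/testdb"       # jdbc:subprotocol:subname
table = "people"
properties = {"user": "spark", "password": "secret"}  # string tag/value pairs

# With a live DataFrame this would be:
#     df.write.jdbc(url, table, mode='append', properties=properties)
```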
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.json">
<code class="descname">json</code><span class="sig-paren">(</span><em>path</em>, <em>mode=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.json"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.json" title="Permalink to this definition">¶</a></dt>
<dd><p>Saves the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> in JSON format at the specified path.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – the path in any Hadoop-supported file system</li>
<li><strong>mode</strong> – <p>specifies the behavior of the save operation when data already exists.</p>
<ul>
<li><code class="docutils literal"><span class="pre">append</span></code>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><code class="docutils literal"><span class="pre">overwrite</span></code>: Overwrite existing data.</li>
<li><code class="docutils literal"><span class="pre">ignore</span></code>: Silently ignore this operation if data already exists.</li>
<li><code class="docutils literal"><span class="pre">error</span></code> (default case): Throw an exception if data already exists.</li>
</ul>
</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">json</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s">'data'</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.mode">
<code class="descname">mode</code><span class="sig-paren">(</span><em>saveMode</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.mode"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.mode" title="Permalink to this definition">¶</a></dt>
<dd><p>Specifies the behavior when the data or table already exists.</p>
<p>Options include:</p>
<ul class="simple">
<li><cite>append</cite>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><cite>overwrite</cite>: Overwrite existing data.</li>
<li><cite>error</cite>: Throw an exception if data already exists.</li>
<li><cite>ignore</cite>: Silently ignore this operation if data already exists.</li>
</ul>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">mode</span><span class="p">(</span><span class="s">'append'</span><span class="p">)</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s">'data'</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
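<p>The semantics of the four modes can be pictured with a small self-contained sketch (a toy model against a dict-backed store, not Spark&#8217;s implementation):</p>

```python
def apply_save_mode(store, key, new_data, mode='error'):
    """Toy model of the four save modes against a dict-backed store."""
    if key not in store:
        store[key] = list(new_data)
    elif mode == 'append':
        store[key] = store[key] + list(new_data)   # add to existing data
    elif mode == 'overwrite':
        store[key] = list(new_data)                # replace existing data
    elif mode == 'ignore':
        pass                                       # silently keep existing data
    else:                                          # 'error', the default
        raise ValueError('data already exists at %r' % key)
    return store

store = {'data': [1, 2]}
apply_save_mode(store, 'data', [3], mode='append')     # store['data'] == [1, 2, 3]
apply_save_mode(store, 'data', [9], mode='overwrite')  # store['data'] == [9]
```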
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.option">
<code class="descname">option</code><span class="sig-paren">(</span><em>key</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.option"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.option" title="Permalink to this definition">¶</a></dt>
<dd><p>Adds an output option for the underlying data source.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.options">
<code class="descname">options</code><span class="sig-paren">(</span><em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.options"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.options" title="Permalink to this definition">¶</a></dt>
<dd><p>Adds output options for the underlying data source.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.orc">
<code class="descname">orc</code><span class="sig-paren">(</span><em>path</em>, <em>mode=None</em>, <em>partitionBy=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.orc"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.orc" title="Permalink to this definition">¶</a></dt>
<dd><p>Saves the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> in ORC format at the specified path.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Currently ORC support is only available together with
<a class="reference internal" href="#pyspark.sql.HiveContext" title="pyspark.sql.HiveContext"><code class="xref py py-class docutils literal"><span class="pre">HiveContext</span></code></a>.</p>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – the path in any Hadoop-supported file system</li>
<li><strong>mode</strong> – <p>specifies the behavior of the save operation when data already exists.</p>
<ul>
<li><code class="docutils literal"><span class="pre">append</span></code>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><code class="docutils literal"><span class="pre">overwrite</span></code>: Overwrite existing data.</li>
<li><code class="docutils literal"><span class="pre">ignore</span></code>: Silently ignore this operation if data already exists.</li>
<li><code class="docutils literal"><span class="pre">error</span></code> (default case): Throw an exception if data already exists.</li>
</ul>
</li>
<li><strong>partitionBy</strong> – names of partitioning columns</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">orc_df</span> <span class="o">=</span> <span class="n">hiveContext</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">orc</span><span class="p">(</span><span class="s">'python/test_support/sql/orc_partitioned'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">orc_df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">orc</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s">'data'</span><span class="p">))</span>
</pre></div>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.parquet">
<code class="descname">parquet</code><span class="sig-paren">(</span><em>path</em>, <em>mode=None</em>, <em>partitionBy=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.parquet"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.parquet" title="Permalink to this definition">¶</a></dt>
<dd><p>Saves the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> in Parquet format at the specified path.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – the path in any Hadoop-supported file system</li>
<li><strong>mode</strong> – <p>specifies the behavior of the save operation when data already exists.</p>
<ul>
<li><code class="docutils literal"><span class="pre">append</span></code>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><code class="docutils literal"><span class="pre">overwrite</span></code>: Overwrite existing data.</li>
<li><code class="docutils literal"><span class="pre">ignore</span></code>: Silently ignore this operation if data already exists.</li>
<li><code class="docutils literal"><span class="pre">error</span></code> (default case): Throw an exception if data already exists.</li>
</ul>
</li>
<li><strong>partitionBy</strong> – names of partitioning columns</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s">'data'</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.partitionBy">
<code class="descname">partitionBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.partitionBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.partitionBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Partitions the output by the given columns on the file system.</p>
<p>If specified, the output is laid out on the file system similarly
to Hive’s partitioning scheme.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – names of the columns</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">partitionBy</span><span class="p">(</span><span class="s">'year'</span><span class="p">,</span> <span class="s">'month'</span><span class="p">)</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s">'data'</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
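<p>For example, with <code>partitionBy('year', 'month')</code> each row lands under a directory path of the form <code>&lt;base&gt;/year=.../month=...</code>. A sketch of that layout in plain Python, no Spark needed:</p>

```python
rows = [
    {'name': 'Alice', 'year': 2015, 'month': 4},
    {'name': 'Bob',   'year': 2015, 'month': 5},
]
partition_cols = ['year', 'month']

def partition_path(base, row, cols):
    # Hive-style layout: one 'column=value' directory level
    # per partitioning column, in the order given.
    return '/'.join([base] + ['%s=%s' % (c, row[c]) for c in cols])

paths = [partition_path('data', r, partition_cols) for r in rows]
# paths == ['data/year=2015/month=4', 'data/year=2015/month=5']
```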
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.save">
<code class="descname">save</code><span class="sig-paren">(</span><em>path=None</em>, <em>format=None</em>, <em>mode=None</em>, <em>partitionBy=None</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.save"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.save" title="Permalink to this definition">¶</a></dt>
<dd><p>Saves the contents of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to a data source.</p>
<p>The data source is specified by the <code class="docutils literal"><span class="pre">format</span></code> and a set of <code class="docutils literal"><span class="pre">options</span></code>.
If <code class="docutils literal"><span class="pre">format</span></code> is not specified, the default data source configured by
<code class="docutils literal"><span class="pre">spark.sql.sources.default</span></code> will be used.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>path</strong> – the path in a Hadoop-supported file system</li>
<li><strong>format</strong> – the format used to save</li>
<li><strong>mode</strong> – <p>specifies the behavior of the save operation when data already exists.</p>
<ul>
<li><code class="docutils literal"><span class="pre">append</span></code>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><code class="docutils literal"><span class="pre">overwrite</span></code>: Overwrite existing data.</li>
<li><code class="docutils literal"><span class="pre">ignore</span></code>: Silently ignore this operation if data already exists.</li>
<li><code class="docutils literal"><span class="pre">error</span></code> (default case): Throw an exception if data already exists.</li>
</ul>
</li>
<li><strong>partitionBy</strong> – names of partitioning columns</li>
<li><strong>options</strong> – all other string options</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">write</span><span class="o">.</span><span class="n">mode</span><span class="p">(</span><span class="s">'append'</span><span class="p">)</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">(),</span> <span class="s">'data'</span><span class="p">))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.DataFrameWriter.saveAsTable">
<code class="descname">saveAsTable</code><span class="sig-paren">(</span><em>name</em>, <em>format=None</em>, <em>mode=None</em>, <em>partitionBy=None</em>, <em>**options</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/readwriter.html#DataFrameWriter.saveAsTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrameWriter.saveAsTable" title="Permalink to this definition">¶</a></dt>
<dd><p>Saves the content of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> as the specified table.</p>
<p>If the table already exists, the behavior of this function depends on the
save mode, specified by the <cite>mode</cite> function (the default is to throw an exception).
When <cite>mode</cite> is <cite>overwrite</cite>, the schema of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> does not need to be
the same as that of the existing table.</p>
<ul class="simple">
<li><cite>append</cite>: Append contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code></a> to existing data.</li>
<li><cite>overwrite</cite>: Overwrite existing data.</li>
<li><cite>error</cite>: Throw an exception if data already exists.</li>
<li><cite>ignore</cite>: Silently ignore this operation if data already exists.</li>
</ul>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>name</strong> – the table name</li>
<li><strong>format</strong> – the format used to save</li>
<li><strong>mode</strong> – one of <cite>append</cite>, <cite>overwrite</cite>, <cite>error</cite>, <cite>ignore</cite> (default: error)</li>
<li><strong>partitionBy</strong> – names of partitioning columns</li>
<li><strong>options</strong> – all other string options</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
</dd></dl>
</div>
<div class="section" id="module-pyspark.sql.types">
<span id="pyspark-sql-types-module"></span><h2>pyspark.sql.types module<a class="headerlink" href="#module-pyspark.sql.types" title="Permalink to this headline">¶</a></h2>
<dl class="class">
<dt id="pyspark.sql.types.DataType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">DataType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType" title="Permalink to this definition">¶</a></dt>
<dd><p>Base class for data types.</p>
<dl class="method">
<dt id="pyspark.sql.types.DataType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.fromInternal" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts an internal SQL object into a native Python object.</p>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DataType.json">
<code class="descname">json</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.json"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.json" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DataType.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.jsonValue" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DataType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.needConversion" title="Permalink to this definition">¶</a></dt>
<dd><p>Does this type need conversion between a Python object and an internal SQL object?</p>
<p>This is used to avoid the unnecessary conversion for ArrayType/MapType/StructType.</p>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DataType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.simpleString" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DataType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.toInternal" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts a Python object into an internal SQL object.</p>
</dd></dl>
<dl class="classmethod">
<dt id="pyspark.sql.types.DataType.typeName">
<em class="property">classmethod </em><code class="descname">typeName</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DataType.typeName"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DataType.typeName" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
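<p>The <code>json()</code> and <code>jsonValue()</code> methods above serialize a type to Spark&#8217;s JSON schema representation. The snippet below is a plain-Python sketch of that shape for a one-field struct (the field name is made up; the authoritative output comes from the pyspark classes themselves):</p>

```python
import json

# Roughly the shape a StructType with a single nullable string
# field renders to; every field carries name/type/nullable/metadata.
schema = {
    'type': 'struct',
    'fields': [
        {'name': 'name', 'type': 'string', 'nullable': True, 'metadata': {}},
    ],
}
as_json = json.dumps(schema)          # what json() would emit as a string
round_tripped = json.loads(as_json)   # and what a parser reads back
```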
<dl class="class">
<dt id="pyspark.sql.types.NullType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">NullType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#NullType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.NullType" title="Permalink to this definition">¶</a></dt>
<dd><p>Null type.</p>
<p>The data type representing None, used for the types that cannot be inferred.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.StringType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">StringType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#StringType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StringType" title="Permalink to this definition">¶</a></dt>
<dd><p>String data type.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.BinaryType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">BinaryType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#BinaryType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.BinaryType" title="Permalink to this definition">¶</a></dt>
<dd><p>Binary (byte array) data type.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.BooleanType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">BooleanType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#BooleanType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.BooleanType" title="Permalink to this definition">¶</a></dt>
<dd><p>Boolean data type.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.DateType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">DateType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#DateType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DateType" title="Permalink to this definition">¶</a></dt>
<dd><p>Date (datetime.date) data type.</p>
<dl class="attribute">
<dt id="pyspark.sql.types.DateType.EPOCH_ORDINAL">
<code class="descname">EPOCH_ORDINAL</code><em class="property"> = 719163</em><a class="headerlink" href="#pyspark.sql.types.DateType.EPOCH_ORDINAL" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DateType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>v</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DateType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DateType.fromInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DateType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DateType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DateType.needConversion" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DateType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>d</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DateType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DateType.toInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
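<p>Internally, Spark stores a date as the number of days since the Unix epoch; <code>EPOCH_ORDINAL</code> (719163) is the proleptic Gregorian ordinal of 1970-01-01. A plain-Python sketch of the conversion the two methods above perform (the function names here are illustrative, not the pyspark API):</p>

```python
import datetime

# 719163 is the proleptic Gregorian ordinal of the Unix epoch (1970-01-01).
EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()

def to_internal(d):
    # Mirrors DateType.toInternal: days since 1970-01-01.
    return d.toordinal() - EPOCH_ORDINAL

def from_internal(v):
    # Mirrors DateType.fromInternal: the inverse of to_internal.
    return datetime.date.fromordinal(v + EPOCH_ORDINAL)

assert EPOCH_ORDINAL == 719163
assert to_internal(datetime.date(1970, 1, 2)) == 1
d = datetime.date(2000, 1, 2)
assert from_internal(to_internal(d)) == d
```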
<dl class="class">
<dt id="pyspark.sql.types.TimestampType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">TimestampType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#TimestampType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.TimestampType" title="Permalink to this definition">¶</a></dt>
<dd><p>Timestamp (datetime.datetime) data type.</p>
<dl class="method">
<dt id="pyspark.sql.types.TimestampType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>ts</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#TimestampType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.TimestampType.fromInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.TimestampType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#TimestampType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.TimestampType.needConversion" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.TimestampType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>dt</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#TimestampType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.TimestampType.toInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
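<p>Internally, a timestamp is stored as microseconds since the Unix epoch. A plain-Python sketch of the conversion performed by the two methods above (illustrative names, with naive datetimes treated as UTC; not the pyspark API):</p>

```python
import calendar
import datetime

def to_internal(dt):
    # Mirrors TimestampType.toInternal: microseconds since the epoch.
    seconds = calendar.timegm(dt.utctimetuple())
    return seconds * 1_000_000 + dt.microsecond

def from_internal(ts):
    # Mirrors TimestampType.fromInternal: the inverse of to_internal.
    seconds, micros = divmod(ts, 1_000_000)
    base = datetime.datetime(1970, 1, 1) + datetime.timedelta(seconds=seconds)
    return base.replace(microsecond=micros)

dt = datetime.datetime(1970, 1, 1, 0, 0, 1, 500000)
assert to_internal(dt) == 1_500_000
assert from_internal(to_internal(dt)) == dt
```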
<dl class="class">
<dt id="pyspark.sql.types.DecimalType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">DecimalType</code><span class="sig-paren">(</span><em>precision=10</em>, <em>scale=0</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DecimalType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DecimalType" title="Permalink to this definition">¶</a></dt>
<dd><p>Decimal (decimal.Decimal) data type.</p>
<p>The DecimalType must have fixed precision (the maximum total number of digits)
and scale (the number of digits to the right of the decimal point). For example, (5, 2) can
support values in the range [-999.99, 999.99].</p>
<p>The precision can be up to 38; the scale must be less than or equal to the precision.</p>
<p>When creating a DecimalType, the default precision and scale are (10, 0). When inferring
a schema from decimal.Decimal objects, the result will be DecimalType(38, 18).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>precision</strong> – the maximum total number of digits (default: 10)</li>
<li><strong>scale</strong> – the number of digits to the right of the decimal point (default: 0)</li>
</ul>
</td>
</tr>
</tbody>
</table>
<dl class="method">
<dt id="pyspark.sql.types.DecimalType.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DecimalType.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DecimalType.jsonValue" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.DecimalType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#DecimalType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DecimalType.simpleString" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
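<p>A plain-Python illustration of the precision/scale bound described above, using the standard <code>decimal</code> module (the <code>fits</code> helper is illustrative, not the pyspark API):</p>

```python
from decimal import Decimal

# (precision, scale) = (5, 2): at most 5 digits in total, 2 of them after
# the decimal point, so representable values lie in [-999.99, 999.99].
precision, scale = 5, 2
max_value = Decimal(10) ** (precision - scale) - Decimal(1).scaleb(-scale)
assert max_value == Decimal("999.99")

def fits(x, precision, scale):
    # True if x is representable exactly at this precision and scale.
    q = x.quantize(Decimal(1).scaleb(-scale))
    return q == x and len(q.as_tuple().digits) <= precision

assert fits(Decimal("999.99"), 5, 2)
assert not fits(Decimal("1000.00"), 5, 2)   # needs 6 digits of precision
```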
<dl class="class">
<dt id="pyspark.sql.types.DoubleType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">DoubleType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#DoubleType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.DoubleType" title="Permalink to this definition">¶</a></dt>
<dd><p>Double data type, representing double precision floats.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.FloatType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">FloatType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#FloatType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.FloatType" title="Permalink to this definition">¶</a></dt>
<dd><p>Float data type, representing single precision floats.</p>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.ByteType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">ByteType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#ByteType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ByteType" title="Permalink to this definition">¶</a></dt>
<dd><p>Byte data type, i.e. a signed integer in a single byte.</p>
<dl class="method">
<dt id="pyspark.sql.types.ByteType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ByteType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ByteType.simpleString" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.IntegerType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">IntegerType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#IntegerType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.IntegerType" title="Permalink to this definition">¶</a></dt>
<dd><p>Int data type, i.e. a signed 32-bit integer.</p>
<dl class="method">
<dt id="pyspark.sql.types.IntegerType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#IntegerType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.IntegerType.simpleString" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.LongType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">LongType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#LongType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.LongType" title="Permalink to this definition">¶</a></dt>
<dd><p>Long data type, i.e. a signed 64-bit integer.</p>
<p>If the values are beyond the range of [-9223372036854775808, 9223372036854775807],
please use <a class="reference internal" href="#pyspark.sql.types.DecimalType" title="pyspark.sql.types.DecimalType"><code class="xref py py-class docutils literal"><span class="pre">DecimalType</span></code></a>.</p>
<dl class="method">
<dt id="pyspark.sql.types.LongType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#LongType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.LongType.simpleString" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
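<p>The bound quoted above is the signed 64-bit integer range; a quick plain-Python check (the helper is illustrative, not the pyspark API):</p>

```python
# LongType covers the signed 64-bit integer range.
LONG_MIN, LONG_MAX = -(2 ** 63), 2 ** 63 - 1

def fits_long(n):
    # True if n can be stored in a LongType column without overflow.
    return LONG_MIN <= n <= LONG_MAX

assert LONG_MAX == 9223372036854775807
assert fits_long(LONG_MIN) and fits_long(LONG_MAX)
assert not fits_long(LONG_MAX + 1)  # such values call for DecimalType
```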
<dl class="class">
<dt id="pyspark.sql.types.ShortType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">ShortType</code><a class="reference internal" href="_modules/pyspark/sql/types.html#ShortType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ShortType" title="Permalink to this definition">¶</a></dt>
<dd><p>Short data type, i.e. a signed 16-bit integer.</p>
<dl class="method">
<dt id="pyspark.sql.types.ShortType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ShortType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ShortType.simpleString" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="class">
<dt id="pyspark.sql.types.ArrayType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">ArrayType</code><span class="sig-paren">(</span><em>elementType</em>, <em>containsNull=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType" title="Permalink to this definition">¶</a></dt>
<dd><p>Array data type.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>elementType</strong> – <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">DataType</span></code></a> of each element in the array.</li>
<li><strong>containsNull</strong> – boolean, whether the array can contain null (None) values.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<dl class="method">
<dt id="pyspark.sql.types.ArrayType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.fromInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="classmethod">
<dt id="pyspark.sql.types.ArrayType.fromJson">
<em class="property">classmethod </em><code class="descname">fromJson</code><span class="sig-paren">(</span><em>json</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.fromJson"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.fromJson" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.ArrayType.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.jsonValue" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.ArrayType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.needConversion" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.ArrayType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.simpleString" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.ArrayType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#ArrayType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.ArrayType.toInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
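<p>For orientation, an <code>ArrayType</code> round-trips through a small JSON object via <code>jsonValue()</code> and <code>fromJson()</code>. A plain-Python sketch of the expected shape for <code>ArrayType(StringType())</code> (the dict literal and helper are illustrative of the Spark JSON schema format, not the pyspark API):</p>

```python
import json

# Illustrative shape of ArrayType(StringType(), containsNull=True).jsonValue().
array_of_strings = {
    "type": "array",
    "elementType": "string",
    "containsNull": True,
}

def simple_string(element_simple):
    # simpleString() renders the same type compactly, e.g. 'array<string>'.
    return "array<%s>" % element_simple

assert simple_string("string") == "array<string>"
assert json.loads(json.dumps(array_of_strings))["type"] == "array"
```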
<dl class="class">
<dt id="pyspark.sql.types.MapType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">MapType</code><span class="sig-paren">(</span><em>keyType</em>, <em>valueType</em>, <em>valueContainsNull=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType" title="Permalink to this definition">¶</a></dt>
<dd><p>Map data type.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>keyType</strong> – <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">DataType</span></code></a> of the keys in the map.</li>
<li><strong>valueType</strong> – <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">DataType</span></code></a> of the values in the map.</li>
<li><strong>valueContainsNull</strong> – indicates whether values can contain null (None) values.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<p>Keys in a map data type are not allowed to be null (None).</p>
<dl class="method">
<dt id="pyspark.sql.types.MapType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.fromInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="classmethod">
<dt id="pyspark.sql.types.MapType.fromJson">
<em class="property">classmethod </em><code class="descname">fromJson</code><span class="sig-paren">(</span><em>json</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.fromJson"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.fromJson" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.MapType.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.jsonValue" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.MapType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.needConversion" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.MapType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.simpleString" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.MapType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#MapType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.MapType.toInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
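<p>As with <code>ArrayType</code>, the compact <code>simpleString()</code> form pairs the key and value types; a plain-Python sketch (illustrative helper, not the pyspark API):</p>

```python
# Illustrative: MapType(StringType(), IntegerType()).simpleString() renders
# as 'map<string,int>'. Keys can never be null; values may be, unless
# valueContainsNull=False.
def simple_string(key_simple, value_simple):
    return "map<%s,%s>" % (key_simple, value_simple)

assert simple_string("string", "int") == "map<string,int>"
```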
<dl class="class">
<dt id="pyspark.sql.types.StructField">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">StructField</code><span class="sig-paren">(</span><em>name</em>, <em>dataType</em>, <em>nullable=True</em>, <em>metadata=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField" title="Permalink to this definition">¶</a></dt>
<dd><p>A field in <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal"><span class="pre">StructType</span></code></a>.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>name</strong> – string, name of the field.</li>
<li><strong>dataType</strong> – <a class="reference internal" href="#pyspark.sql.types.DataType" title="pyspark.sql.types.DataType"><code class="xref py py-class docutils literal"><span class="pre">DataType</span></code></a> of the field.</li>
<li><strong>nullable</strong> – boolean, whether the field can be null (None) or not.</li>
<li><strong>metadata</strong> – a dict from string to simple type that can be automatically converted to JSON</li>
</ul>
</td>
</tr>
</tbody>
</table>
<dl class="method">
<dt id="pyspark.sql.types.StructField.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.fromInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="classmethod">
<dt id="pyspark.sql.types.StructField.fromJson">
<em class="property">classmethod </em><code class="descname">fromJson</code><span class="sig-paren">(</span><em>json</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.fromJson"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.fromJson" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructField.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.jsonValue" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructField.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.needConversion" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructField.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.simpleString" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructField.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructField.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructField.toInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
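<p>A <code>StructField</code> also serializes to a small JSON object via <code>jsonValue()</code>; a plain-Python sketch of the expected shape for a nullable string field named <code>f1</code> (the dict literal is illustrative of the Spark JSON schema format):</p>

```python
import json

# Illustrative shape of StructField("f1", StringType(), True).jsonValue();
# metadata defaults to an empty dict when None is passed.
field = {
    "name": "f1",
    "type": "string",
    "nullable": True,
    "metadata": {},
}
assert json.loads(json.dumps(field))["nullable"] is True
```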
<dl class="class">
<dt id="pyspark.sql.types.StructType">
<em class="property">class </em><code class="descclassname">pyspark.sql.types.</code><code class="descname">StructType</code><span class="sig-paren">(</span><em>fields=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType" title="Permalink to this definition">¶</a></dt>
<dd><p>Struct type, consisting of a list of <a class="reference internal" href="#pyspark.sql.types.StructField" title="pyspark.sql.types.StructField"><code class="xref py py-class docutils literal"><span class="pre">StructField</span></code></a>.</p>
<p>This is the data type representing a <code class="xref py py-class docutils literal"><span class="pre">Row</span></code>.</p>
<dl class="method">
<dt id="pyspark.sql.types.StructType.add">
<code class="descname">add</code><span class="sig-paren">(</span><em>field</em>, <em>data_type=None</em>, <em>nullable=True</em>, <em>metadata=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.add"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.add" title="Permalink to this definition">¶</a></dt>
<dd><p>Construct a StructType by adding new fields that define the schema. The method accepts
either:</p>
<blockquote>
<div><ol class="loweralpha simple">
<li>A single parameter which is a StructField object.</li>
<li>Between 2 and 4 parameters as (name, data_type, nullable (optional),
metadata (optional)). The data_type parameter may be either a String or a
DataType object.</li>
</ol>
</div></blockquote>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">struct1</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">()</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="s">"f1"</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">)</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="s">"f2"</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">struct2</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s">"f1"</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">),</span> <span class="n">StructField</span><span class="p">(</span><span class="s">"f2"</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">,</span> <span class="bp">None</span><span class="p">)])</span>
<span class="gp">>>> </span><span class="n">struct1</span> <span class="o">==</span> <span class="n">struct2</span>
<span class="go">True</span>
<span class="gp">>>> </span><span class="n">struct1</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">()</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">StructField</span><span class="p">(</span><span class="s">"f1"</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">))</span>
<span class="gp">>>> </span><span class="n">struct2</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s">"f1"</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">)])</span>
<span class="gp">>>> </span><span class="n">struct1</span> <span class="o">==</span> <span class="n">struct2</span>
<span class="go">True</span>
<span class="gp">>>> </span><span class="n">struct1</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">()</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="s">"f1"</span><span class="p">,</span> <span class="s">"string"</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">struct2</span> <span class="o">=</span> <span class="n">StructType</span><span class="p">([</span><span class="n">StructField</span><span class="p">(</span><span class="s">"f1"</span><span class="p">,</span> <span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">)])</span>
<span class="gp">>>> </span><span class="n">struct1</span> <span class="o">==</span> <span class="n">struct2</span>
<span class="go">True</span>
</pre></div>
</div>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first simple">
<li><strong>field</strong> – Either the name of the field or a StructField object</li>
<li><strong>data_type</strong> – If present, the DataType of the StructField to create</li>
<li><strong>nullable</strong> – Whether the field to add should be nullable (default True)</li>
<li><strong>metadata</strong> – Any additional metadata (default None)</li>
</ul>
</td>
</tr>
<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body"><p class="first last">a new updated StructType</p>
</td>
</tr>
</tbody>
</table>
</dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructType.fromInternal">
<code class="descname">fromInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.fromInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.fromInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="classmethod">
<dt id="pyspark.sql.types.StructType.fromJson">
<em class="property">classmethod </em><code class="descname">fromJson</code><span class="sig-paren">(</span><em>json</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.fromJson"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.fromJson" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructType.jsonValue">
<code class="descname">jsonValue</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.jsonValue"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.jsonValue" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructType.needConversion">
<code class="descname">needConversion</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.needConversion"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.needConversion" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructType.simpleString">
<code class="descname">simpleString</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.simpleString"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.simpleString" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
<dl class="method">
<dt id="pyspark.sql.types.StructType.toInternal">
<code class="descname">toInternal</code><span class="sig-paren">(</span><em>obj</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/types.html#StructType.toInternal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.types.StructType.toInternal" title="Permalink to this definition">¶</a></dt>
<dd></dd></dl>
</dd></dl>
</div>
<div class="section" id="module-pyspark.sql.functions">
<span id="pyspark-sql-functions-module"></span><h2>pyspark.sql.functions module<a class="headerlink" href="#module-pyspark.sql.functions" title="Permalink to this headline">¶</a></h2>
<p>A collection of built-in functions.</p>
<dl class="function">
<dt id="pyspark.sql.functions.abs">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">abs</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.abs" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the absolute value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.acos">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">acos</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.acos" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the cosine inverse of the given value; the returned angle is in the range 0.0 through pi.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.add_months">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">add_months</code><span class="sig-paren">(</span><em>start</em>, <em>months</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#add_months"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.add_months" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the date that is <cite>months</cite> months after <cite>start</cite>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'d'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">add_months</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'d'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=datetime.date(2015, 5, 8))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.approxCountDistinct">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">approxCountDistinct</code><span class="sig-paren">(</span><em>col</em>, <em>rsd=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#approxCountDistinct"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.approxCountDistinct" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> for approximate distinct count of <code class="docutils literal"><span class="pre">col</span></code>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">approxCountDistinct</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'c'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(c=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.array">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">array</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#array"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.array" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates a new array column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string) or list of <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> expressions that have
the same data type.</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">array</span><span class="p">(</span><span class="s">'age'</span><span class="p">,</span> <span class="s">'age'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"arr"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(arr=[2, 2]), Row(arr=[5, 5])]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">array</span><span class="p">([</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">])</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"arr"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(arr=[2, 2]), Row(arr=[5, 5])]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.array_contains">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">array_contains</code><span class="sig-paren">(</span><em>col</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#array_contains"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.array_contains" title="Permalink to this definition">¶</a></dt>
<dd><p>Collection function: returns True if the array contains the given value. The collection
elements and value must be of the same type.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – name of column containing array</li>
<li><strong>value</strong> – value to check for in array</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([([</span><span class="s">"a"</span><span class="p">,</span> <span class="s">"b"</span><span class="p">,</span> <span class="s">"c"</span><span class="p">],),</span> <span class="p">([],)],</span> <span class="p">[</span><span class="s">'data'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">array_contains</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="s">"a"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(array_contains(data,a)=True), Row(array_contains(data,a)=False)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.asc">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">asc</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.asc" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a sort expression based on the ascending order of the given column name.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.ascii">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">ascii</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.ascii" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the numeric value of the first character of the string column.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
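Per row, <code>ascii</code> yields the code point of the string's first character. A minimal plain-Python sketch of that semantics (an illustration, not the Spark implementation; <code>ascii_of</code> is a hypothetical helper name):

```python
# Plain-Python sketch of what ascii() computes per row:
# the numeric value (code point) of the string's first character.
def ascii_of(s):
    return ord(s[0]) if s else None

print(ascii_of("Spark"))  # 83, the code point of 'S'
```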
<dl class="function">
<dt id="pyspark.sql.functions.asin">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">asin</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.asin" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the sine inverse of the given value; the returned angle is in the range -pi/2 through pi/2.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.atan">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">atan</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.atan" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the tangent inverse of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.atan2">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">atan2</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.atan2" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
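The ranges quoted for <code>acos</code>, <code>asin</code>, <code>atan</code>, and <code>atan2</code> match the usual inverse-trigonometric conventions, which Python's <code>math</code> module follows as well. A quick stdlib illustration of the math (not Spark itself):

```python
import math

# acos -> [0, pi], asin -> [-pi/2, pi/2], atan -> (-pi/2, pi/2)
print(math.acos(-1.0))       # pi: the top of acos's range
print(math.asin(1.0))        # pi/2: the top of asin's range
# atan2(y, x) returns theta for the point (x, y) in polar coordinates
print(math.atan2(1.0, 1.0))  # pi/4 for the point (1, 1)
```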
<dl class="function">
<dt id="pyspark.sql.functions.avg">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">avg</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.avg" title="Permalink to this definition">¶</a></dt>
<dd><p>Aggregate function: returns the average of the values in a group.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.base64">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">base64</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.base64" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the BASE64 encoding of a binary column and returns it as a string column.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
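The encoding applied per row is standard BASE64, the same one Python's stdlib <code>base64</code> module produces; a stdlib round-trip sketch:

```python
import base64

# Binary input in, BASE64 text out -- and back again.
encoded = base64.b64encode(b"Spark").decode("ascii")
decoded = base64.b64decode(encoded)
print(encoded, decoded)
```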
<dl class="function">
<dt id="pyspark.sql.functions.bin">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">bin</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#bin"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.bin" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the string representation of the binary value of the given column.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="nb">bin</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'c'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(c=u'10'), Row(c=u'101')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.bitwiseNOT">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">bitwiseNOT</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.bitwiseNOT" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the bitwise NOT of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
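Bitwise NOT flips every bit of the value; on Python integers (a two's-complement view) that is the <code>~</code> operator, so <code>~x == -x - 1</code>. A plain-Python illustration of the semantics:

```python
# Bitwise NOT in two's complement: ~x == -x - 1.
for x in (0, 1, 41):
    print(x, ~x)  # 0 -> -1, 1 -> -2, 41 -> -42
```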
<dl class="function">
<dt id="pyspark.sql.functions.cbrt">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">cbrt</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.cbrt" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the cube root of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.ceil">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">ceil</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.ceil" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the ceiling of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
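A plain-Python sketch of the per-row semantics of <code>cbrt</code> and <code>ceil</code> (an illustration only; the <code>cbrt</code> helper here is hypothetical and handles negative inputs explicitly, since fractional powers of negative floats raise in Python):

```python
import math

def cbrt(x):
    # Cube root via exponentiation; copysign keeps negatives working.
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

print(cbrt(27.0))      # approximately 3.0
print(cbrt(-8.0))      # approximately -2.0
print(math.ceil(2.1))  # 3
```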
<dl class="function">
<dt id="pyspark.sql.functions.coalesce">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">coalesce</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#coalesce"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.coalesce" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the first column that is not null.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">cDf</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="bp">None</span><span class="p">),</span> <span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="mi">2</span><span class="p">)],</span> <span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="s">"b"</span><span class="p">))</span>
<span class="gp">>>> </span><span class="n">cDf</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+----+</span>
<span class="go">| a| b|</span>
<span class="go">+----+----+</span>
<span class="go">|null|null|</span>
<span class="go">| 1|null|</span>
<span class="go">|null| 2|</span>
<span class="go">+----+----+</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">cDf</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">coalesce</span><span class="p">(</span><span class="n">cDf</span><span class="p">[</span><span class="s">"a"</span><span class="p">],</span> <span class="n">cDf</span><span class="p">[</span><span class="s">"b"</span><span class="p">]))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-------------+</span>
<span class="go">|coalesce(a,b)|</span>
<span class="go">+-------------+</span>
<span class="go">| null|</span>
<span class="go">| 1|</span>
<span class="go">| 2|</span>
<span class="go">+-------------+</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">cDf</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'*'</span><span class="p">,</span> <span class="n">coalesce</span><span class="p">(</span><span class="n">cDf</span><span class="p">[</span><span class="s">"a"</span><span class="p">],</span> <span class="n">lit</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+----+---------------+</span>
<span class="go">| a| b|coalesce(a,0.0)|</span>
<span class="go">+----+----+---------------+</span>
<span class="go">|null|null| 0.0|</span>
<span class="go">| 1|null| 1.0|</span>
<span class="go">|null| 2| 0.0|</span>
<span class="go">+----+----+---------------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.col">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">col</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.col" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> based on the given column name.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.column">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">column</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.column" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> based on the given column name.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.concat">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">concat</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#concat"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.concat" title="Permalink to this definition">¶</a></dt>
<dd><p>Concatenates multiple input string columns together into a single string column.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'abcd'</span><span class="p">,</span><span class="s">'123'</span><span class="p">)],</span> <span class="p">[</span><span class="s">'s'</span><span class="p">,</span> <span class="s">'d'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">concat</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=u'abcd123')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.concat_ws">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">concat_ws</code><span class="sig-paren">(</span><em>sep</em>, <em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#concat_ws"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.concat_ws" title="Permalink to this definition">¶</a></dt>
<dd><p>Concatenates multiple input string columns together into a single string column,
using the given separator.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'abcd'</span><span class="p">,</span><span class="s">'123'</span><span class="p">)],</span> <span class="p">[</span><span class="s">'s'</span><span class="p">,</span> <span class="s">'d'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">concat_ws</span><span class="p">(</span><span class="s">'-'</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=u'abcd-123')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.conv">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">conv</code><span class="sig-paren">(</span><em>col</em>, <em>fromBase</em>, <em>toBase</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#conv"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.conv" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts a number in a string column from one base to another.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">"010101"</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'n'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">conv</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">n</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">16</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'hex'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(hex=u'15')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
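The base conversion above can be sketched in plain Python: parse the string in <code>fromBase</code> with <code>int</code>, then re-render the digits in <code>toBase</code>. This stand-alone <code>conv</code> is an illustrative stand-in, not Spark's implementation:

```python
# Parse in from_base, re-render in to_base.
def conv(s, from_base, to_base):
    n = int(s, from_base)
    if n == 0:
        return "0"
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = []
    while n:
        n, r = divmod(n, to_base)
        out.append(digits[r])
    return "".join(reversed(out))

print(conv("010101", 2, 16))  # '15', matching the example output above
```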
<dl class="function">
<dt id="pyspark.sql.functions.cos">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">cos</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.cos" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the cosine of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.cosh">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">cosh</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.cosh" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the hyperbolic cosine of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.count">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">count</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.count" title="Permalink to this definition">¶</a></dt>
<dd><p>Aggregate function: returns the number of items in a group.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.countDistinct">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">countDistinct</code><span class="sig-paren">(</span><em>col</em>, <em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#countDistinct"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.countDistinct" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> for distinct count of <code class="docutils literal"><span class="pre">col</span></code> or <code class="docutils literal"><span class="pre">cols</span></code>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">countDistinct</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'c'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(c=2)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">countDistinct</span><span class="p">(</span><span class="s">"age"</span><span class="p">,</span> <span class="s">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'c'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(c=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.crc32">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">crc32</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#crc32"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.crc32" title="Permalink to this definition">¶</a></dt>
<dd><p>Calculates the cyclic redundancy check value (CRC32) of a binary column and
returns the value as a bigint.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'ABC'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">crc32</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'crc32'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(crc32=2743272264)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
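This is the standard CRC32 checksum, so Python's stdlib <code>zlib.crc32</code> reproduces the example value above:

```python
import zlib

# Standard CRC32 of the bytes b"ABC".
print(zlib.crc32(b"ABC"))  # 2743272264, matching the example above
```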
<dl class="function">
<dt id="pyspark.sql.functions.cumeDist">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">cumeDist</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.cumeDist" title="Permalink to this definition">¶</a></dt>
<dd><p>Window function: returns the cumulative distribution of values within a window partition,
i.e. the fraction of rows whose value is at or below the current row&#8217;s value.</p>
<p>This is equivalent to the CUME_DIST function in SQL.</p>
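<p>The definition can be sketched in plain Python (an illustration of the semantics, not the PySpark API): each row's CUME_DIST is the number of partition rows with a value at or below the current value, divided by the partition size.</p>

```python
def cume_dist(values):
    """Cumulative distribution of each value within one window partition."""
    n = len(values)
    return [sum(1 for v in values if v <= x) / float(n) for x in values]

# tied values share the same cumulative fraction
print(cume_dist([1, 2, 2, 3]))  # [0.25, 0.75, 0.75, 1.0]
```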
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.current_date">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">current_date</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#current_date"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.current_date" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the current date as a date column.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.current_timestamp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">current_timestamp</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#current_timestamp"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.current_timestamp" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the current timestamp as a timestamp column.</p>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.date_add">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">date_add</code><span class="sig-paren">(</span><em>start</em>, <em>days</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#date_add"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.date_add" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the date that is <cite>days</cite> days after <cite>start</cite>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'d'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">date_add</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'d'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=datetime.date(2015, 4, 9))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.date_format">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">date_format</code><span class="sig-paren">(</span><em>date</em>, <em>format</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#date_format"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.date_format" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts a date/timestamp/string to a string formatted according to the date
format given by the second argument.</p>
<p>A pattern could be for instance <cite>dd.MM.yyyy</cite> and could return a string like ‘18.03.1993’. All
pattern letters of the Java class <cite>java.text.SimpleDateFormat</cite> can be used.</p>
<p>NOTE: Whenever possible, use specialized functions like <cite>year</cite>; these benefit from a
specialized implementation.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">date_format</span><span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'MM/dd/yyy'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'date'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(date=u'04/08/2015')]</span>
</pre></div>
</div>
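<p>Note that <cite>MM/dd/yyy</cite> above is a Java <cite>SimpleDateFormat</cite> pattern, not a Python <cite>strftime</cite> pattern. For intuition only, a rough local equivalent of the example is:</p>

```python
from datetime import date

# strftime uses %m/%d/%Y where SimpleDateFormat writes MM/dd/yyyy
print(date(2015, 4, 8).strftime('%m/%d/%Y'))  # 04/08/2015
```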
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.date_sub">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">date_sub</code><span class="sig-paren">(</span><em>start</em>, <em>days</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#date_sub"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.date_sub" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the date that is <cite>days</cite> days before <cite>start</cite>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'d'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">date_sub</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'d'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=datetime.date(2015, 4, 7))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.datediff">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">datediff</code><span class="sig-paren">(</span><em>end</em>, <em>start</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#datediff"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.datediff" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the number of days from <cite>start</cite> to <cite>end</cite>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08'</span><span class="p">,</span><span class="s">'2015-05-10'</span><span class="p">)],</span> <span class="p">[</span><span class="s">'d1'</span><span class="p">,</span> <span class="s">'d2'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">datediff</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">d1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'diff'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(diff=32)]</span>
</pre></div>
</div>
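<p>The result can be cross-checked with ordinary <cite>datetime</cite> arithmetic (a local sanity check, not the PySpark API):</p>

```python
from datetime import date

# subtracting two dates yields a timedelta; .days matches datediff's result
diff = (date(2015, 5, 10) - date(2015, 4, 8)).days
print(diff)  # 32
```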
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.dayofmonth">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">dayofmonth</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#dayofmonth"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.dayofmonth" title="Permalink to this definition">¶</a></dt>
<dd><p>Extract the day of the month of a given date as an integer.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">dayofmonth</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'day'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(day=8)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.dayofyear">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">dayofyear</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#dayofyear"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.dayofyear" title="Permalink to this definition">¶</a></dt>
<dd><p>Extract the day of the year of a given date as an integer.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">dayofyear</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'day'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(day=98)]</span>
</pre></div>
</div>
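<p>The value can be cross-checked locally with the standard library (not the PySpark API):</p>

```python
from datetime import date

# tm_yday is the 1-based day of year, matching dayofyear's result
print(date(2015, 4, 8).timetuple().tm_yday)  # 98
```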
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.decode">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">decode</code><span class="sig-paren">(</span><em>col</em>, <em>charset</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#decode"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.decode" title="Permalink to this definition">¶</a></dt>
<dd><p>Decodes the first argument from a binary column into a string using the provided character set
(one of &#8216;US-ASCII&#8217;, &#8216;ISO-8859-1&#8217;, &#8216;UTF-8&#8217;, &#8216;UTF-16BE&#8217;, &#8216;UTF-16LE&#8217;, &#8216;UTF-16&#8217;).</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.denseRank">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">denseRank</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.denseRank" title="Permalink to this definition">¶</a></dt>
<dd><p>Window function: returns the rank of rows within a window partition, without any gaps.</p>
<p>The difference between rank and denseRank is that denseRank leaves no gaps in the
ranking sequence when there are ties. That is, if three people tie for second place in a
competition ranked with denseRank, all three are in second place and the next person
comes in third.</p>
<p>This is equivalent to the DENSE_RANK function in SQL.</p>
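<p>The tie behaviour can be sketched in plain Python over one already-sorted partition (an illustration of the semantics, not the PySpark API):</p>

```python
def dense_rank(values):
    """Dense rank over a sorted partition: ties share a rank, and the
    next distinct value takes the next consecutive rank (no gaps)."""
    ranks, rank, prev = [], 0, object()  # sentinel distinct from any value
    for v in values:
        if v != prev:
            rank += 1
            prev = v
        ranks.append(rank)
    return ranks

# a three-way tie for second place: the next value ranks third, not fifth
print(dense_rank([10, 20, 20, 20, 30]))  # [1, 2, 2, 2, 3]
```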
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.desc">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">desc</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.desc" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a sort expression based on the descending order of the given column name.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.encode">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">encode</code><span class="sig-paren">(</span><em>col</em>, <em>charset</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#encode"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.encode" title="Permalink to this definition">¶</a></dt>
<dd><p>Encodes the first argument from a string into a binary column using the provided character set
(one of &#8216;US-ASCII&#8217;, &#8216;ISO-8859-1&#8217;, &#8216;UTF-8&#8217;, &#8216;UTF-16BE&#8217;, &#8216;UTF-16LE&#8217;, &#8216;UTF-16&#8217;).</p>
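<p>Per row, this is the same conversion Python itself performs between strings and bytes; as a local illustration (not the PySpark API):</p>

```python
# encode() maps a string to binary using an explicit charset; in plain
# Python the same per-value conversion is str.encode
raw = u'ABC'.encode('UTF-16BE')   # two bytes per character: b'\x00A\x00B\x00C'

# decode() is the inverse conversion from binary back to a string
text = raw.decode('UTF-16BE')     # 'ABC'
```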
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.exp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">exp</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.exp" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the exponential of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.explode">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">explode</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#explode"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.explode" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new row for each element in the given array or map.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">Row</span>
<span class="gp">>>> </span><span class="n">eDF</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([</span><span class="n">Row</span><span class="p">(</span><span class="n">a</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">intlist</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">],</span> <span class="n">mapfield</span><span class="o">=</span><span class="p">{</span><span class="s">"a"</span><span class="p">:</span> <span class="s">"b"</span><span class="p">})])</span>
<span class="gp">>>> </span><span class="n">eDF</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">explode</span><span class="p">(</span><span class="n">eDF</span><span class="o">.</span><span class="n">intlist</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"anInt"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(anInt=1), Row(anInt=2), Row(anInt=3)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">eDF</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">explode</span><span class="p">(</span><span class="n">eDF</span><span class="o">.</span><span class="n">mapfield</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"key"</span><span class="p">,</span> <span class="s">"value"</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|key|value|</span>
<span class="go">+---+-----+</span>
<span class="go">| a| b|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.expm1">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">expm1</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.expm1" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the exponential of the given value minus one, i.e. exp(col) - 1.</p>
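<p>A dedicated expm1 exists for the same reason as Python's <cite>math.expm1</cite>: computing exp(x) - 1 by subtraction loses precision when x is near zero (a stdlib illustration, not the PySpark API):</p>

```python
import math

x = 1e-10
naive = math.exp(x) - 1    # subtraction cancels most significant digits
accurate = math.expm1(x)   # keeps full precision, ~ x + x**2/2 for small x
print(naive, accurate)
```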
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.expr">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">expr</code><span class="sig-paren">(</span><em>str</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#expr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.expr" title="Permalink to this definition">¶</a></dt>
<dd><p>Parses the expression string into the column that it represents.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">expr</span><span class="p">(</span><span class="s">"length(name)"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(length(name)=5), Row(length(name)=3)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.factorial">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">factorial</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#factorial"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.factorial" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the factorial of the given value.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">5</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'n'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">factorial</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">n</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'f'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(f=120)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.first">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">first</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.first" title="Permalink to this definition">¶</a></dt>
<dd><p>Aggregate function: returns the first value in a group.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.floor">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">floor</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.floor" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the floor of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.format_number">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">format_number</code><span class="sig-paren">(</span><em>col</em>, <em>d</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#format_number"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.format_number" title="Permalink to this definition">¶</a></dt>
<dd><p>Formats the number X to a format like &#8216;#,###,###.##&#8217;, rounded to <cite>d</cite> decimal places,
and returns the result as a string.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – the column name of the numeric value to be formatted</li>
<li><strong>d</strong> &#8211; the number of decimal places</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">5</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">format_number</span><span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'v'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(v=u'5.0000')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.format_string">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">format_string</code><span class="sig-paren">(</span><em>format</em>, <em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#format_string"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.format_string" title="Permalink to this definition">¶</a></dt>
<dd><p>Formats the arguments in printf-style and returns the result as a string column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>format</strong> &#8211; string that can contain embedded format tags, used as the result column&#8217;s value</li>
<li><strong>cols</strong> &#8211; list of column names (string) or list of <cite>Column</cite> expressions to be used in formatting</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">5</span><span class="p">,</span> <span class="s">"hello"</span><span class="p">)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">format_string</span><span class="p">(</span><span class="s">'</span><span class="si">%d</span><span class="s"> </span><span class="si">%s</span><span class="s">'</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">a</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">b</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'v'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(v=u'5 hello')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.from_unixtime">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">from_unixtime</code><span class="sig-paren">(</span><em>timestamp</em>, <em>format='yyyy-MM-dd HH:mm:ss'</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#from_unixtime"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.from_unixtime" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string
representing the timestamp of that moment in the current system time zone, in the given
format.</p>
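<p>Because the rendered string depends on the executor's system time zone, the same timestamp can format differently on different clusters. Pinning the conversion to UTC in plain Python shows the epoch reference point (a stdlib illustration, not the PySpark API):</p>

```python
from datetime import datetime

# second 0 of the Unix epoch, rendered in UTC instead of the local zone
epoch_str = datetime.utcfromtimestamp(0).strftime('%Y-%m-%d %H:%M:%S')
print(epoch_str)  # 1970-01-01 00:00:00
```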
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.from_utc_timestamp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">from_utc_timestamp</code><span class="sig-paren">(</span><em>timestamp</em>, <em>tz</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#from_utc_timestamp"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.from_utc_timestamp" title="Permalink to this definition">¶</a></dt>
<dd><p>Assumes the given timestamp is in UTC and converts it to the given time zone.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'1997-02-28 10:30:00'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'t'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">from_utc_timestamp</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">t</span><span class="p">,</span> <span class="s">"PST"</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'t'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(t=datetime.datetime(1997, 2, 28, 2, 30))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.greatest">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">greatest</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#greatest"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.greatest" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the greatest value in the list of columns, skipping null values.
This function takes at least two parameters. It returns null if and only if all parameters are null.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">greatest</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">a</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">b</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"greatest"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(greatest=4)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.hex">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">hex</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#hex"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.hex" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes hex value of the given column, which could be StringType,
BinaryType, IntegerType or LongType.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'ABC'</span><span class="p">,</span> <span class="mi">3</span><span class="p">)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="nb">hex</span><span class="p">(</span><span class="s">'a'</span><span class="p">),</span> <span class="nb">hex</span><span class="p">(</span><span class="s">'b'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(hex(a)=u'414243', hex(b)=u'3')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.hour">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">hour</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#hour"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.hour" title="Permalink to this definition">¶</a></dt>
<dd><p>Extract the hour of a given date as an integer.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08 13:08:15'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">hour</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'hour'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(hour=13)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.hypot">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">hypot</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.hypot" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes <cite>sqrt(a^2 + b^2)</cite> without intermediate overflow or underflow.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
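<p>To see why this matters, here is a plain-Python sketch (using the standard library's <cite>math.hypot</cite>, not PySpark) of how the naive formula overflows on large inputs while the guarded computation stays finite:</p>

```python
import math

a = b = 1e200

# Naive formula: a * a overflows the float range to inf,
# so the square root is also inf.
naive = math.sqrt(a * a + b * b)

# math.hypot rescales internally, mirroring the behaviour
# documented here for pyspark's hypot.
safe = math.hypot(a, b)

print(naive)  # inf
print(safe)   # 1.4142135623730951e+200
```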
<dl class="function">
<dt id="pyspark.sql.functions.initcap">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">initcap</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#initcap"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.initcap" title="Permalink to this definition">¶</a></dt>
<dd><p>Translates the first letter of each word in the sentence to upper case.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'ab cd'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">initcap</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'v'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(v=u'Ab Cd')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.instr">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">instr</code><span class="sig-paren">(</span><em>str</em>, <em>substr</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#instr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.instr" title="Permalink to this definition">¶</a></dt>
<dd><p>Locate the position of the first occurrence of the substr column in the given string.
Returns null if either of the arguments is null.</p>
<p>NOTE: The position is not zero-based, but 1-based; returns 0 if substr
could not be found in str.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'abcd'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'s'</span><span class="p">,])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">instr</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="s">'b'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
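<p>The 1-based convention can be sketched in plain Python with a hypothetical helper <cite>instr_py</cite> (a simplification that ignores the null-handling described above): shifting Python's zero-based <cite>str.find</cite> by one makes a miss (-1) come out as 0, exactly as documented.</p>

```python
def instr_py(s, sub):
    # str.find is 0-based and returns -1 when sub is absent,
    # so adding 1 yields the documented 1-based position, 0 on a miss.
    return s.find(sub) + 1

print(instr_py('abcd', 'b'))  # 2, matching the doctest above
print(instr_py('abcd', 'z'))  # 0
```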
<dl class="function">
<dt id="pyspark.sql.functions.lag">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">lag</code><span class="sig-paren">(</span><em>col</em>, <em>count=1</em>, <em>default=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#lag"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.lag" title="Permalink to this definition">¶</a></dt>
<dd><p>Window function: returns the value that is <cite>count</cite> rows before the current row, and
<cite>default</cite> if there are fewer than <cite>count</cite> rows before the current row. For example,
a <cite>count</cite> of one will return the previous row at any given point in the window partition.</p>
<p>This is equivalent to the LAG function in SQL.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – name of column or expression</li>
<li><strong>count</strong> – number of rows to look back</li>
<li><strong>default</strong> – default value</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
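<p>The semantics of <cite>lag</cite> (and the <cite>lead</cite> function below) over a single ordered partition can be sketched in plain Python with hypothetical helpers <cite>lag_list</cite> and <cite>lead_list</cite>; this is only an illustration of the row-shifting behaviour, not how Spark executes window functions:</p>

```python
def lag_list(values, count=1, default=None):
    # Value count rows earlier, or default near the start of the partition.
    return [values[i - count] if i - count >= 0 else default
            for i in range(len(values))]

def lead_list(values, count=1, default=None):
    # Value count rows later, or default near the end of the partition.
    return [values[i + count] if i + count < len(values) else default
            for i in range(len(values))]

print(lag_list([10, 20, 30]))   # [None, 10, 20]
print(lead_list([10, 20, 30]))  # [20, 30, None]
```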
<dl class="function">
<dt id="pyspark.sql.functions.last">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">last</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.last" title="Permalink to this definition">¶</a></dt>
<dd><p>Aggregate function: returns the last value in a group.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.last_day">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">last_day</code><span class="sig-paren">(</span><em>date</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#last_day"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.last_day" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the last day of the month to which the given date belongs.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'1997-02-10'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'d'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">last_day</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'date'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(date=datetime.date(1997, 2, 28))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
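<p>The same result can be computed in plain Python with the standard library's <cite>calendar.monthrange</cite>; the helper name <cite>last_day_py</cite> below is hypothetical and serves only to illustrate the semantics:</p>

```python
import calendar
import datetime

def last_day_py(d):
    # calendar.monthrange returns (weekday of day 1, number of days in month)
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

print(last_day_py(datetime.date(1997, 2, 10)))  # 1997-02-28
```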
<dl class="function">
<dt id="pyspark.sql.functions.lead">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">lead</code><span class="sig-paren">(</span><em>col</em>, <em>count=1</em>, <em>default=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#lead"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.lead" title="Permalink to this definition">¶</a></dt>
<dd><p>Window function: returns the value that is <cite>count</cite> rows after the current row, and
<cite>default</cite> if there are fewer than <cite>count</cite> rows after the current row. For example,
a <cite>count</cite> of one will return the next row at any given point in the window partition.</p>
<p>This is equivalent to the LEAD function in SQL.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>col</strong> – name of column or expression</li>
<li><strong>count</strong> – number of rows to look ahead</li>
<li><strong>default</strong> – default value</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.least">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">least</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#least"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.least" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the least value of the list of column names, skipping null values.
This function takes at least 2 parameters. It will return null iff all parameters are null.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">least</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">a</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">b</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">c</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"least"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(least=1)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.length">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">length</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#length"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.length" title="Permalink to this definition">¶</a></dt>
<dd><p>Calculates the length of a string or binary expression.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'ABC'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">length</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'length'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(length=3)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.levenshtein">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">levenshtein</code><span class="sig-paren">(</span><em>left</em>, <em>right</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#levenshtein"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.levenshtein" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the Levenshtein distance of the two given strings.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df0</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'kitten'</span><span class="p">,</span> <span class="s">'sitting'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'l'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df0</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">levenshtein</span><span class="p">(</span><span class="s">'l'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'d'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=3)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
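<p>For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A compact dynamic-programming sketch in plain Python (the helper name <cite>levenshtein_py</cite> is hypothetical; Spark's implementation is in the JVM) reproduces the doctest's result:</p>

```python
def levenshtein_py(a, b):
    # Classic edit-distance DP, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

print(levenshtein_py('kitten', 'sitting'))  # 3
```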
<dl class="function">
<dt id="pyspark.sql.functions.lit">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">lit</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.lit" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> of literal value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.locate">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">locate</code><span class="sig-paren">(</span><em>substr</em>, <em>str</em>, <em>pos=0</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#locate"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.locate" title="Permalink to this definition">¶</a></dt>
<dd><p>Locate the position of the first occurrence of substr in a string column, after position pos.</p>
<p>NOTE: The position is not zero-based, but 1-based; returns 0 if substr
could not be found in str.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>substr</strong> – a string</li>
<li><strong>str</strong> – a Column of StringType</li>
<li><strong>pos</strong> – start position (zero based)</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'abcd'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'s'</span><span class="p">,])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">locate</span><span class="p">(</span><span class="s">'b'</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.log">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">log</code><span class="sig-paren">(</span><em>arg1</em>, <em>arg2=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#log"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.log" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the logarithm of the second argument, using the first argument as the base.</p>
<p>If there is only one argument, then this takes the natural logarithm of the argument.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">log</span><span class="p">(</span><span class="mf">10.0</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'ten'</span><span class="p">))</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">ten</span><span class="p">)[:</span><span class="mi">7</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">['0.30102', '0.69897']</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">log</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'e'</span><span class="p">))</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">e</span><span class="p">)[:</span><span class="mi">7</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">['0.69314', '1.60943']</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.log10">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">log10</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.log10" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the base-10 logarithm of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.log1p">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">log1p</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.log1p" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the natural logarithm of the given value plus one.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.log2">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">log2</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#log2"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.log2" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the base-2 logarithm of the argument.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">4</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">log2</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'log2'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(log2=2.0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.lower">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">lower</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.lower" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts a string column to lower case.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.lpad">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">lpad</code><span class="sig-paren">(</span><em>col</em>, <em>len</em>, <em>pad</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#lpad"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.lpad" title="Permalink to this definition">¶</a></dt>
<dd><p>Left-pad the string column to width <cite>len</cite> with <cite>pad</cite>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'abcd'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'s'</span><span class="p">,])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">lpad</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="s">'#'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=u'##abcd')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.ltrim">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">ltrim</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.ltrim" title="Permalink to this definition">¶</a></dt>
<dd><p>Trim the spaces from the left end of the specified string value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.max">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">max</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.max" title="Permalink to this definition">¶</a></dt>
<dd><p>Aggregate function: returns the maximum value of the expression in a group.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.md5">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">md5</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#md5"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.md5" title="Permalink to this definition">¶</a></dt>
<dd><p>Calculates the MD5 digest and returns the value as a 32 character hex string.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'ABC'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">md5</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'hash'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(hash=u'902fbdd2b1df0c4f70b4a5d23525e932')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
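<p>The digest matches what Python's standard <cite>hashlib</cite> produces for the same bytes, which is a convenient way to sanity-check the column output locally:</p>

```python
import hashlib

# MD5 of the UTF-8 bytes of 'ABC' - the same 32-character hex digest
# shown in the doctest above.
digest = hashlib.md5('ABC'.encode('utf-8')).hexdigest()
print(digest)  # 902fbdd2b1df0c4f70b4a5d23525e932
```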
<dl class="function">
<dt id="pyspark.sql.functions.mean">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">mean</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.mean" title="Permalink to this definition">¶</a></dt>
<dd><p>Aggregate function: returns the average of the values in a group.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.min">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">min</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.min" title="Permalink to this definition">¶</a></dt>
<dd><p>Aggregate function: returns the minimum value of the expression in a group.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.minute">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">minute</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#minute"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.minute" title="Permalink to this definition">¶</a></dt>
<dd><p>Extract the minute of a given date as an integer.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08 13:08:15'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">minute</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'minute'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(minute=8)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.monotonicallyIncreasingId">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">monotonicallyIncreasingId</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#monotonicallyIncreasingId"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.monotonicallyIncreasingId" title="Permalink to this definition">¶</a></dt>
<dd><p>A column that generates monotonically increasing 64-bit integers.</p>
<p>The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
The current implementation puts the partition ID in the upper 31 bits, and the record number
within each partition in the lower 33 bits. The assumption is that the data frame has
less than 1 billion partitions, and each partition has less than 8 billion records.</p>
<p>As an example, consider a <code class="xref py py-class docutils literal"><span class="pre">DataFrame</span></code> with two partitions, each with 3 records.
This expression would return the following IDs:
0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df0</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">mapPartitions</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,)])</span><span class="o">.</span><span class="n">toDF</span><span class="p">([</span><span class="s">'col1'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df0</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">monotonicallyIncreasingId</span><span class="p">()</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'id'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(id=0), Row(id=1), Row(id=2), Row(id=8589934592), Row(id=8589934593), Row(id=8589934594)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
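<p>A pure-Python sketch of the bit layout described above (the helper name <cite>monotonic_id</cite> is ours, not part of the API):</p>

```python
def monotonic_id(partition_id, record_number):
    # Partition ID in the upper bits, per-partition record
    # number in the lower 33 bits, as described above.
    return (partition_id << 33) | record_number

# Two partitions with three records each reproduce the example IDs:
ids = [monotonic_id(p, r) for p in range(2) for r in range(3)]
# [0, 1, 2, 8589934592, 8589934593, 8589934594]
```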
<dl class="function">
<dt id="pyspark.sql.functions.month">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">month</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#month"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.month" title="Permalink to this definition">¶</a></dt>
<dd><p>Extract the month of a given date as an integer.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">month</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'month'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(month=4)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.months_between">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">months_between</code><span class="sig-paren">(</span><em>date1</em>, <em>date2</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#months_between"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.months_between" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the number of months between date1 and date2.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'1997-02-28 10:30:00'</span><span class="p">,</span> <span class="s">'1996-10-30'</span><span class="p">)],</span> <span class="p">[</span><span class="s">'t'</span><span class="p">,</span> <span class="s">'d'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">months_between</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">t</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'months'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(months=3.9495967...)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
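<p>The fractional result can be checked by hand. Assuming Spark counts whole months from the year and month parts and divides the remaining day and time-of-day difference by 31 (the actual implementation also special-cases month ends), a sketch:</p>

```python
from datetime import datetime

def months_between(t1, t2):
    # Whole months from year/month parts, plus the fractional day
    # difference (including time of day) over a 31-day month.
    whole = (t1.year - t2.year) * 12 + (t1.month - t2.month)
    day1 = t1.day + t1.hour / 24.0 + t1.minute / 1440.0
    day2 = t2.day + t2.hour / 24.0 + t2.minute / 1440.0
    return whole + (day1 - day2) / 31.0

months_between(datetime(1997, 2, 28, 10, 30), datetime(1996, 10, 30))
# ~3.9495967..., matching the doctest above
```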
<dl class="function">
<dt id="pyspark.sql.functions.next_day">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">next_day</code><span class="sig-paren">(</span><em>date</em>, <em>dayOfWeek</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#next_day"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.next_day" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the first date that falls on the given day of the week and is strictly later than the value of the date column.</p>
<dl class="docutils">
<dt>The day-of-week parameter is case-insensitive, and accepts:</dt>
<dd>“Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat”, “Sun”.</dd>
</dl>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-07-27'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'d'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">next_day</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="s">'Sun'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'date'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(date=datetime.date(2015, 8, 2))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
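<p>A pure-Python sketch of the same lookup (hypothetical helper); because the result must be strictly later than the input, a matching weekday rolls forward a full week:</p>

```python
from datetime import date, timedelta

DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

def next_day(d, day_of_week):
    target = DAYS.index(day_of_week.capitalize()[:3])
    # Strictly later: a zero-day gap becomes a full seven-day jump.
    delta = (target - d.weekday() + 6) % 7 + 1
    return d + timedelta(days=delta)

next_day(date(2015, 7, 27), 'Sun')  # datetime.date(2015, 8, 2)
```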
<dl class="function">
<dt id="pyspark.sql.functions.ntile">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">ntile</code><span class="sig-paren">(</span><em>n</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#ntile"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.ntile" title="Permalink to this definition">¶</a></dt>
<dd><p>Window function: returns the ntile group id (from 1 to <cite>n</cite> inclusive)
in an ordered window partition. For example, if <cite>n</cite> is 4, the first
quarter of the rows will get value 1, the second quarter will get 2,
the third quarter will get 3, and the last quarter will get 4.</p>
<p>This is equivalent to the NTILE function in SQL.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>n</strong> – an integer</td>
</tr>
</tbody>
</table>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
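<p>The grouping rule can be sketched in pure Python. When the partition size is not a multiple of <cite>n</cite>, SQL's NTILE gives the earlier groups one extra row each (a sketch of that rule, not Spark's implementation):</p>

```python
def ntile_groups(n, num_rows):
    # Group ids 1..n in order; the first (num_rows % n)
    # groups receive one extra row each.
    base, extra = divmod(num_rows, n)
    out = []
    for group in range(1, n + 1):
        out.extend([group] * (base + (1 if group <= extra else 0)))
    return out

ntile_groups(4, 8)   # [1, 1, 2, 2, 3, 3, 4, 4]
ntile_groups(4, 10)  # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
```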
<dl class="function">
<dt id="pyspark.sql.functions.percentRank">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">percentRank</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.percentRank" title="Permalink to this definition">¶</a></dt>
<dd><p>Window function: returns the relative rank (i.e. percentile) of rows within a window partition.</p>
<p>This is equivalent to the PERCENT_RANK function in SQL.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
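<p>PERCENT_RANK is defined as <cite>(rank - 1) / (rows in partition - 1)</cite>; a pure-Python sketch over an already-sorted partition (hypothetical helper):</p>

```python
def percent_rank(values):
    # `values` must already be sorted; tied rows share the rank of
    # the first equal value, so rank - 1 is simply the first index.
    n = len(values)
    return [values.index(v) / (n - 1) for v in values]

percent_rank([10, 20, 20, 30])  # [0.0, 0.333..., 0.333..., 1.0]
```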
<dl class="function">
<dt id="pyspark.sql.functions.pow">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">pow</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.pow" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the value of the first argument raised to the power of the second argument.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.quarter">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">quarter</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#quarter"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.quarter" title="Permalink to this definition">¶</a></dt>
<dd><p>Extract the quarter of a given date as integer.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">quarter</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'quarter'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(quarter=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.rand">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">rand</code><span class="sig-paren">(</span><em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#rand"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.rand" title="Permalink to this definition">¶</a></dt>
<dd><p>Generates a random column with i.i.d. samples from U[0.0, 1.0].</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.randn">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">randn</code><span class="sig-paren">(</span><em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#randn"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.randn" title="Permalink to this definition">¶</a></dt>
<dd><p>Generates a column with i.i.d. samples from the standard normal distribution.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.rank">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">rank</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.rank" title="Permalink to this definition">¶</a></dt>
<dd><p>Window function: returns the rank of rows within a window partition.</p>
<p>The difference between rank and denseRank is that denseRank leaves no gaps in the ranking
sequence when there are ties. That is, if you were ranking a competition using denseRank
and had three people tie for second place, you would say that all three were in second
place and that the next person came in third. With rank, the three tied rows would all
rank second and the next person would come in fifth, since the tie consumes positions
two through four.</p>
<p>This is equivalent to the RANK function in SQL.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
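<p>The tie-handling difference is easy to sketch in pure Python (hypothetical helpers, ranking higher scores first):</p>

```python
def rank(values):
    # Competition ranking: ties share a rank, leaving gaps after them.
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def dense_rank(values):
    # Ties share a rank and no gaps are left afterwards.
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in values]

scores = [100, 90, 90, 90, 80]
rank(scores)        # [1, 2, 2, 2, 5]
dense_rank(scores)  # [1, 2, 2, 2, 3]
```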
<dl class="function">
<dt id="pyspark.sql.functions.regexp_extract">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">regexp_extract</code><span class="sig-paren">(</span><em>str</em>, <em>pattern</em>, <em>idx</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#regexp_extract"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.regexp_extract" title="Permalink to this definition">¶</a></dt>
<dd><p>Extract a specific group (at index <cite>idx</cite>), identified by a Java regex, from the specified string column.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'100-200'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'str'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">regexp_extract</span><span class="p">(</span><span class="s">'str'</span><span class="p">,</span> <span class="s">'(\d+)-(\d+)'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'d'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=u'100')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.regexp_replace">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">regexp_replace</code><span class="sig-paren">(</span><em>str</em>, <em>pattern</em>, <em>replacement</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#regexp_replace"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.regexp_replace" title="Permalink to this definition">¶</a></dt>
<dd><p>Replace all substrings of the specified string value that match <cite>pattern</cite> with <cite>replacement</cite>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'100-200'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'str'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">regexp_replace</span><span class="p">(</span><span class="s">'str'</span><span class="p">,</span> <span class="s">'(\d+)'</span><span class="p">,</span> <span class="s">'--'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'d'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(d=u'-----')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
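<p>Both regexp functions above use Java regex syntax; for simple patterns like these, Python's <cite>re</cite> module behaves the same way, which makes the doctests easy to verify locally:</p>

```python
import re

# regexp_extract('100-200', r'(\d+)-(\d+)', 1): the first captured group
first = re.search(r'(\d+)-(\d+)', '100-200').group(1)  # '100'

# regexp_replace('100-200', r'(\d+)', '--'): every digit run replaced
replaced = re.sub(r'(\d+)', '--', '100-200')  # '-----'
```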
<dl class="function">
<dt id="pyspark.sql.functions.repeat">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">repeat</code><span class="sig-paren">(</span><em>col</em>, <em>n</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#repeat"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.repeat" title="Permalink to this definition">¶</a></dt>
<dd><p>Repeats a string column n times, and returns it as a new string column.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'ab'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'s'</span><span class="p">,])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">repeat</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=u'ababab')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.reverse">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">reverse</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.reverse" title="Permalink to this definition">¶</a></dt>
<dd><p>Reverses the string column and returns it as a new string column.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.rint">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">rint</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.rint" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the double value that is closest in value to the argument and is equal to a mathematical integer.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
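<p>This wording mirrors Java's <cite>Math.rint</cite>, under which values exactly halfway between two integers round to the even one; Python 3's built-in <cite>round</cite> follows the same rule:</p>

```python
# Halfway cases go to the nearest even integer, as with Math.rint.
[float(round(x)) for x in (0.5, 1.5, 2.5, 2.6)]  # [0.0, 2.0, 2.0, 3.0]
```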
<dl class="function">
<dt id="pyspark.sql.functions.round">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">round</code><span class="sig-paren">(</span><em>col</em>, <em>scale=0</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#round"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.round" title="Permalink to this definition">¶</a></dt>
<dd><p>Round the value of <cite>col</cite> to <cite>scale</cite> decimal places if <cite>scale</cite> &gt;= 0
or at the integral part when <cite>scale</cite> &lt; 0.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mf">2.546</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="nb">round</span><span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'r'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=2.5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
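<p>Assuming HALF_UP rounding (half away from zero, unlike Python 3's built-in <cite>round</cite>, which rounds half to even), the behaviour for both signs of <cite>scale</cite> can be sketched with the <cite>decimal</cite> module:</p>

```python
from decimal import Decimal, ROUND_HALF_UP

def spark_round(x, scale=0):
    # HALF_UP at `scale` decimal places; a negative scale rounds
    # at the integral part (tens, hundreds, ...).
    q = Decimal(1).scaleb(-scale)
    return float(Decimal(str(x)).quantize(q, rounding=ROUND_HALF_UP))

spark_round(2.546, 1)  # 2.5
spark_round(2.5)       # 3.0 (built-in round(2.5) gives 2)
spark_round(1234, -2)  # 1200.0
```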
<dl class="function">
<dt id="pyspark.sql.functions.rowNumber">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">rowNumber</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.rowNumber" title="Permalink to this definition">¶</a></dt>
<dd><p>Window function: returns a sequential number starting at 1 within a window partition.</p>
<p>This is equivalent to the ROW_NUMBER function in SQL.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.rpad">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">rpad</code><span class="sig-paren">(</span><em>col</em>, <em>len</em>, <em>pad</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#rpad"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.rpad" title="Permalink to this definition">¶</a></dt>
<dd><p>Right-pad the string column to width <cite>len</cite> with <cite>pad</cite>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'abcd'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'s'</span><span class="p">,])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">rpad</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="s">'#'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=u'abcd##')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.rtrim">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">rtrim</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.rtrim" title="Permalink to this definition">¶</a></dt>
<dd><p>Trim the spaces from the right end of the specified string value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.second">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">second</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#second"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.second" title="Permalink to this definition">¶</a></dt>
<dd><p>Extract the seconds of a given date as an integer.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08 13:08:15'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">second</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'second'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(second=15)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sha1">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sha1</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#sha1"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.sha1" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the hex string result of SHA-1.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'ABC'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">sha1</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'hash'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(hash=u'3c01bdbb26f358bab27f267924aa2c9a03fcfdb8')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sha2">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sha2</code><span class="sig-paren">(</span><em>col</em>, <em>numBits</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#sha2"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.sha2" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384,
and SHA-512). The numBits indicates the desired bit length of the result, which must have a
value of 224, 256, 384, 512, or 0 (which is equivalent to 256).</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">digests</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">sha2</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="mi">256</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">digests</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="go">Row(s=u'3bc51062973c458d5a6f2d8d64a023246354ad7e064b1e4e009ec8a0699a3043')</span>
<span class="gp">>>> </span><span class="n">digests</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="go">Row(s=u'cd9fb1e148ccd8442e5aa74904cc73bf6fb54d1d54d333bd596aa9bb4bb4e961')</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
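<p>The mapping from <cite>numBits</cite> to a SHA-2 family member can be sketched with Python's <cite>hashlib</cite>, treating 0 as 256 per the description above (checked here against the standard "abc" test vector):</p>

```python
import hashlib

def sha2_hex(data, num_bits):
    # numBits selects the SHA-2 variant; 0 is equivalent to 256.
    algos = {0: 'sha256', 224: 'sha224', 256: 'sha256',
             384: 'sha384', 512: 'sha512'}
    return hashlib.new(algos[num_bits], data).hexdigest()

sha2_hex(b'abc', 256)
# 'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad'
```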
<dl class="function">
<dt id="pyspark.sql.functions.shiftLeft">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">shiftLeft</code><span class="sig-paren">(</span><em>col</em>, <em>numBits</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#shiftLeft"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.shiftLeft" title="Permalink to this definition">¶</a></dt>
<dd><p>Shift the given value <cite>numBits</cite> left.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">21</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">shiftLeft</span><span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'r'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=42)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.shiftRight">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">shiftRight</code><span class="sig-paren">(</span><em>col</em>, <em>numBits</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#shiftRight"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.shiftRight" title="Permalink to this definition">¶</a></dt>
<dd><p>Shift the given value <cite>numBits</cite> right.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="mi">42</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">shiftRight</span><span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'r'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=21)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.shiftRightUnsigned">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">shiftRightUnsigned</code><span class="sig-paren">(</span><em>col</em>, <em>numBits</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#shiftRightUnsigned"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.shiftRightUnsigned" title="Permalink to this definition">¶</a></dt>
<dd><p>Unsigned shift the given value <cite>numBits</cite> right.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="o">-</span><span class="mi">42</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">shiftRightUnsigned</span><span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'r'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=9223372036854775787)]</span>
</pre></div>
</div>
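<p>The result above follows from reinterpreting the 64-bit two's-complement value as unsigned before shifting. A minimal pure-Python sketch of that semantics (an illustration only, not Spark's implementation):</p>

```python
def shift_right_unsigned(value, num_bits, width=64):
    """Logical (unsigned) right shift of a two's-complement integer."""
    mask = (1 << width) - 1           # 0xFFFFFFFFFFFFFFFF for 64 bits
    return (value & mask) >> num_bits

# -42 reinterpreted as unsigned is 2**64 - 42; shifting right by 1 halves it
print(shift_right_unsigned(-42, 1))   # 9223372036854775787, as in the example above
```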
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.signum">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">signum</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.signum" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the signum of the given value.</p>
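<p>Since this entry ships without a doctest, here is a pure-Python sketch of the signum semantics (the helper name is hypothetical, not part of PySpark):</p>

```python
import math

def signum(x):
    """Return -1.0, 0.0, or 1.0 according to the sign of x."""
    return math.copysign(1.0, x) if x != 0 else 0.0

print(signum(-3.5), signum(0), signum(7))   # -1.0 0.0 1.0
```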
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sin">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sin</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.sin" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the sine of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sinh">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sinh</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.sinh" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the hyperbolic sine of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.size">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">size</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#size"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.size" title="Permalink to this definition">¶</a></dt>
<dd><p>Collection function: returns the length of the array or map stored in the column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>col</strong> – name of column or expression</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],),([</span><span class="mi">1</span><span class="p">],),([],)],</span> <span class="p">[</span><span class="s">'data'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">size</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">data</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(size(data)=3), Row(size(data)=1), Row(size(data)=0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sort_array">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sort_array</code><span class="sig-paren">(</span><em>col</em>, <em>asc=True</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#sort_array"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.sort_array" title="Permalink to this definition">¶</a></dt>
<dd><p>Collection function: sorts the input array for the given column in ascending order (or in descending order when <em>asc</em> is <code class="docutils literal"><span class="pre">False</span></code>).</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>col</strong> – name of column or expression</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([([</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">],),([</span><span class="mi">1</span><span class="p">],),([],)],</span> <span class="p">[</span><span class="s">'data'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">sort_array</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">data</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'r'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=[1, 2, 3]), Row(r=[1]), Row(r=[])]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">sort_array</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">asc</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'r'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=[3, 2, 1]), Row(r=[1]), Row(r=[])]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.soundex">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">soundex</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#soundex"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.soundex" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the SoundEx encoding for a string.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">"Peters"</span><span class="p">,),(</span><span class="s">"Uhrbach"</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'name'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">soundex</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"soundex"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(soundex=u'P362'), Row(soundex=u'U612')]</span>
</pre></div>
</div>
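<p>For intuition, a simplified sketch of the classic Soundex encoding (first letter plus up to three consonant codes, zero-padded; the H/W adjacency rule is omitted). This illustrates how the codes above arise and is not Spark's implementation:</p>

```python
def soundex(name):
    """Simplified Soundex: first letter + up to three digit codes."""
    codes = {}
    for digit, letters in [('1', 'BFPV'), ('2', 'CGJKQSXZ'),
                           ('3', 'DT'), ('4', 'L'), ('5', 'MN'), ('6', 'R')]:
        for ch in letters:
            codes[ch] = digit
    name = name.upper()
    result = name[0]
    prev = codes.get(name[0], '')
    for ch in name[1:]:
        code = codes.get(ch, '')
        if code and code != prev:   # skip vowels and repeated codes
            result += code
        prev = code                 # simplified: any non-coded letter resets
    return (result + '000')[:4]

print(soundex("Peters"), soundex("Uhrbach"))   # P362 U612
```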
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sparkPartitionId">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sparkPartitionId</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#sparkPartitionId"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.sparkPartitionId" title="Permalink to this definition">¶</a></dt>
<dd><p>A column for the partition ID of the Spark task.</p>
<p>Note that this is non-deterministic because it depends on data partitioning and task scheduling.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">repartition</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">sparkPartitionId</span><span class="p">()</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"pid"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(pid=0), Row(pid=0)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.split">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">split</code><span class="sig-paren">(</span><em>str</em>, <em>pattern</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#split"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.split" title="Permalink to this definition">¶</a></dt>
<dd><p>Splits str around pattern (pattern is a regular expression).</p>
<p>NOTE: <cite>pattern</cite> is a string representing the regular expression.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'ab12cd'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'s'</span><span class="p">,])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">split</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="s">'[0-9]+'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=[u'ab', u'cd'])]</span>
</pre></div>
</div>
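<p>The pattern is applied as a regular expression, so the run of digits acts as the delimiter. The Python standard library reproduces this particular example (Spark uses Java's regex dialect, which differs from Python's in some edge cases):</p>

```python
import re

# '[0-9]+' matches each run of digits, which therefore acts as a delimiter
print(re.split('[0-9]+', 'ab12cd'))   # ['ab', 'cd']
```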
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sqrt">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sqrt</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.sqrt" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the square root of the specified float value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.struct">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">struct</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#struct"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.struct" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates a new struct column.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>cols</strong> – list of column names (string) or list of <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> expressions</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">struct</span><span class="p">(</span><span class="s">'age'</span><span class="p">,</span> <span class="s">'name'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"struct"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(struct=Row(age=2, name=u'Alice')), Row(struct=Row(age=5, name=u'Bob'))]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">struct</span><span class="p">([</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">])</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"struct"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(struct=Row(age=2, name=u'Alice')), Row(struct=Row(age=5, name=u'Bob'))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.substring">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">substring</code><span class="sig-paren">(</span><em>str</em>, <em>pos</em>, <em>len</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#substring"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.substring" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the substring that starts at <cite>pos</cite> and is of length <cite>len</cite> when <cite>str</cite> is
StringType, or the slice of the byte array that starts at byte position <cite>pos</cite> and is of length <cite>len</cite>
when <cite>str</cite> is BinaryType.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'abcd'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'s'</span><span class="p">,])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">substring</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=u'ab')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.substring_index">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">substring_index</code><span class="sig-paren">(</span><em>str</em>, <em>delim</em>, <em>count</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#substring_index"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.substring_index" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the substring from string <cite>str</cite> before <cite>count</cite> occurrences of the delimiter <cite>delim</cite>.
If <cite>count</cite> is positive, everything to the left of the final delimiter (counting from the left) is
returned. If <cite>count</cite> is negative, everything to the right of the final delimiter (counting from the
right) is returned. substring_index performs a case-sensitive match when searching for <cite>delim</cite>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'a.b.c.d'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'s'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">substring_index</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="s">'.'</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=u'a.b')]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">substring_index</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">s</span><span class="p">,</span> <span class="s">'.'</span><span class="p">,</span> <span class="o">-</span><span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'s'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(s=u'b.c.d')]</span>
</pre></div>
</div>
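<p>The counting rule can be sketched in plain Python (an illustration of the semantics, with a hypothetical helper name):</p>

```python
def substring_index(s, delim, count):
    """Keep everything before (count > 0) or after (count < 0)
    the count-th occurrence of delim, counting from that side."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    return delim.join(parts[count:])

print(substring_index('a.b.c.d', '.', 2))    # 'a.b'
print(substring_index('a.b.c.d', '.', -3))   # 'b.c.d'
```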
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sum">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sum</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.sum" title="Permalink to this definition">¶</a></dt>
<dd><p>Aggregate function: returns the sum of all values in the expression.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.sumDistinct">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">sumDistinct</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.sumDistinct" title="Permalink to this definition">¶</a></dt>
<dd><p>Aggregate function: returns the sum of distinct values in the expression.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.tan">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">tan</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.tan" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the tangent of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.tanh">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">tanh</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.tanh" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes the hyperbolic tangent of the given value.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.toDegrees">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">toDegrees</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.toDegrees" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts an angle measured in radians to an approximately equivalent angle measured in degrees.</p>
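<p>Per element this corresponds to the standard-library conversions:</p>

```python
import math

# toDegrees corresponds element-wise to math.degrees,
# and toRadians (below) to math.radians
print(math.degrees(math.pi))   # 180.0
print(math.radians(180.0))     # ≈ math.pi
```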
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.toRadians">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">toRadians</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.toRadians" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts an angle measured in degrees to an approximately equivalent angle measured in radians.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.to_date">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">to_date</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#to_date"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.to_date" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts the column of StringType or TimestampType into DateType.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'1997-02-28 10:30:00'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'t'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">to_date</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">t</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'date'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(date=datetime.date(1997, 2, 28))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.to_utc_timestamp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">to_utc_timestamp</code><span class="sig-paren">(</span><em>timestamp</em>, <em>tz</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#to_utc_timestamp"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.to_utc_timestamp" title="Permalink to this definition">¶</a></dt>
<dd><p>Assumes the given timestamp is in the given timezone and converts it to UTC.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'1997-02-28 10:30:00'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'t'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">to_utc_timestamp</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">t</span><span class="p">,</span> <span class="s">"PST"</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'t'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(t=datetime.datetime(1997, 2, 28, 18, 30))]</span>
</pre></div>
</div>
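<p>The conversion above (10:30 Pacific time on 1997-02-28, i.e. UTC-8, becomes 18:30 UTC) can be reproduced with the standard library; here <code>"America/Los_Angeles"</code> stands in for the <code>"PST"</code> abbreviation (requires Python 3.9+ for <code>zoneinfo</code>):</p>

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Interpret the naive local time as Pacific time, then convert to UTC
local = datetime(1997, 2, 28, 10, 30, tzinfo=ZoneInfo("America/Los_Angeles"))
utc = local.astimezone(ZoneInfo("UTC"))
print(utc)   # 1997-02-28 18:30:00+00:00
```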
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.translate">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">translate</code><span class="sig-paren">(</span><em>srcCol</em>, <em>matching</em>, <em>replace</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#translate"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.translate" title="Permalink to this definition">¶</a></dt>
<dd><p>A function that translates any character in <cite>srcCol</cite> that appears in <cite>matching</cite>
to the character at the same position in <cite>replace</cite>. Characters in <cite>matching</cite>
that have no counterpart in <cite>replace</cite> (because <cite>replace</cite> is shorter) are
removed from the string.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'translate'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">translate</span><span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="s">"rnlt"</span><span class="p">,</span> <span class="s">"123"</span><span class="p">)</span> <span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'r'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(r=u'1a2s3ae')]</span>
</pre></div>
</div>
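<p>Python's built-in <code>str.translate</code> expresses the same mapping; characters of <cite>matching</cite> beyond the length of <cite>replace</cite> map to <code>None</code> (deletion), which is why the <code>t</code> disappears in the example above:</p>

```python
matching, replace = "rnlt", "123"
# Map each matching character to its positional replacement,
# or to None (delete) when replace is shorter
table = {ord(m): (replace[i] if i < len(replace) else None)
         for i, m in enumerate(matching)}
print("translate".translate(table))   # '1a2s3ae'
```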
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.trim">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">trim</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.trim" title="Permalink to this definition">¶</a></dt>
<dd><p>Trims spaces from both ends of the specified string column.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.trunc">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">trunc</code><span class="sig-paren">(</span><em>date</em>, <em>format</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#trunc"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.trunc" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns date truncated to the unit specified by the format.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>format</strong> – ‘year’, ‘YYYY’, ‘yy’ or ‘month’, ‘mon’, ‘mm’</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'1997-02-28'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'d'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">trunc</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="s">'year'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'year'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(year=datetime.date(1997, 1, 1))]</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">trunc</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">d</span><span class="p">,</span> <span class="s">'mon'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'month'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(month=datetime.date(1997, 2, 1))]</span>
</pre></div>
</div>
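<p>In plain Python terms, truncation resets the finer date fields to their minimum (a sketch with a hypothetical helper; Spark itself returns null for unsupported formats):</p>

```python
from datetime import date

def trunc(d, fmt):
    """Truncate a date to the first day of its year or month."""
    if fmt in ('year', 'yyyy', 'yy'):
        return date(d.year, 1, 1)
    if fmt in ('month', 'mon', 'mm'):
        return date(d.year, d.month, 1)
    return None   # unsupported format

print(trunc(date(1997, 2, 28), 'year'))   # 1997-01-01
print(trunc(date(1997, 2, 28), 'mon'))    # 1997-02-01
```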
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.udf">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">udf</code><span class="sig-paren">(</span><em>f</em>, <em>returnType=StringType</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#udf"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.udf" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> expression representing a user-defined function (UDF).</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pyspark.sql.types</span> <span class="kn">import</span> <span class="n">IntegerType</span>
<span class="gp">>>> </span><span class="n">slen</span> <span class="o">=</span> <span class="n">udf</span><span class="p">(</span><span class="k">lambda</span> <span class="n">s</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">),</span> <span class="n">IntegerType</span><span class="p">())</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">slen</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'slen'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(slen=5), Row(slen=3)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.3.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.unbase64">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">unbase64</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.unbase64" title="Permalink to this definition">¶</a></dt>
<dd><p>Decodes a BASE64 encoded string column and returns it as a binary column.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
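<p>BASE64 decoding itself is not Spark-specific; as a rough illustration of what <code class="docutils literal"><span class="pre">unbase64</span></code> computes per row, here is a standard-library sketch of the same encode/decode round trip (plain Python, not a Spark API):</p>

```python
# Sketch of the per-row transformation: base64-encode a byte string,
# then decode it back to the original binary value.
import base64

encoded = base64.b64encode(b'Spark')   # b'U3Bhcms='
decoded = base64.b64decode(encoded)    # b'Spark'
print(decoded)
```

In Spark, the same round trip would pair <code class="docutils literal"><span class="pre">base64</span></code> and <code class="docutils literal"><span class="pre">unbase64</span></code> on a string column, with the decoded result typed as binary.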
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.unhex">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">unhex</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#unhex"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.unhex" title="Permalink to this definition">¶</a></dt>
<dd><p>Inverse of hex. Interprets each pair of characters as a hexadecimal number
and converts it to the byte representation of that number.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'414243'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">unhex</span><span class="p">(</span><span class="s">'a'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(unhex(a)=bytearray(b'ABC'))]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.unix_timestamp">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">unix_timestamp</code><span class="sig-paren">(</span><em>timestamp=None</em>, <em>format='yyyy-MM-dd HH:mm:ss'</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#unix_timestamp"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.unix_timestamp" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts a time string with the given pattern (‘yyyy-MM-dd HH:mm:ss’ by default)
to a Unix timestamp (in seconds), using the default timezone and the default
locale. Returns null if the conversion fails.</p>
<p>If <cite>timestamp</cite> is None, the current timestamp is returned.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
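<p>The conversion can be sketched with the Python standard library: parse a time string against a pattern, then count seconds since the epoch. Note this is an illustration only; Spark interprets the pattern with Java's SimpleDateFormat syntax and uses the default timezone, whereas this sketch uses Python's strptime directives and treats the parsed value as UTC:</p>

```python
# Illustrative only: parse a time string and convert to epoch seconds.
# Spark pattern 'yyyy-MM-dd HH:mm:ss' corresponds to '%Y-%m-%d %H:%M:%S' here.
import calendar
from datetime import datetime

s = '2015-04-08 13:08:15'
dt = datetime.strptime(s, '%Y-%m-%d %H:%M:%S')
epoch_seconds = calendar.timegm(dt.timetuple())  # interprets dt as UTC
print(epoch_seconds)
```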
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.upper">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">upper</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.functions.upper" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts a string column to upper case.</p>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
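<p>Per row, this is equivalent to Python's <code class="docutils literal"><span class="pre">str.upper</span></code>; a minimal plain-Python sketch of the column transformation (not a Spark API):</p>

```python
# What upper computes for each value in a string column.
rows = ['Alice', 'Bob']
upper_rows = [s.upper() for s in rows]
print(upper_rows)
```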
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.weekofyear">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">weekofyear</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#weekofyear"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.weekofyear" title="Permalink to this definition">¶</a></dt>
<dd><p>Extracts the week number of a given date as an integer.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">weekofyear</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">a</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'week'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(week=15)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.when">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">when</code><span class="sig-paren">(</span><em>condition</em>, <em>value</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#when"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.when" title="Permalink to this definition">¶</a></dt>
<dd><p>Evaluates a list of conditions and returns one of multiple possible result expressions.
If <code class="xref py py-func docutils literal"><span class="pre">Column.otherwise()</span></code> is not invoked, None is returned for unmatched conditions.</p>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
<li><strong>condition</strong> – a boolean <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> expression.</li>
<li><strong>value</strong> – a literal value, or a <code class="xref py py-class docutils literal"><span class="pre">Column</span></code> expression.</li>
</ul>
</td>
</tr>
</tbody>
</table>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">when</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'age'</span><span class="p">]</span> <span class="o">==</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">otherwise</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"age"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=3), Row(age=4)]</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">when</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">==</span> <span class="mi">2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"age"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=3), Row(age=None)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.4.</span></p>
</div>
</dd></dl>
<dl class="function">
<dt id="pyspark.sql.functions.year">
<code class="descclassname">pyspark.sql.functions.</code><code class="descname">year</code><span class="sig-paren">(</span><em>col</em><span class="sig-paren">)</span><a class="reference internal" href="_modules/pyspark/sql/functions.html#year"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.functions.year" title="Permalink to this definition">¶</a></dt>
<dd><p>Extracts the year of a given date as an integer.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">df</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s">'2015-04-08'</span><span class="p">,)],</span> <span class="p">[</span><span class="s">'a'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">year</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s">'year'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(year=2015)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified">New in version 1.5.</span></p>
</div>
</dd></dl>
</div>
</div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper">
<p class="logo"><a href="index.html">
<img class="logo" src="_static/spark-logo-hd.png" alt="Logo"/>
</a></p>
<h3><a href="index.html">Table Of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">pyspark.sql module</a><ul>
<li><a class="reference internal" href="#module-pyspark.sql">Module Context</a></li>
<li><a class="reference internal" href="#module-pyspark.sql.types">pyspark.sql.types module</a></li>
<li><a class="reference internal" href="#module-pyspark.sql.functions">pyspark.sql.functions module</a></li>
</ul>
</li>
</ul>
<h4>Previous topic</h4>
<p class="topless"><a href="pyspark.mllib.html"
title="previous chapter">pyspark.mllib package</a></p>
<h4>Next topic</h4>
<p class="topless"><a href="pyspark.streaming.html"
title="next chapter">pyspark.streaming module</a></p>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="_sources/pyspark.sql.txt"
rel="nofollow">Show Source</a></li>
</ul>
</div>
<div id="searchbox" style="display: none" role="search">
<h3>Quick search</h3>
<form class="search" action="search.html" method="get">
<input type="text" name="q" />
<input type="submit" value="Go" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
<p class="searchtip" style="font-size: 90%">
Enter search terms or a module, class or function name.
</p>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related" role="navigation" aria-label="related navigation">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="pyspark.streaming.html" title="pyspark.streaming module"
>next</a></li>
<li class="right" >
<a href="pyspark.mllib.html" title="pyspark.mllib package"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">PySpark master documentation</a> »</li>
</ul>
</div>
<div class="footer" role="contentinfo">
© Copyright .
Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.3.1.
</div>
</body>
</html>