Diffstat (limited to 'docs/storage-openstack-swift.md')
-rw-r--r-- docs/storage-openstack-swift.md 152
1 files changed, 152 insertions, 0 deletions
diff --git a/docs/storage-openstack-swift.md b/docs/storage-openstack-swift.md
new file mode 100644
index 0000000000..c39ef1ce59
--- /dev/null
+++ b/docs/storage-openstack-swift.md
@@ -0,0 +1,152 @@
+---
+layout: global
+title: Accessing OpenStack Swift from Spark
+---
+
+Spark's support for Hadoop InputFormat allows it to process data in OpenStack Swift using the
+same URI formats as in Hadoop. You can specify a path in Swift as input through a
+URI of the form <code>swift://container.PROVIDER/path</code>. You will also need to set your
+Swift security credentials, through <code>core-site.xml</code> or via
+<code>SparkContext.hadoopConfiguration</code>.
+The current Swift driver requires Swift to use the Keystone authentication method.
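
As an illustration of the URI form above, here is a minimal Python sketch that assembles a `swift://container.PROVIDER/path` string. The helper name `swift_uri` and the container name `logs` are hypothetical and chosen only for this example; `SparkTest` is the provider name used later in this document.

```python
def swift_uri(container, provider, path):
    """Build a URI of the form swift://container.PROVIDER/path."""
    # Strip a leading slash so the result never contains "//" after the host part.
    return "swift://%s.%s/%s" % (container, provider, path.lstrip("/"))


# Hypothetical container "logs" served by the provider named "SparkTest".
uri = swift_uri("logs", "SparkTest", "/2014/data.txt")
print(uri)
```

With PySpark, a string built this way could then be passed to `sc.textFile(uri)`, assuming the Swift driver and credentials are already configured.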
+
+# Configuring Swift for Better Data Locality
+
+Although not mandatory, it is recommended to configure the proxy server of Swift with
+<code>list_endpoints</code> to have better data locality. More information is
+[available here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).
+
+
+# Dependencies
+
+The Spark application should include the <code>hadoop-openstack</code> dependency.
+For example, for Maven support, add the following to the <code>pom.xml</code> file:
+
+{% highlight xml %}
+<dependencyManagement>
+ <dependencies>
+ ...
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-openstack</artifactId>
+ <version>2.3.0</version>
+ </dependency>
+ ...
+ </dependencies>
+</dependencyManagement>
+{% endhighlight %}
+
+
+# Configuration Parameters
+
+Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code> directory.
+There are two main categories of parameters that should be configured: the declaration of the
+Swift driver and the parameters that are required by Keystone.
+
+Hadoop is configured to use the Swift file system through the following property:
+
+<table class="table">
+<tr><th>Property Name</th><th>Value</th></tr>
+<tr>
+ <td>fs.swift.impl</td>
+ <td>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</td>
+</tr>
+</table>
+
+Additional parameters are required by Keystone (v2.0) and should be provided to the Swift driver. These
+parameters are used to authenticate with Keystone in order to access Swift. The following table
+lists the Keystone parameters and whether each is mandatory. <code>PROVIDER</code> can be any name.
+
+<table class="table">
+<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
+<tr>
+ <td><code>fs.swift.service.PROVIDER.auth.url</code></td>
+ <td>Keystone Authentication URL</td>
+ <td>Mandatory</td>
+</tr>
+<tr>
+ <td><code>fs.swift.service.PROVIDER.auth.endpoint.prefix</code></td>
+ <td>Keystone endpoints prefix</td>
+ <td>Optional</td>
+</tr>
+<tr>
+ <td><code>fs.swift.service.PROVIDER.tenant</code></td>
+ <td>Tenant</td>
+ <td>Mandatory</td>
+</tr>
+<tr>
+ <td><code>fs.swift.service.PROVIDER.username</code></td>
+ <td>Username</td>
+ <td>Mandatory</td>
+</tr>
+<tr>
+ <td><code>fs.swift.service.PROVIDER.password</code></td>
+ <td>Password</td>
+ <td>Mandatory</td>
+</tr>
+<tr>
+ <td><code>fs.swift.service.PROVIDER.http.port</code></td>
+ <td>HTTP port</td>
+ <td>Mandatory</td>
+</tr>
+<tr>
+ <td><code>fs.swift.service.PROVIDER.region</code></td>
+ <td>Keystone region</td>
+ <td>Mandatory</td>
+</tr>
+<tr>
+ <td><code>fs.swift.service.PROVIDER.public</code></td>
+ <td>Indicates if all URLs are public</td>
+ <td>Mandatory</td>
+</tr>
+</table>
+
+For example, assume <code>PROVIDER=SparkTest</code> and Keystone contains a user <code>tester</code> with password <code>testing</code>
+defined for tenant <code>test</code>. Then <code>core-site.xml</code> should include:
+
+{% highlight xml %}
+<configuration>
+ <property>
+ <name>fs.swift.impl</name>
+ <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
+ </property>
+ <property>
+ <name>fs.swift.service.SparkTest.auth.url</name>
+ <value>http://127.0.0.1:5000/v2.0/tokens</value>
+ </property>
+ <property>
+ <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
+ <value>endpoints</value>
+ </property>
+ <property>
+ <name>fs.swift.service.SparkTest.http.port</name>
+ <value>8080</value>
+ </property>
+ <property>
+ <name>fs.swift.service.SparkTest.region</name>
+ <value>RegionOne</value>
+ </property>
+ <property>
+ <name>fs.swift.service.SparkTest.public</name>
+ <value>true</value>
+ </property>
+ <property>
+ <name>fs.swift.service.SparkTest.tenant</name>
+ <value>test</value>
+ </property>
+ <property>
+ <name>fs.swift.service.SparkTest.username</name>
+ <value>tester</value>
+ </property>
+ <property>
+ <name>fs.swift.service.SparkTest.password</name>
+ <value>testing</value>
+ </property>
+</configuration>
+{% endhighlight %}
+
+Note that
+<code>fs.swift.service.PROVIDER.tenant</code>,
+<code>fs.swift.service.PROVIDER.username</code>, and
+<code>fs.swift.service.PROVIDER.password</code> contain sensitive information, so keeping them in
+<code>core-site.xml</code> is not always a good approach.
+We suggest keeping those parameters in <code>core-site.xml</code> only for testing purposes when running Spark
+via <code>spark-shell</code>.
+For job submissions they should be provided via <code>sparkContext.hadoopConfiguration</code>.
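
One way to provide these settings programmatically is sketched below in Python. It only assembles the property names from the tables above and pushes them into any object that exposes a `set(key, value)` method, as Hadoop's `Configuration` does. The helper names `swift_service_properties` and `apply_properties`, and the default values for port, region, and `public`, are assumptions taken from the example `core-site.xml`, not part of any Spark or Hadoop API.

```python
SWIFT_IMPL = "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem"


def swift_service_properties(provider, auth_url, tenant, username, password,
                             http_port=8080, region="RegionOne", public=True):
    """Return the Hadoop properties for one Swift provider as a dict of strings."""
    prefix = "fs.swift.service.%s." % provider
    return {
        "fs.swift.impl": SWIFT_IMPL,
        prefix + "auth.url": auth_url,
        prefix + "tenant": tenant,
        prefix + "username": username,
        prefix + "password": password,
        prefix + "http.port": str(http_port),
        prefix + "region": region,
        prefix + "public": "true" if public else "false",
    }


def apply_properties(hadoop_conf, props):
    """Copy every property into an object that has a set(key, value) method."""
    for key, value in sorted(props.items()):
        hadoop_conf.set(key, value)


# Values mirror the SparkTest example; in a real job the credentials would
# come from a secure source rather than being hard-coded.
props = swift_service_properties(
    "SparkTest", "http://127.0.0.1:5000/v2.0/tokens", "test", "tester", "testing")
```

In a PySpark job, the Hadoop configuration is commonly reached through `sc._jsc.hadoopConfiguration()`, so a call such as `apply_properties(sc._jsc.hadoopConfiguration(), props)` is one possible use. Note that `_jsc` is an internal attribute, so treat that access path as an assumption rather than a stable interface.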