Seamlessly mix SQL queries with Spark programs.
Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Python, Scala, and Java. This tight integration makes it easy to run SQL queries alongside complex analytic algorithms.
Load and query data from a variety of sources.
SchemaRDDs provide a single interface for efficiently working with structured data, including Apache Hive tables, Parquet files, and JSON files.
Run unmodified Hive queries on existing warehouses.
Spark SQL reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs. Simply install it alongside Hive.
Connect through JDBC or ODBC.
Spark SQL includes a server mode with industry standard JDBC and ODBC connectivity.
Use the same engine for both interactive and long queries.
Spark SQL takes advantage of the RDD model to support mid-query fault tolerance, letting it scale to large jobs too. Don't worry about using a different engine for historical data.
Spark SQL is developed as part of Apache Spark. It thus gets tested and updated with each Spark release.
If you have questions about the system, ask on the Spark mailing lists.
The Spark SQL developers welcome contributions. If you'd like to help out, read how to contribute to Spark, and send us a patch!
To get started with Spark SQL: