---
layout: global
title: Home
type: page
navigation:
  weight: 1
  show: true
---

## What is Apache Spark?

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop. To make programming faster, Spark provides clean, concise APIs in Python, Scala and Java. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

## What can it do?

Spark was initially developed for two applications where placing data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can run up to 100x faster than Hadoop MapReduce. However, you can use Spark for general data processing too. Check out our example jobs.

Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive. While Spark is a new engine, it can access any data source supported by Hadoop, making it easy to run over existing data.

## Who uses it?

Spark was initially developed in the UC Berkeley AMPLab, but is now being used and developed at a wide array of companies, including Yahoo!, Conviva, and Quantifind. In total, over 20 companies have contributed code to Spark. Spark is open source under an Apache license, so download it to check it out.

## Apache Incubator notice

Apache Spark is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects.
While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

{% sidebar %}

Latest News

{% for post in site.categories.news limit:4 %}
- [{{ post.title }}]({{ post.url }})
{% endfor %}
News Archive
```scala
// Count the occurrences of each word in a text file on HDFS.
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
```

Word Count implemented in Spark
(Chart: Logistic regression performance in Spark vs Hadoop)

Download Spark

{% endsidebar %}
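The claim above — that keeping data in memory speeds up iterative algorithms — can be illustrated with a short sketch. This is a hypothetical snippet, not one of the site's examples: it assumes a running Spark shell where `spark` is the cluster context (as in the Word Count example), uses a placeholder HDFS path, and stands in a trivial per-pass computation for a real update step such as a logistic regression gradient.

```scala
// Sketch: cache a parsed dataset in cluster memory, then iterate over it.
// Assumes a Spark shell; "hdfs://..." is a placeholder path.
val points = spark.textFile("hdfs://...")
                  .map(line => line.split(" ").map(_.toDouble))
                  .cache()   // keep the parsed records in memory across passes

// Each pass after the first reads from memory rather than re-reading and
// re-parsing the file from HDFS, which is where the iterative speedup comes from.
for (i <- 1 to 10) {
  val total = points.map(p => p.sum).reduce(_ + _)  // stand-in for a real update step
  println("iteration " + i + ": " + total)
}
```

Without the `cache()` call, every iteration would re-scan the input from disk, which is the behavior this page contrasts with Hadoop MapReduce.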