From 673dcddb721241a6d7eef2d773a170a1e1a38202 Mon Sep 17 00:00:00 2001 From: Matei Alexandru Zaharia Date: Wed, 22 Jan 2014 20:33:24 +0000 Subject: Update site look and add pages for Streaming and MLlib This monster commit does a variety of things: - Update the site look and feel to be cleaner - Add top-level points to front page - Add a listing of related projects, and pages for those included in Spark - Reorganize docs and community pages - Make sure the site scales properly on mobile devices - Add tabs to let users view the examples in any programming language It's just a start, but should be a step towards a better web presence. --- research.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'research.md') diff --git a/research.md b/research.md index 858acbc31..c7f3070b4 100644 --- a/research.md +++ b/research.md @@ -23,14 +23,14 @@ Our goal was to design a programming model that supports a much wider class of a

-MapReduce and Dryad are suboptimal for these applications because they are based on acyclic data flow: an application has to run as a series of distinct jobs, each of which reads data from stable storage (e.g. a distributed file system) and writes it back to stable storage. They incur significant cost loading the data on each step and writing it back to replicated storage. +Traditional MapReduce and DAG engines are suboptimal for these applications because they are based on acyclic data flow: an application has to run as a series of distinct jobs, each of which reads data from stable storage (e.g. a distributed file system) and writes it back to stable storage. They incur significant cost loading the data on each step and writing it back to replicated storage.

-Spark offers an abstraction called resilient distributed datasets (RDDs) to support these applications efficiently. RDDs can be stored in memory between queries without requiring replication. Instead, they rebuild lost data on failure using lineage: each RDD remembers how it was built from other datasets (by transformations like map, join or group-by) to rebuild itself. RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics. We showed that RDDs can support a wide variety of iterative algorithms, as well as interactive data mining and a highly efficient SQL engine (the Shark project). +Spark offers an abstraction called resilient distributed datasets (RDDs) to support these applications efficiently. RDDs can be stored in memory between queries without requiring replication. Instead, they rebuild lost data on failure using lineage: each RDD remembers how it was built from other datasets (by transformations like map, join or groupBy) to rebuild itself. RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics. We showed that RDDs can support a wide variety of iterative algorithms, as well as interactive data mining and a highly efficient SQL engine (Shark).

-

You can find more about the research behind Spark in our papers:

+

You can find more about the research behind Spark in the following papers: