Apache Spark Primer 170303
What is Apache Spark™?
Apache Spark is an open source data processing engine built for speed, ease of use, and sophisticated
analytics. Since its release, Spark has seen rapid adoption by enterprises across a wide range of
industries. Internet powerhouses such as Netflix, Yahoo, Baidu, and eBay have eagerly deployed Spark
at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes.
Meanwhile, it has become the largest open source community in big data, with over 1,000 contributors
from 250+ organizations. Together with the Spark community, Databricks continues to contribute
heavily to the Apache Spark project, through both development and community evangelism.
[Figure: Apache Spark as a unified engine across data sources, applications, and environments. Layers shown: DataFrames / SQL / Datasets APIs, RDD API, Spark Core. Environments include YARN; data sources include S3 and JSON.]
“At Databricks, we’re working hard to make Spark easier to use and run than ever, through our efforts on both the
Spark codebase and support materials around it. All of our work on Spark is open source and goes directly to
Apache.”
Spark’s architecture differs from earlier approaches in several ways that improve its performance
significantly. First, Spark lets users take advantage of memory-centric computing architectures
by persisting DataFrames, Datasets, and RDDs in memory, which enables fast iterative processing for
use cases such as interactive querying or machine learning. Second, Spark’s high-level DataFrames and
Datasets APIs describe what a program computes rather than how, which gives the engine room to
optimize user programs. Third, Project Tungsten and the Catalyst optimizer, both part of the Spark SQL
engine, boost Spark’s execution speed, in many cases by 5-10x.
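The effect of in-memory persistence on iterative workloads can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark's API: the `ToyDataset` class, its `persist` and `collect` methods, and the `count_source_scans` helper are hypothetical names invented for this example. It assumes a lazily evaluated dataset that is rebuilt from its source on every use unless it has been marked as persisted.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """Hypothetical lazily evaluated dataset (an analogy, not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterable[float]]):
        self._compute = compute                    # recipe that rebuilds the data
        self._persist = False                      # whether results should be kept
        self._cache: Optional[List[float]] = None  # in-memory copy, if any

    def map(self, fn: Callable[[float], float]) -> "ToyDataset":
        # Lazy transformation: nothing is computed until collect() is called.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def persist(self) -> "ToyDataset":
        # Ask for the computed result to be kept in memory after first use.
        self._persist = True
        return self

    def collect(self) -> List[float]:
        if self._cache is not None:
            return self._cache                     # served from memory
        data = list(self._compute())               # recompute from the source
        if self._persist:
            self._cache = data
        return data


def count_source_scans(persist: bool, iterations: int = 5) -> int:
    """Run an iterative loop and report how often the source was read."""
    scans = 0

    def read_source() -> List[float]:
        nonlocal scans
        scans += 1                                 # stands in for a slow disk read
        return [1.0, 2.0, 3.0, 4.0]

    dataset = ToyDataset(read_source).map(lambda x: x * 2)
    if persist:
        dataset.persist()
    for _ in range(iterations):
        sum(dataset.collect())                     # each pass needs the full dataset
    return scans


if __name__ == "__main__":
    print("scans without persist:", count_source_scans(persist=False))
    print("scans with persist:   ", count_source_scans(persist=True))
```

In this model, an iterative job that makes five passes over the data reads the source five times without persistence and once with it, which is the kind of saving that matters for iterative algorithms and interactive queries.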
However, sheer performance is not the only distinctive feature of Spark. Its true power lies in its
unity and versatility. Spark brings previously disparate functions, including batch processing,
advanced analytics, interactive exploration, and real-time stream processing, into a single data
processing framework.
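One practical consequence of a unified framework is that the same processing logic can serve both batch and streaming workloads. The following plain-Python sketch illustrates that idea only; it does not use Spark. The function names (`count_words`, `run_batch`, `run_micro_batches`) are hypothetical, and the streaming side assumes a micro-batch model in which incoming records are processed in small slices and folded into running state.

```python
from collections import Counter
from typing import Iterable, List


def count_words(lines: Iterable[str]) -> Counter:
    """Shared logic: word counts for any collection of text lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_batch(lines: List[str]) -> Counter:
    # Batch mode: process the whole dataset in a single pass.
    return count_words(lines)


def run_micro_batches(lines: List[str], batch_size: int) -> Counter:
    # Streaming mode (micro-batch style): process small slices as they
    # "arrive" and merge each partial result into the running totals.
    running: Counter = Counter()
    for start in range(0, len(lines), batch_size):
        running += count_words(lines[start:start + batch_size])
    return running


if __name__ == "__main__":
    sample = ["spark is fast", "Spark is unified", "fast and unified"]
    print(run_batch(sample))
    print(run_micro_batches(sample, batch_size=2))
```

Both entry points reuse `count_words` unchanged, so the batch and micro-batch runs produce the same totals for the same input. A unified engine aims to offer that property at cluster scale, so teams do not have to maintain separate code paths for each processing model.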
[Figure: Apache Spark components: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph computation), and SparkR (R on Spark).]
What are the benefits of Apache Spark?
Spark was initially designed for interactive queries and iterative algorithmic computation, as these were
two major use cases not well served by batch frameworks like MapReduce. Consequently, Spark excels
in scenarios that require fast performance, such as iterative processing, interactive querying, batch and
real-time streaming data processing, and graph computations. Developers and enterprises deploy Spark
because of its inherent benefits:
What is the relationship between Apache Spark and Apache Hadoop?
Spark has broader adoption than Hadoop and is widely used outside of Hadoop environments, since the Spark
engine has no required dependency on the Hadoop stack. Around half of Spark users don’t use Hadoop and
instead run directly against key-value stores or cloud storage. For instance, companies use Spark to crunch
data in “NoSQL” data stores such as Cassandra and MongoDB, cloud storage offerings like Amazon S3, or
traditional RDBMS data warehouses.
In the broader context of the Hadoop ecosystem, Spark can interoperate seamlessly with the Hadoop stack.
It can read from any input source that MapReduce supports, ingest data directly from Apache Hive
warehouses, and run on top of the Apache Hadoop YARN resource manager.
In the narrower context of the Hadoop MapReduce processing engine, Spark represents a modern alternative
to MapReduce, based on a more performance-oriented and feature-rich design. In many organizations, Spark
has succeeded MapReduce as the engine of choice for new projects, especially those involving multiple
processing models and workloads or those where performance is mission-critical. Spark is also evolving much
more rapidly than MapReduce, with significant features added on a regular basis.
“MapReduce is an implementation of a design that was created more than 15 years ago. Apache Spark is a
from-scratch reimagined or re-architecting of what you want out of an execution engine given today’s hardware.”
—Patrick Wendell, Founding Committer, Apache Spark & Co-founder, VP of Engineering, Databricks
What are some common Apache Spark use cases?
Because of its unique combination of performance and versatility, over 1,000 organizations across many
industries have deployed Spark for a wide range of use cases. While innovators are constantly
deploying Spark in creative and disruptive ways, common use cases include:
Who is Databricks?
Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions.
The company was founded by the team who created Apache Spark™, a powerful open source data
processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest
contributor to the open source Apache Spark project, providing 10x more code than any other
company. The company has also trained over 40,000 users on Apache Spark and has the largest
number of customers deploying Spark to date. Databricks provides a virtual analytics data platform
to simplify data integration, real-time experimentation, and robust deployment of production
applications.
Try Apache Spark on Databricks for free
databricks.com/try-databricks
About Databricks:
Databricks’ mission is to accelerate innovation for its customers by unifying Data Science, Engineering and Business. Founded by the team who created Apache Spark™,
Databricks provides a Unified Analytics Platform for data science teams to collaborate with data engineering and lines of business to build data products. Users achieve faster
time-to-value with Databricks by creating analytic workflows that go from ETL and interactive exploration to production. The company also makes it easier for its users to focus
on their data by providing a fully managed, scalable, and secure cloud infrastructure that reduces operational complexity and total cost of ownership. Databricks,
venture-backed by Andreessen Horowitz and NEA, has a global customer base that includes Salesforce, Viacom, Amgen, Shell, and HP. For more information, visit www.databricks.com.
© Databricks 2017. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.