
Apache Spark Primer
What is Apache Spark™?
Apache Spark is an open source data processing engine built for speed, ease of use, and sophisticated
analytics. Since its release, Spark has seen rapid adoption by enterprises across a wide range of
industries. Internet powerhouses such as Netflix, Yahoo, Baidu, and eBay have eagerly deployed Spark
at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes.
Meanwhile, Spark has built the largest open source community in big data, with over 1,000 contributors
from 250+ organizations. Together with the Spark community, Databricks continues to contribute
heavily to the Apache Spark project, through both development and community evangelism.

What is Apache Spark used for?


As a general purpose compute engine designed for distributed processing, Spark is used for many
types of data processing. It supports ETL, interactive queries (SQL), advanced analytics (e.g. machine
learning) and structured streaming over large datasets. For loading and storing data, Spark integrates
with many storage systems (e.g. HDFS, Cassandra, MySQL, HBase, MongoDB, S3). Spark is also
pluggable, with dozens of applications, data sources, and environments, forming an extensible open-
source ecosystem. Additionally, Spark supports a variety of popular development languages including
R, Java, Python and Scala.
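
To make this concrete, here is a minimal PySpark sketch of a typical ETL-plus-SQL workflow. The S3 path, dataset, and column names are illustrative assumptions, not part of any specific deployment.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("spark-primer-etl").getOrCreate()

    # Extract: read semi-structured JSON (could equally be Parquet, CSV, a JDBC table, etc.)
    events = spark.read.json("s3a://example-bucket/events/*.json")  # hypothetical path

    # Transform: cleanse and aggregate with the DataFrame API
    daily = (events
             .filter(F.col("user_id").isNotNull())
             .groupBy(F.to_date("timestamp").alias("day"))
             .count())

    # Query interactively with SQL over the same data
    daily.createOrReplaceTempView("daily_events")
    spark.sql("SELECT day, `count` FROM daily_events ORDER BY day").show()

    # Load: write the result back out, e.g. as Parquet
    daily.write.mode("overwrite").parquet("s3a://example-bucket/daily_events/")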

[Figure: a unified engine across data sources, applications, and environments. The DataFrames / SQL / Datasets APIs (Spark SQL, Spark Streaming, MLlib, GraphX) sit above the RDD API and Spark Core, run in environments such as YARN, and read data sources such as S3 and JSON.]

"At Databricks, we're working hard to make Spark easier to use and run than ever, through our efforts on both the Spark codebase and support materials around it. All of our work on Spark is open source and goes directly to Apache."

—Matei Zaharia, VP, Apache Spark, Founder & CTO, Databricks

How does Apache Spark work?


Spark takes programs written in a high-level, concise language and distributes the execution of their tasks across many machines. It achieves this through APIs such as DataFrames and Datasets built atop Resilient Distributed Datasets (RDDs) — a distributed dataset abstraction that performs calculations on large clusters in a fault-tolerant manner.
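
As a rough illustration of these two layers, the sketch below expresses the same word count with the low-level RDD API and with the higher-level DataFrame API; the tiny in-memory dataset is purely illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

    words = ["spark", "hadoop", "spark", "flink"]

    # Low-level RDD API: explicit functional transformations, executed in parallel on the cluster
    rdd = spark.sparkContext.parallelize(words)
    counts = rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.collect())

    # High-level DataFrame API: the same result expressed declaratively,
    # leaving room for Spark to optimize the physical execution
    df = spark.createDataFrame([(w,) for w in words], ["word"])
    df.groupBy("word").count().show()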

Spark's architecture differs from earlier approaches in several ways that improve its performance significantly. First, Spark allows users to take advantage of memory-centric computing architectures by persisting DataFrames, Datasets, and RDDs in memory, enabling fast iterative processing use cases such as interactive querying or machine learning. Second, Spark's high-level DataFrames and Datasets APIs enable further intelligent optimization of user programs. Third, Project Tungsten and the Catalyst optimizer, both part of the Spark SQL engine, significantly boost Spark's execution speed, in many cases by 5-10x.
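
A minimal sketch of the first and third points, assuming an existing SparkSession named `spark` and a hypothetical Parquet dataset: caching keeps data in memory for repeated queries, and explain() prints the plan produced by the Catalyst optimizer.

    # Assumes `spark` is an existing SparkSession; the Parquet path is a hypothetical dataset.
    transactions = spark.read.parquet("/data/transactions.parquet")

    transactions.cache()   # keep the DataFrame in memory across actions
    transactions.count()   # the first action materializes the cache

    # Later queries reuse the in-memory data instead of rereading from disk,
    # which is what makes iterative and interactive workloads fast
    high_value = transactions.filter(transactions["amount"] > 1000)
    high_value.explain()   # print the physical plan chosen by the Catalyst optimizer
    high_value.show(5)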

However, sheer performance is not the only distinctive feature of Spark. Its true power lies in
unity and versatility. Spark unifies previously disparate functionalities including batch processing,
advanced analytics, interactive exploration, and real-time stream processing into a single unified data
processing framework.
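
For example, the same DataFrame operations can be applied to a streaming source with Structured Streaming. The sketch below uses the built-in socket test source; the host and port settings are illustrative only.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("unified-streaming").getOrCreate()

    # Streaming read from the built-in socket test source (host/port are illustrative)
    lines = (spark.readStream.format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # The same transformation logic could be applied to a static DataFrame read with spark.read
    word_counts = (lines
                   .select(F.explode(F.split(lines.value, " ")).alias("word"))
                   .groupBy("word")
                   .count())

    query = (word_counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()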

[Figure: Apache Spark components. Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph computation), and SparkR (R on Spark), all built on the Spark Core engine.]
What are the benefits of Apache Spark?
Spark was initially designed for interactive queries and iterative algorithmic computation, as these were
two major use cases not well served by batch frameworks like MapReduce. Consequently, Spark excels
in scenarios that require fast performance, such as iterative processing, interactive querying, batch and
real-time streaming data processing, and graph computations. Developers and enterprises deploy Spark
because of its inherent benefits:

Simple
Easy-to-use, high-level declarative and unified APIs for operating on large datasets. This includes a collection of over 100 operators for transforming data, familiar DataFrame/Dataset domain-specific APIs for manipulating structured or semi-structured data, and a single point of entry for Spark applications to interact with Spark.

Speed
Engineered from the bottom up for performance, running up to 100x faster than Apache® Hadoop™ by exploiting in-memory computing and Tungsten's and Catalyst's code optimizations. Spark is also fast when data is stored on disk, and in 2014 Spark set the world record for large-scale on-disk sorting.

Simplified and Unified Engine
Apache Spark is packaged with unified higher-level API libraries for DataFrames/Datasets, including support for SQL queries, structured streaming, machine learning, and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows. Through unified DataFrames/Datasets built atop the Spark SQL engine and extended to Spark Streaming and the MLlib machine learning library, developers can write end-to-end continuous applications that perform advanced analytics on both static and continuous (including real-time) data. A short sketch of such a combined workflow follows this list.

Integrate Broadly
Built-in support for many data sources, such as HDFS, Kafka, RDBMS, S3, Cassandra, and MongoDB, and data formats, such as Parquet, JSON, CSV, TXT, and ORC.
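
As a sketch of how these libraries combine, the example below chains a Spark SQL query into an MLlib pipeline. The table schema, column names, and file path are assumptions made up for illustration, and `spark` is assumed to be an existing SparkSession.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    raw = spark.read.parquet("/data/labeled_events.parquet")  # hypothetical dataset
    raw.createOrReplaceTempView("events")

    # Spark SQL step: select and cleanse the training data
    training = spark.sql(
        "SELECT feature_a, feature_b, label FROM events WHERE label IS NOT NULL")

    # MLlib step: assemble features and fit a model on the same DataFrame
    assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, lr]).fit(training)

    # Score the data and inspect a few predictions
    model.transform(training).select("label", "prediction").show(5)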

What is the relationship between
Apache Spark and Apache Hadoop?
Spark is bigger than Hadoop in adoption and is widely used outside of Hadoop environments, since the Spark engine has no required dependency on the Hadoop stack. Around half of Spark users don't use Hadoop at all but instead run Spark directly against key-value stores or cloud storage. For instance, companies use Spark to crunch data in "NoSQL" data stores such as Cassandra and MongoDB, cloud storage offerings like Amazon S3, or traditional RDBMS data warehouses.

In the broader context of the Hadoop ecosystem, Spark interoperates seamlessly with the Hadoop stack. It can read from any input source that MapReduce supports, ingest data directly from Apache Hive warehouses, and run on top of the Apache Hadoop YARN resource manager.
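
A minimal sketch of that interoperability, assuming a cluster where YARN, HDFS, and a Hive metastore are already configured; the table name, paths, and query below are illustrative only.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("spark-on-hadoop")
             .master("yarn")           # run on the Apache Hadoop YARN resource manager
             .enableHiveSupport()      # read tables from an existing Apache Hive warehouse
             .getOrCreate())

    # Query a Hive table directly
    orders = spark.sql("SELECT * FROM warehouse.orders WHERE order_date >= '2017-01-01'")
    orders.show(5)

    # Read files from HDFS, the same inputs a MapReduce job would consume
    logs = spark.read.text("hdfs:///logs/app/*.log")
    print(logs.count())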

In the narrower context of the Hadoop MapReduce processing engine, Spark represents a modern alternative
to MapReduce, based on a more performance-oriented and feature-rich design. In many organizations, Spark
has succeeded MapReduce as the engine of choice for new projects, especially for projects involving multiple
processing models and workloads or where performance is mission critical. Spark is also evolving much
more rapidly than MapReduce, with significant feature additions occurring on a regular basis.


"MapReduce is an implementation of a design that was created more than 15 years ago. Apache Spark is a from-scratch reimagining, or re-architecting, of what you want out of an execution engine given today's hardware."

—Patrick Wendell, Founding Committer, Apache Spark & Co-founder, VP of Engineering, Databricks

What are some common Apache
Spark use cases?
Because of its unique combination of performance and versatility, over 1,000 organizations, in many industries and across a wide range of use cases, have deployed Spark. While innovators are constantly deploying Spark in creative and disruptive ways, common use cases include:

Data integration and ETL
Cleansing and combining data from diverse sources for later visualization, processing, or analysis. Examples include Edmunds' use of data integrity for improved customer experience and Netflix's productionizing of ETL at petascale.

Machine learning and advanced analytics
The application of sophisticated algorithms to predict outcomes, detect fraud, infer hidden information, or make decisions based on input data. Examples include Riot Games' advanced analytics over massive gaming data in the gaming sector, Alibaba's analysis of its marketplace in the retail sector, and Spotify's music recommendation engine in the media sector.

Interactive analytics or business intelligence
Gaining insight from massive data sets to inform product or business decisions, whether in ad hoc investigations or regularly planned dashboards. Examples include DNV GL's predictive analytics in the energy sector, Goldman Sachs' analytics platform in the financial sector, and Huawei's query platform in the telecom sector.

Real-time data processing
Capturing and processing data continuously with low latency and high reliability. Examples include Automatic's real-time analytics for smarter cars, Conviva's near real-time and offline analysis for its customers' online video businesses, Netflix's streaming recommendation engine in the media sector, and British Gas' connected homes in the energy sector.

High performance computation
Reducing the time to run complex algorithms against large-scale data. Examples include MyFitnessPal / Under Armour's food database in the health & wellness sector and Novartis' genomic research in the pharma sector.

Who is Databricks?
Databricks' vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache Spark™, a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest contributor to the open source Apache Spark project, providing 10x more code than any other company. The company has also trained over 40,000 users on Apache Spark and has the largest number of customers deploying Spark to date. Databricks provides a virtual analytics data platform to simplify data integration, real-time experimentation, and robust deployment of production applications.

For more information on Databricks, download the Databricks Primer.

Try Apache Spark on Databricks for free: databricks.com/try-databricks
Contact us for a personalized demo: databricks.com/contact-databricks

About Databricks:
Databricks’ mission is to accelerate innovation for its customers by unifying Data Science, Engineering and Business. Founded by the team who created Apache Spark™,
Databricks provides a Unified Analytics Platform for data science teams to collaborate with data engineering and lines of business to build data products. Users achieve faster
time-to-value with Databricks by creating analytic workflows that go from ETL and interactive exploration to production. The company also makes it easier for its users to focus
on their data by providing a fully managed, scalable, and secure cloud infrastructure that reduces operational complexity and total cost of ownership. Databricks, venture-
backed by Andreessen Horowitz and NEA, has a global customer base that includes Salesforce, Viacom, Amgen, Shell and HP. For more information, visit www.databricks.com.

© Databricks 2017. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation. Privacy Policy | Terms of Use
