
Introduction To Apache Spark

The document provides an introduction to Apache Spark, detailing its genesis as a solution to the shortcomings of Hadoop in handling big data and distributed computing. It describes Spark as a unified engine for large-scale data processing, emphasizing its speed, ease of use, and modularity. Additionally, it outlines various use cases for Spark, including data science, machine learning, and real-time data processing.

Uploaded by

azamsyed811
Introduction to Apache Spark

Outline
• The Genesis of Spark

• What is Apache Spark?

• Getting Started with Spark

Reference:
• Chapter 1, “Learning Spark”, 2nd Edition. Authors: Jules S. Damji, Brooke Wenig,
Tathagata Das, Denny Lee. Publisher(s): O'Reilly Media, Inc. ISBN: 9781492050049
The Genesis of Spark
• Big Data and Distributed Computing at Google
o Google created the Google File System (GFS), MapReduce (MR), and Bigtable to handle the massive amounts of data on the Internet

• Hadoop at Yahoo!
o The open-source community, Yahoo! in particular, took an interest in Google's approach
o GFS provided a blueprint for the Hadoop Distributed File System (HDFS)
o Hadoop was donated to the Apache Software Foundation
o Shortcomings: difficult administration and management, operational complexity, weak fault tolerance in MapReduce, slow MR jobs

• Spark was developed to address these shortcomings of Hadoop

The Genesis of Spark
• Spark was developed to address the issues Hadoop had

Intermittent iteration of reads and writes between map and reduce computations
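
The cost of those intermediate reads and writes can be illustrated with a toy simulation. This is a minimal sketch in plain Python, not Hadoop or Spark code: the function names `run_on_disk` and `run_in_memory` and the three arithmetic stages are hypothetical, chosen only to show that a multi-stage job that persists every intermediate result pays one disk round trip per stage, while the same job kept in memory pays none.

```python
import json
import os
import tempfile


def run_on_disk(data, stages, workdir):
    """Run each stage, writing its output to disk and reading it back.

    This mimics the MapReduce-style pattern of persisting intermediate results.
    """
    disk_round_trips = 0
    for i, stage in enumerate(stages):
        data = [stage(x) for x in data]
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:   # write the intermediate result
            json.dump(data, f)
        with open(path) as f:        # the next stage reads it back
            data = json.load(f)
        disk_round_trips += 1
    return data, disk_round_trips


def run_in_memory(data, stages):
    """Run the same stages, keeping intermediate results in memory."""
    for stage in stages:
        data = [stage(x) for x in data]
    return data, 0


# Hypothetical three-stage job over a tiny data set.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_trips = run_on_disk([1, 2, 3], stages, workdir)
mem_result, mem_trips = run_in_memory([1, 2, 3], stages)

print(disk_result == mem_result, disk_trips, mem_trips)
```

Both runs produce the same result; only the number of disk round trips differs, and that count grows with the number of stages in the job.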

What Is Apache Spark?
● Apache Spark is a unified engine designed for large-scale distributed data processing, on premises in data centers or in the cloud.
● Design philosophy:
○ Speed
○ Ease of use
○ Modularity
○ Extensibility

Apache Spark’s ecosystem of connectors

What Is Apache Spark?
● Spark SQL: structured data (e.g., CSV, text, JSON, Avro, ORC, Parquet)
● Structured Streaming: real-time processing of a continually growing table
● MLlib: common machine learning algorithms
● GraphX: analyzing graphs and topologies using algorithms such as PageRank

Apache Spark components and API stack


Spark SQL
• Read from a JSON file stored on Amazon S3
• Create a temporary table, and
• Issue a SQL-like query on the results read into memory as a Spark DataFrame
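
The three steps above can be sketched in PySpark as follows. This is a minimal sketch under stated assumptions, not the example from the original slide: the S3 path, the view name `events`, and the column names in the query are all hypothetical placeholders, and reading from S3 assumes the cluster is already configured with an S3 connector and credentials.

```python
# Hypothetical placeholders: none of these names come from the slides.
JSON_PATH = "s3a://example-bucket/data/events.json"
VIEW_NAME = "events"
QUERY = """
    SELECT name, category, amount
    FROM events
    WHERE amount > 10
    ORDER BY amount DESC
"""


def query_json(spark, path, view_name, query):
    """Read JSON at `path`, register it as a temporary view, and run `query`."""
    df = spark.read.json(path)             # load the JSON file into a DataFrame
    df.createOrReplaceTempView(view_name)  # expose it to SQL as a temporary view
    return spark.sql(query)                # the query result is also a DataFrame


def main():
    # Imported here so the helper above can be read without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SparkSQLSketch").getOrCreate()
    results = query_json(spark, JSON_PATH, VIEW_NAME, QUERY)
    results.show(10)  # print the first 10 rows
    spark.stop()
```

Calling `main()` requires a working PySpark installation and a readable JSON file at the given path. In a Databricks notebook a `spark` session is already available, so `query_json(spark, ...)` can be called directly.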

Who Uses Spark, and for What?
Data Science, Data Engineering, Machine Learning

Some use cases:

• Processing large data sets in parallel across a cluster

• Performing ad hoc or interactive queries to explore and visualize data sets

• Building, training, and evaluating ML models using MLlib

• Implementing end-to-end data pipelines from myriad streams of data

• Analyzing graph data sets and social networks

Basic Operations a Data Scientist May Perform

Spark Ecosystem

Spark’s Distributed Execution

Spark Installation

Spark – Databricks Community Edition
1. Create a free Databricks account using this link:
https://databricks.com/try-databricks

2. When asked to select a cloud provider, click "Get started with Community Edition" towards the bottom (see screenshot)

3. Verify your email account by clicking the link sent to your email. Then log in here:
https://community.cloud.databricks.com/login.html
