
Spark Introduction

Spark is a fast and general engine for large-scale data processing. It runs programs up to 100x faster than Hadoop MapReduce when data fits in memory, and up to 10x faster on disk. Spark addresses the shortcomings of MapReduce: its batch-oriented design, the difficulty of expressing arbitrary logic in the MapReduce paradigm, and the lack of in-memory computing. Spark provides interactive shells for Scala, Python, and R, and a spark-submit command to execute applications on a cluster.


Introduction

• Really fast MapReduce
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk
• Builds on paradigms similar to MapReduce
• Integrated with Hadoop

Spark Core - A fast and general engine for large-scale data processing.

Introduction
Spark Architecture
The architecture diagram shows the Spark stack, top to bottom, with the data sources alongside:

• Languages: SQL, SparkR, Java, Python, Scala
• Libraries: DataFrames, Streaming, MLlib, GraphX
• Spark Core
• Resource/cluster managers: Hadoop YARN, Amazon EC2, Standalone, Apache Mesos
• Data sources: HDFS, HBase, Hive, Tachyon, Cassandra

Introduction
Why Apache Spark?
Or
Why is Apache Spark faster than MapReduce?
Why Apache Spark?

Hadoop MapReduce (Read input → Map → Reduce → Write output):

● User sends logic in the form of Map() and Reduce() functions (see the sketch below)
● The framework tries to execute that logic near the data
● Saves the result to HDFS
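To make "logic in the form of Map() and Reduce()" concrete, here is the classic word count expressed as the two functions a user would supply (illustrative Scala signatures, not the actual Hadoop API):

// map: for each input line, emit a (word, 1) pair per word
def map(offset: Long, line: String): Seq[(String, Int)] =
  line.split(" ").map(word => (word, 1)).toSeq

// reduce: for each word, sum the counts collected from all mappers
def reduce(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)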

Introduction
Hadoop Map Reduce - Multiple Phases

HDFS → Map → Reduce-1 → HDFS → Map → Reduce-2 → HDFS
(every phase reads its input from HDFS and writes its output back to HDFS)

Introduction
Shortcomings of MapReduce

1. Batch-oriented design
   a. Every map-reduce cycle reads from and writes to HDFS
   b. High latency
2. Converting logic to the map-reduce paradigm is difficult
3. In-memory computing was not possible

Introduction
Shortcomings of MapReduce

If the same pipeline kept its intermediate results in RAM instead of HDFS, it would run dramatically faster, since RAM is about 80 times faster than disk:

RAM → Map → Reduce-1 → RAM → Map → Reduce-2 → RAM

Latency Numbers Every Programmer Should Know
See: https://gist.github.com/jboner/2841832
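This is exactly what Spark enables: an intermediate dataset can be cached in memory and reused across jobs. A minimal sketch in the Scala shell (the HDFS path is hypothetical):

val words = sc.textFile("hdfs:///data/input.txt").flatMap(_.split(" "))
words.cache()                        // keep the RDD in RAM after it is first computed
words.count()                        // first job: reads from HDFS and fills the cache
words.filter(_ == "spark").count()   // second job: served from RAM, no HDFS re-read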

Introduction
Getting Started - CloudxLab

We have already installed Apache Spark on CloudxLab, so you don't have to install anything.

Simply log in to the web console and get started with the commands.

Introduction
Getting Started - Downloading
1. Find out the Hadoop version:
○ [student@hadoop1 ~]$ hadoop version
○ Hadoop 2.4.0.2.1.4.0-632
2. Go to https://spark.apache.org/downloads.html
3. Select the release built for your version of Hadoop and download it
4. On servers you can use wget (see the example below)
5. Every download can be run in standalone mode
6. Extract it: tar -xzvf spark*.tgz
7. Inside the extracted folder, the bin directory contains the Spark commands
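For example, assuming a Spark 1.6.0 package prebuilt for Hadoop 2.4 (the version numbers are illustrative; substitute the release you actually selected):

$ wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.4.tgz
$ tar -xzvf spark-1.6.0-bin-hadoop2.4.tgz
$ ls spark-1.6.0-bin-hadoop2.4/bin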

Introduction
Getting Started - Binaries Overview

Binary        Description
spark-shell   Runs the Spark Scala interactive command line
pyspark       Runs the Spark Python interactive command line
sparkR        Runs R on Spark (/usr/spark2.6/bin/sparkR)
spark-submit  Submits a jar or Python application for execution on the cluster
spark-sql     Runs the Spark SQL interactive shell

Introduction
Starting Spark With Scala Interactive Shell

$ spark-shell

It is basically the Scala REPL (interactive shell) with one extra variable, "sc" (the SparkContext).
Type sc. and press Tab to explore its methods.
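A quick sanity check inside spark-shell, using only the sc variable the shell already provides:

scala> val rdd = sc.parallelize(1 to 100)   // distribute a local range across the cluster
scala> rdd.filter(_ % 2 == 0).count()       // count the even numbers in parallel
res1: Long = 50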

Introduction
Starting Spark With Python Interactive Shell

$ pyspark

It is basically the Python interactive shell with one extra variable, "sc" (the SparkContext).
Check dir(sc) or help(sc).
Introduction
Getting Started - spark-submit
● To run an example:
○ spark-submit --class org.apache.spark.examples.SparkPi
/usr/hdp/current/spark-client/lib/spark-examples-*.jar 10

The example estimates π by sampling random points in a square and counting how many fall inside the inscribed circle of radius 1; the final argument (10) is the number of partitions (slices) to use.
○ See https://en.wikipedia.org/wiki/Approximations_of_%CF%80#Summing_a_circle.27s_area
○ Code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala
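The core of the example fits in a few lines; a simplified sketch of the idea (not the exact SparkPi source):

val n = 100000 * 10                       // total number of random points to sample
val inside = sc.parallelize(1 to n).map { _ =>
  val x = math.random * 2 - 1             // random point in the square [-1, 1] x [-1, 1]
  val y = math.random * 2 - 1
  if (x * x + y * y < 1) 1 else 0         // 1 if the point lands inside the unit circle
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * inside / n)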

Introduction
Getting Started - CloudxLab

To launch Spark on Hadoop (YARN), set the environment variables pointing to the Hadoop configuration:

export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
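With these set, the shells and spark-submit run against the YARN cluster instead of locally, for example:

$ spark-shell --master yarn

(On older 1.x releases the equivalent flag is --master yarn-client.)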

Introduction
Getting Started - CloudxLab

We have installed other versions too:

1. /usr/spark2.0.1/bin/spark-shell
2. /usr/spark1.6/bin/spark-shell
3. /usr/spark1.2.1/bin/spark-shell

Introduction

Thank you!
