
Spark Introduction

Spark is a fast and general engine for large-scale data processing. It runs programs up to 100x faster than Hadoop MapReduce when data fits in memory, and up to 10x faster on disk. Spark addresses the shortcomings of MapReduce: its batch-oriented design, the difficulty of expressing arbitrary logic in the MapReduce paradigm, and the lack of in-memory computing. Spark provides interactive shells for Scala, Python, and R, and a spark-submit command to execute applications on a cluster.


Introduction

• Really fast MapReduce
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk
• Builds on paradigms similar to MapReduce
• Integrated with Hadoop

Spark Core - A fast and general engine for large-scale data processing.

Introduction
Spark Architecture
The architecture diagram shows the Spark stack, top to bottom, with the data sources alongside:

• Languages: SQL, SparkR, Java, Python, Scala
• Libraries: DataFrames, Streaming, MLlib, GraphX
• Spark Core
• Resource/cluster managers: Hadoop YARN, Amazon EC2, Standalone, Apache Mesos
• Data sources: HDFS, HBase, Hive, Tachyon, Cassandra

Introduction
Why Apache Spark?
Or
Why is Apache Spark faster than MapReduce?
Why Apache Spark?

Hadoop MapReduce (Read input → Map → Reduce → Write output):

● User sends logic in the form of Map() and Reduce() functions (see the sketch below)
● The framework tries to execute that logic near the data
● Saves the result to HDFS
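To make "logic in the form of Map() and Reduce()" concrete, here is the classic word count expressed as the two functions a user would supply (illustrative Scala signatures, not the actual Hadoop API):

// map: for each input line, emit a (word, 1) pair per word
def map(offset: Long, line: String): Seq[(String, Int)] =
  line.split(" ").map(word => (word, 1)).toSeq

// reduce: for each word, sum the counts collected from all mappers
def reduce(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)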

Introduction
Hadoop Map Reduce - Multiple Phases

HDFS → Map → Reduce-1 → HDFS → Map → Reduce-2 → HDFS
(every phase reads its input from HDFS and writes its output back to HDFS)

Introduction
Shortcomings of MapReduce

1. Batch-oriented design
   a. Every map-reduce cycle reads from and writes to HDFS
   b. High latency
2. Converting logic to the map-reduce paradigm is difficult
3. In-memory computing was not possible

Introduction
Shortcomings of MapReduce

If the same pipeline kept its intermediate results in RAM instead of HDFS, it would run dramatically faster, since RAM is about 80 times faster than disk:

RAM → Map → Reduce-1 → RAM → Map → Reduce-2 → RAM

Latency Numbers Every Programmer Should Know
See: https://gist.github.com/jboner/2841832
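This is exactly what Spark enables: an intermediate dataset can be cached in memory and reused across jobs. A minimal sketch in the Scala shell (the HDFS path is hypothetical):

val words = sc.textFile("hdfs:///data/input.txt").flatMap(_.split(" "))
words.cache()                        // keep the RDD in RAM after it is first computed
words.count()                        // first job: reads from HDFS and fills the cache
words.filter(_ == "spark").count()   // second job: served from RAM, no HDFS re-read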

Introduction
Getting Started - CloudxLab

We have already installed Apache Spark on CloudxLab, so you don't have to install anything.

Simply log in to the web console and get started with the commands.

Introduction
Getting Started - Downloading
1. Find out the Hadoop version:
○ [student@hadoop1 ~]$ hadoop version
○ Hadoop 2.4.0.2.1.4.0-632
2. Go to https://spark.apache.org/downloads.html
3. Select the release built for your version of Hadoop and download it
4. On servers you can use wget (see the example below)
5. Every download can be run in standalone mode
6. Extract it: tar -xzvf spark*.tgz
7. Inside the extracted folder, the bin directory contains the Spark commands
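For example, assuming a Spark 1.6.0 package prebuilt for Hadoop 2.4 (the version numbers are illustrative; substitute the release you actually selected):

$ wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.4.tgz
$ tar -xzvf spark-1.6.0-bin-hadoop2.4.tgz
$ ls spark-1.6.0-bin-hadoop2.4/bin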

Introduction
Getting Started - Binaries Overview

Binary        Description
spark-shell   Runs the Spark Scala interactive command line
pyspark       Runs the Spark Python interactive command line
sparkR        Runs R on Spark (/usr/spark2.6/bin/sparkR)
spark-submit  Submits a jar or Python application for execution on the cluster
spark-sql     Runs the Spark SQL interactive shell

Introduction
Starting Spark With Scala Interactive Shell

$ spark-shell

It is basically the Scala REPL (interactive shell) with one extra variable, "sc" (the SparkContext).
Type sc. and press Tab to explore its methods.
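A quick sanity check inside spark-shell, using only the sc variable the shell already provides:

scala> val rdd = sc.parallelize(1 to 100)   // distribute a local range across the cluster
scala> rdd.filter(_ % 2 == 0).count()       // count the even numbers in parallel
res1: Long = 50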

Introduction
Starting Spark With Python Interactive Shell

$ pyspark

It is basically the Python interactive shell with one extra variable, "sc" (the SparkContext).
Check dir(sc) or help(sc).
Introduction
Getting Started - spark-submit
● To run an example:
○ spark-submit --class org.apache.spark.examples.SparkPi
/usr/hdp/current/spark-client/lib/spark-examples-*.jar 10

The example estimates π by sampling random points in a square and counting how many fall inside the inscribed circle of radius 1; the final argument (10) is the number of partitions (slices) to use.
○ See https://en.wikipedia.org/wiki/Approximations_of_%CF%80#Summing_a_circle.27s_area
○ Code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala
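The core of the example fits in a few lines; a simplified sketch of the idea (not the exact SparkPi source):

val n = 100000 * 10                       // total number of random points to sample
val inside = sc.parallelize(1 to n).map { _ =>
  val x = math.random * 2 - 1             // random point in the square [-1, 1] x [-1, 1]
  val y = math.random * 2 - 1
  if (x * x + y * y < 1) 1 else 0         // 1 if the point lands inside the unit circle
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * inside / n)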

Introduction
Getting Started - CloudxLab

To launch Spark on Hadoop (YARN), set the environment variables pointing to the Hadoop configuration:

export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
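With these set, the shells and spark-submit run against the YARN cluster instead of locally, for example:

$ spark-shell --master yarn

(On older 1.x releases the equivalent flag is --master yarn-client.)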

Introduction
Getting Started - CloudxLab

We have installed other versions too:

1. /usr/spark2.0.1/bin/spark-shell
2. /usr/spark1.6/bin/spark-shell
3. /usr/spark1.2.1/bin/spark-shell

Introduction

Thank you!
