Spark Introduction
Introduction
Spark Architecture
○ Libraries: DataFrames, Streaming, MLlib, GraphX
○ Engine: Spark Core
○ Resource/cluster managers: Amazon EC2, Standalone, Apache Mesos, Hadoop YARN
○ Data sources: HDFS, HBase, Hive, Tachyon, Cassandra
Why Apache Spark?
Or, put differently: why is Apache Spark faster than MapReduce?
Hadoop Map Reduce - Multiple Phases
Shortcomings of MapReduce
1. Batch-oriented design
a. Every MapReduce cycle reads its input from HDFS and writes its output back to HDFS
b. The result is heavy latency
2. Converting logic to the MapReduce paradigm is difficult
3. In-memory computing is not possible
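The first shortcoming can be illustrated with a toy sketch in plain Python (no Hadoop or Spark involved). It assumes a local temporary directory as a stand-in for HDFS, and the stage functions and helper names are hypothetical. The disk-based runner writes every intermediate result to a file before the next stage reads it back, while the in-memory runner simply passes the records along.

```python
import json
import os
import tempfile


def square(x):
    return x * x


def add_one(x):
    return x + 1


def run_with_disk(data, stages, workdir):
    """MapReduce-style: every stage reads its input file and writes a new output file."""
    path = os.path.join(workdir, "stage0.json")
    with open(path, "w") as f:
        json.dump(data, f)
    for i, stage in enumerate(stages, start=1):
        with open(path) as f:
            records = json.load(f)          # read the previous stage's output
        path = os.path.join(workdir, f"stage{i}.json")
        with open(path, "w") as f:
            json.dump([stage(r) for r in records], f)  # write this stage's output
    with open(path) as f:
        return json.load(f)


def run_in_memory(data, stages):
    """In-memory style: intermediate results never leave memory."""
    records = list(data)
    for stage in stages:
        records = [stage(r) for r in records]
    return records


data = list(range(5))
stages = [square, add_one]

with tempfile.TemporaryDirectory() as workdir:
    disk_result = run_with_disk(data, stages, workdir)
    files_written = len(os.listdir(workdir))  # input file plus one file per stage

memory_result = run_in_memory(data, stages)
print(disk_result, memory_result, files_written)
```

Both runners produce the same records; the difference is the extra file written and read back for every stage, which is the cost the list above attributes to MapReduce.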
See: https://gist.github.com/jboner/2841832
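As a rough worked example of why disk round trips hurt, the sketch below uses the rounded, order-of-magnitude figures commonly quoted from that latency list (about 250 µs to read 1 MB sequentially from memory, 1 ms from SSD, 20 ms from a spinning disk). These constants are assumptions for illustration only; check the link for the exact values, and expect real hardware to differ.

```python
# Assumed, rounded latencies in nanoseconds per 1 MB read sequentially.
READ_1MB_MEMORY_NS = 250_000
READ_1MB_SSD_NS = 1_000_000
READ_1MB_DISK_NS = 20_000_000


def seconds_to_read(megabytes, ns_per_mb):
    """Time to read `megabytes` sequentially at the given per-MB cost."""
    return megabytes * ns_per_mb / 1e9


size_mb = 1024  # a hypothetical 1 GB intermediate result
for name, cost in [("memory", READ_1MB_MEMORY_NS),
                   ("SSD", READ_1MB_SSD_NS),
                   ("disk", READ_1MB_DISK_NS)]:
    print(f"{name}: {seconds_to_read(size_mb, cost):.3f} s")

disk_vs_memory = READ_1MB_DISK_NS / READ_1MB_MEMORY_NS
print(f"disk is about {disk_vs_memory:.0f}x slower than memory")
```

Under these assumed figures, re-reading an intermediate result from disk on every cycle costs roughly two orders of magnitude more than keeping it in memory.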
Getting Started - CloudxLab
Getting Started - Downloading
1. Find out your Hadoop version:
○ [student@hadoop1 ~]$ hadoop version
○ Hadoop 2.4.0.2.1.4.0-632
2. Go to https://spark.apache.org/downloads.html
3. Select the release built for your version of Hadoop and download it
4. On servers, you can download it with wget
5. Every download can be run in standalone mode
6. Extract the archive: tar -xzvf spark*.tgz
7. The bin folder inside the extracted folder contains the Spark commands
Getting Started - Binaries Overview
Binary        Description
spark-shell   Runs the Spark Scala interactive command line
pyspark       Runs the Spark Python interactive command line
sparkR        Runs R on Spark (/usr/spark2.6/bin/sparkR)
spark-submit  Submits a JAR or Python application for execution on the cluster
spark-sql     Runs the Spark SQL interactive shell
Starting Spark With Scala Interactive Shell
$ spark-shell
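The Spark shell is essentially the Scala REPL (interactive shell) with one extra predefined variable, "sc", the SparkContext.
To explore it in the Scala shell, type sc. and press Tab; dir(sc) and help(sc) are the Python equivalents for the pyspark shell.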
Starting Spark With Python Interactive Shell
$ pyspark
The SparkPi example bundled with Spark estimates π as the area of a circle of radius 1, by counting how many sampled points of the enclosing square fall inside the circle.
○ See https://en.wikipedia.org/wiki/Approximations_of_%CF%80#Summing_a_circle.27s_area
○ Code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala
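The idea behind the Pi example can be sketched in plain Python as a Monte Carlo estimate. This is a minimal local sketch, not the linked Scala code: the function names, the sample count, and the seed are assumptions chosen for illustration. Random points are drawn in the square [-1, 1] x [-1, 1], and the fraction landing inside the unit circle approximates π/4.

```python
import math
import random


def inside(_):
    """Return 1 if a random point in the enclosing square lies inside the unit circle."""
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0


def estimate_pi(samples, seed=None):
    """Estimate pi as 4 * (points inside the circle) / (points sampled)."""
    random.seed(seed)
    hits = sum(map(inside, range(samples)))
    return 4.0 * hits / samples


pi_estimate = estimate_pi(200_000, seed=42)
print(pi_estimate, abs(pi_estimate - math.pi))

# In the pyspark shell, the same map-then-sum shape could be distributed
# (not run here) as:
#   from operator import add
#   hits = sc.parallelize(range(samples)).map(inside).reduce(add)
```

The map step is independent for every sample, which is why this computation parallelizes well across a cluster.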
Getting Started - spark-submit
Getting Started - CloudxLab
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
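For readers who launch jobs from a script, the sketch below shows one way these two variables might be combined with a spark-submit call against YARN. It only builds the command and environment; nothing is executed. The helper name and the JAR path are hypothetical placeholders, and the configuration directory is the one from the export lines above.

```python
import os
import shlex


def build_submit(app_jar, main_class, app_args, conf_dir="/etc/hadoop/conf/"):
    """Build a spark-submit command for YARN plus the environment it needs."""
    env = dict(os.environ)
    env["YARN_CONF_DIR"] = conf_dir     # same value as the export lines
    env["HADOOP_CONF_DIR"] = conf_dir
    cmd = ["spark-submit", "--master", "yarn",
           "--class", main_class, app_jar, *app_args]
    return cmd, env


# The JAR path is a placeholder; point it at the examples JAR of your install.
cmd, env = build_submit("/path/to/spark-examples.jar",
                        "org.apache.spark.examples.SparkPi",
                        ["10"])
print(shlex.join(cmd))

# To actually launch the job (requires a working Spark and YARN setup):
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```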
Getting Started - CloudxLab
1. /usr/spark2.0.1/bin/spark-shell
2. /usr/spark1.6/bin/spark-shell
3. /usr/spark1.2.1/bin/spark-shell
Thank you!