0% found this document useful (0 votes)
47 views

Week 8_Lecture Notes

The document provides an introduction to Apache Spark, focusing on its framework, Resilient Distributed Datasets (RDDs), and applications like PageRank and GraphX. It discusses the need for Spark over MapReduce, emphasizing its efficiency in handling big data analytics through features like fault tolerance and data sharing. Additionally, the document touches on Spark's capabilities in various programming models and its applications in different domains.

Uploaded by

bheeshma9347
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
47 views

Week 8_Lecture Notes

The document provides an introduction to Apache Spark, focusing on its framework, Resilient Distributed Datasets (RDDs), and applications like PageRank and GraphX. It discusses the need for Spark over MapReduce, emphasizing its efficiency in handling big data analytics through features like fault tolerance and data sharing. Additionally, the document touches on Spark's capabilities in various programming models and its applications in different domains.

Uploaded by

bheeshma9347
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 75

Introduction to Spark

EL
PT
N
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Cloud Computing and DistributedVuSystems
Pham Introduction to Spark
Preface

Content of this Lecture:

In this lecture, we will discuss the ‘framework of

EL
spark’, Resilient Distributed Datasets (RDDs) and also
discuss some of its applications such as: Page rank and
GraphX.
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Need of Spark
Apache Spark is a big data analytics framework that
was originally developed at the University of
California, Berkeley's AMPLab, in 2012. Since then, it

EL
has gained a lot of attraction both in academia and in
industry.

PT
It is an another system for big data analytics
N
Isn’t MapReduce good enough?
Simplifies batch processing on large commodity clusters

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Need of Spark
Map Reduce

EL
Input
PT Output
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Need of Spark
Map Reduce

EL
Input
PT
Expensive save to disk for fault
tolerance
Output
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Need of Spark
MapReduce can be expensive for some applications e.g.,
Iterative
Interactive

EL
Lacks efficient data sharing

PT
Specialized frameworks did evolve for different programming
N
models
Bulk Synchronous Processing (Pregel)
Iterative MapReduce (Hadoop) ….

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Solution: Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)

Immutable, partitioned collection of records

EL
Built through coarse grained transformations (map, join …)
Can be cached for efficient reuse

PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Need of Spark
RDD RDD RDD

Read

EL
HDFS
Read
PT Cache
N

Map Reduce
Cloud Computing and DistributedVuSystems
Pham Introduction to Spark
Solution: Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs)

Immutable, partitioned collection of records

EL
Built through coarse grained transformations (map, join …)

Fault Recovery?
Lineage! PT
Log the coarse grained operation applied to a
N
partitioned dataset
Simply recompute the lost partition if failure occurs!
No cost if no failure

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
RDD RDD RDD

Read

EL
HDFS
Read Cache

PT
N
Map Reduce

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
EL
Read
HDFS Map Reduce
Lineage

PT Introduction to Spark
N

Vu Pham
RDD RDD RDD

Read

EL
HDFS RDDs track the graph of
Read transformations that built them Cache

PT
(their lineage) to rebuild lost data
N
Map Reduce

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
What can you do with Spark?
RDD operations
Transformations e.g., filter, join, map, group-by …
Actions e.g., count, print …

EL
Control

PT
Partitioning: Spark also gives you control over how you can
partition your RDDs.
N
Persistence: Allows you to choose whether you want to
persist RDD onto disk or not.

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Partitioning: PageRank

EL
Joins take place
repeatedly

PT Good partitioning
reduces shuffles
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Example: PageRank

Give pages ranks (scores) based on links to them


Links from many pages high rank

EL
Links from a high-rank page high rank

PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Algorithm
Step-1 Start each page at a rank of 1
Step-2 On each iteration, have page p contribute rankp/ |neighborsp|
to its neighbors

EL
Step-3 Set each page’s rank to 0.15 + 0.85 x contributions

PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Algorithm
Step-1 Start each page at a rank of 1
Step-2 On each iteration, have page p contribute rankp/ |neighborsp|
to its neighbors

EL
Step-3 Set each page’s rank to 0.15 + 0.85 x contributions

PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Algorithm
Step-1 Start each page at a rank of 1
Step-2 On each iteration, have page p contribute rankp/ |neighborsp|
to its neighbors

EL
Step-3 Set each page’s rank to 0.15 + 0.85 x contributions

PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Algorithm
Step-1 Start each page at a rank of 1
Step-2 On each iteration, have page p contribute rankp/ |neighborsp|
to its neighbors

EL
Step-3 Set each page’s rank to 0.15 + 0.85 x contributions

PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Algorithm
Step-1 Start each page at a rank of 1
Step-2 On each iteration, have page p contribute rankp/ |neighborsp|
to its neighbors

EL
Step-3 Set each page’s rank to 0.15 + 0.85 x contributions

PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Spark Program

val links = // RDD of (url, neighbors) pairs


var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {

EL
val contribs = links.join(ranks).flatMap {
case (url, (links, rank)) =>

} PT
links.map(dest => (dest, rank/links.size))

ranks = contribs.reduceByKey (_ + _)
N
.mapValues (0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
PageRank Performance

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Generality
RDDs allow unification of different programming models
Stream Processing
Graph Processing
Machine Learning

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Gather-Apply-Scatter on GraphX

Triplets

EL
PT
N
Graph Represented In a Table

Triplets

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Gather-Apply-Scatter on GraphX

EL
PT
N
Gather at A
Group-By A

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Gather-Apply-Scatter on GraphX

EL
PT
N
Apply
Map

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Gather-Apply-Scatter on GraphX

EL
PT
N
Scatter
Triplets Join

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
The GraphX API

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Graphs Property

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Creating a Graph (Scala)

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Graph Operations (Scala)

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Built-in Algorithms (Scala)

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
The triplets view

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
The subgraph transformation

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
The subgraph transformation

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Computation with aggregateMessages

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Computation with aggregateMessages

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Example: Graph Coarsening

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
How GraphX Works

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Storing Graphs as Tables

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Simple Operations
Reuse vertices or edges across multiple graphs

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Implementing triplets

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Implementing triplets

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Implementing aggregateMessages

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Future of GraphX
1. Language support
a) Java API
b) Python API: collaborating with Intel, SPARK-3789

EL
2. More algorithms
a) LDA (topic modeling)

PT
b) Correlation clustering
3. Research
N
a) Local graphs
b) Streaming/time-varying graphs
c) Graph database–like queries

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Other Spark Applications
i. Twitter spam classification

ii. EM algorithm for traffic prediction

EL
iii. K-means clustering

PT
iv. Alternating Least Squares matrix factorization
N
v. In-memory OLAP aggregation on Hive data

vi. SQL on Spark

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Reading Material
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin,
Scott Shenker, Ion Stoica
“Spark: Cluster Computing with Working Sets”

EL
Matei Zaharia, Mosharaf Chowdhury et al.

PT
“Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing”
N
https://spark.apache.org/

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Conclusion

RDDs (Resilient Distributed Datasets (RDDs) provide


a simple and efficient programming model

EL
Generalized to a broad set of applications

PT
Leverages coarse-grained nature of parallel
algorithms for failure recovery
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Spark
Introduction to Kafka

EL
PT
N
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Cloud Computing and DistributedVuSystems
Pham Introduction to Kafka
Preface
Content of this Lecture:

Define Kafka

EL
Describe some use cases for Kafka

PT
Describe the Kafka data model

Describe Kafka architecture


N
List the types of messaging systems

Explain the importance of brokers


Cloud Computing and DistributedVuSystems
Pham Introduction to Kafka
Batch vs. Streaming

Batch Streaming

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
EL
PT
N

Vu Pham
Introduction: Apache Kafka
Kafka is a high-performance, real-time messaging
system. It is an open source tool and is a part of Apache
projects.

EL
The characteristics of Kafka are:

PT
1. It is a distributed and partitioned messaging system.
2. It is highly fault-tolerant
N
3. It is highly scalable.
4. It can process and send millions of messages per second
to several receivers.

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Kafka History
Apache Kafka was originally developed by LinkedIn and
later, handed over to the open source community in early
2011.

EL
It became a main Apache project in October, 2012.
A stable Apache Kafka version 0.8.2.0 was release in Feb,
2015.
PT
A stable Apache Kafka version 0.8.2.1 was released in May,
N
2015, which is the latest version.

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Kafka Use Cases
Kafka can be used for various purposes in an organization,
such as:

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Apache Kafka: a Streaming Data Platform
 Most of what a business does can be thought as event
streams. They are in a
• Retail system: orders, shipments, returns, …
• Financial system: stock ticks, orders, …

EL
• Web site: page views, clicks, searches, …
• IoT: sensor readings, …

PT
and so on.N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Enter Kafka
Adopted at 1000s of companies worldwide

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Aggregating User Activity Using Kafka-Example

Kafka can be used to aggregate user activity data such as clicks,


navigation, and searches from different websites of an
organization; such user activities can be sent to a real-time
monitoring system and hadoop system for offline processing.

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Kafka Data Model
The Kafka data model consists of messages and topics.
Messages represent information such as, lines in a log file, a row of stock
market data, or an error message from a system.
Messages are grouped into categories called topics.
Example: LogMessage and Stock Message.

EL
The processes that publish messages into a topic in Kafka are known as
producers.
The processes that receive the messages from a topic in Kafka are known as
consumers.
PT
The processes or servers within Kafka that process the messages are known as
brokers.
N
A Kafka cluster consists of a set of brokers that process the messages.

Vu Pham Introduction to Kafka


Topics
A topic is a category of messages in Kafka.
The producers publish the messages into topics.
The consumers read the messages from topics.
A topic is divided into one or more partitions.

EL
A partition is also known as a commit log.
Each partition contains an ordered set of messages.

PT
Each message is identified by its offset in the partition.
Messages are added at one end of the partition and consumed
at the other.
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Partitions
Topics are divided into partitions, which are the unit of
parallelism in Kafka.

Partitions allow messages in a topic to be distributed to

EL
multiple servers.
A topic can have any number of partitions.
Each partition should fit in a single Kafka server.
PT
The number of partitions decide the parallelism of the topic.
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Partition Distribution
Partitions can be distributed across the Kafka cluster.
Each Kafka server may handle one or more partitions.
A partition can be replicated across several servers fro fault-tolerance.
One server is marked as a leader for the partition and the others are

EL
marked as followers.
The leader controls the read and write for the partition, whereas, the
followers replicate the data.

PT
If a leader fails, one of the followers automatically become the leader.
Zookeeper is used for the leader selection.
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Producers
The producer is the creator of the message in Kafka.

The producers place the message to a particular topic.


The producers also decide which partition to place the message into.

EL
Topics should already exist before a message is placed by the producer.
Messages are added at one end of the partition.

PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Consumers
The consumer is the receiver of the message in Kafka.

Each consumer belongs to a consumer group.


A consumer group may have one or more consumers.

EL
The consumers specify what topics they want to listen to.
A message is sent to all the consumers in a consumer group.
The consumer groups are used to control the messaging system.

PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Kafka Architecture
Kafka architecture consists of brokers that take messages from the
producers and add to a partition of a topic. Brokers provide the
messages to the consumers from the partitions.
• A topic is divided into multiple partitions.
• The messages are added to the partitions at one end and consumed in

EL
the same order.
• Each partition acts as a message queue.

PT
• Consumers are divided into consumer groups.
• Each message is delivered to one consumer in each consumer group.
• Zookeeper is used for coordination.
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Types of Messaging Systems
Kafka architecture supports the publish-subscribe and queue system.

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Example: Queue System

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Example: Publish-Subscribe System

EL
PT
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Brokers
Brokers are the Kafka processes that process the messages in Kafka.

• Each machine in the cluster can run one broker.

EL
• They coordinate among each other using Zookeeper.

PT
• One broker acts as a leader for a partition and handles the
delivery and persistence, where as, the others act as followers.
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Kafka Guarantees
Kafka guarantees the following:

1. Messages sent by a producer to a topic and a partition

EL
are appended in the same order

PT
2. A consumer instance gets the messages in the same
order as they are produced.
N
3. A topic with replication factor N, tolerates upto N-1
server failures.

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Replication in Kafka
Kafka uses the primary-backup method of replication.
One machine (one replica) is called a leader and is chosen
as the primary; the remaining machines (replicas) are

EL
chosen as the followers and act as backups.
The leader propagates the writes to the followers.

replicas. PT
The leader waits until the writes are completed on all the

If a replica is down, it is skipped for the write until it


N
comes back.
If the leader fails, one of the followers will be chosen as
the new leader; this mechanism can tolerate n-1 failures if
the replication factor is ‘n’
Cloud Computing and DistributedVuSystems
Pham Introduction to Kafka
Persistence in Kafka
Kafka uses the Linux file system for persistence of messages
Persistence ensures no messages are lost.
Kafka relies on the file system page cache for fast reads

EL
and writes.
All the data is immediately written to a file in file system.

writes. PT
Messages are grouped as message sets for more efficient

Message sets can be compressed to reduce network


N
bandwidth.
A standardized binary message format is used among
producers, brokers, and consumers to minimize data
modification.
Cloud Computing and DistributedVuSystems
Pham Introduction to Kafka
Apache Kafka: a Streaming Data Platform
 Apache Kafka is an open source streaming data platform (a new
category of software!) with 3 major components:
1. Kafka Core: A central hub to transport and store event
streams in real-time.

EL
2. Kafka Connect: A framework to import event streams from
other source data systems into Kafka and export event
streams from Kafka to destination data systems.
3.
they occur.
PT
Kafka Streams: A Java library to process event streams live as
N

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Further Learning
o Kafka Streams code examples
o Apache Kafka
https://github.com/apache/kafka/tree/trunk/streams/examples/src/main/java/org/apache/kafka/
streams/examples
o Confluent https://github.com/confluentinc/examples/tree/master/kafka-streams

EL
o Source Code https://github.com/apache/kafka/tree/trunk/streams
Kafka Streams Java docs

PT
o
http://docs.confluent.io/current/streams/javadocs/index.html

o First book on Kafka Streams (MEAP)


o Kafka Streams in Action https://www.manning.com/books/kafka-streams-in-action
N
o Kafka Streams download
o Apache Kafka https://kafka.apache.org/downloads
o Confluent Platform http://www.confluent.io/download

Cloud Computing and DistributedVuSystems


Pham Introduction to Kafka
Conclusion
Kafka is a high-performance, real-time messaging system.

Kafka can be used as an external commit log for distributed


systems.

EL
Kafka data model consists of messages and topics.

PT
Kafka architecture consists of brokers that take messages from the
producers and add to a partition of a topics.
N
Kafka architecture supports two types of messaging system called
publish-subscribe and queue system.

Brokers are the Kafka processes that process the messages in Kafka.
Cloud Computing and DistributedVuSystems
Pham Introduction to Kafka

You might also like