Week 8_Lecture Notes
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Cloud Computing and Distributed Systems
Introduction to Spark
Preface
Content of this Lecture:
In this lecture, we will discuss the framework of Spark and Resilient Distributed Datasets (RDDs), and also discuss some of its applications, such as PageRank and GraphX.
Spark has gained a lot of traction both in academia and in industry.
It is another system for big data analytics.
Isn’t MapReduce good enough?
Simplifies batch processing on large commodity clusters
[Figure: MapReduce data flow from Input to Output]
Expensive save to disk for fault tolerance
Lacks efficient data sharing
Specialized frameworks did evolve for different programming models:
Bulk Synchronous Processing (Pregel)
Iterative MapReduce (Hadoop) ….
Solution: Resilient Distributed Datasets (RDDs)
Built through coarse-grained transformations (map, join …)
Can be cached for efficient reuse
Fault recovery? Lineage!
Log the coarse-grained operation applied to a partitioned dataset
Simply recompute the lost partition if a failure occurs!
No cost if no failure
[Figure: Read from HDFS, Cache, Map, Reduce]
[Figure: Read from HDFS, Cache, Map, Reduce, with an RDD at each stage]
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data.
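The lineage idea can be illustrated with a small sketch in plain Scala collections. This is not the Spark API: ToyRDD, source, derived, cache, and read are hypothetical names chosen for illustration, and the "cluster" is a single process. Each dataset stores only the recipe for computing a partition, so a lost cached partition is rebuilt by replaying that recipe.

```scala
// Toy model of lineage-based recovery (plain Scala, NOT the Spark API).
// All names here (ToyRDD, source, derived, cache, read) are hypothetical.
object LineageDemo {
  // A dataset is described by how to compute each partition, not by stored data.
  final case class ToyRDD[A](numPartitions: Int, compute: Int => Seq[A]) {
    // Coarse-grained transformations: one function applied to every record.
    def map[B](f: A => B): ToyRDD[B] =
      ToyRDD(numPartitions, p => compute(p).map(f))
    def filter(f: A => Boolean): ToyRDD[A] =
      ToyRDD(numPartitions, p => compute(p).filter(f))
  }

  // Stand-in for stable storage such as HDFS: two partitions of integers.
  val source: ToyRDD[Int] =
    ToyRDD(2, p => if (p == 0) Seq(1, 2, 3) else Seq(4, 5, 6))

  // Lineage of `derived`: source -> map(_ * 10) -> filter(_ > 20)
  val derived: ToyRDD[Int] = source.map(_ * 10).filter(_ > 20)

  // Simulated in-memory cache of computed partitions.
  val cache = scala.collection.mutable.Map[Int, Seq[Int]]()

  // Serve a partition from the cache, recomputing it from lineage on a miss.
  def read(p: Int): Seq[Int] = cache.getOrElseUpdate(p, derived.compute(p))

  def main(args: Array[String]): Unit = {
    println(read(0))
    println(read(1))
    cache.remove(1)   // simulate losing partition 1
    println(read(1))  // rebuilt by replaying the logged transformations
  }
}
```

Only the lost partition is recomputed, and nothing extra is written during normal operation, which matches the "no cost if no failure" point above.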
Control
Partitioning: Spark also gives you control over how you partition your RDDs.
Persistence: Allows you to choose whether or not to persist an RDD onto disk.
Joins take place repeatedly
Good partitioning reduces shuffles
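A minimal sketch of why partitioning helps repeated joins, written with plain Scala collections rather than the Spark API. The helper names (partitionOf, partitionBy, localJoin) and the sample data are assumptions made for illustration. When both datasets are split with the same hash function, matching keys sit in the same bucket, so each bucket can be joined locally without moving data.

```scala
// Toy model of co-partitioned joins (plain Scala, NOT the Spark API).
// partitionOf, partitionBy, and localJoin are hypothetical helper names.
object PartitioningDemo {
  val numPartitions = 2

  // Hash a key to a partition, so equal keys always land in the same bucket.
  def partitionOf(key: String): Int = Math.floorMod(key.hashCode, numPartitions)

  // Split a keyed dataset into numPartitions buckets.
  def partitionBy[V](data: Seq[(String, V)]): Vector[Seq[(String, V)]] =
    Vector.tabulate(numPartitions)(p => data.filter { case (k, _) => partitionOf(k) == p })

  // Join two buckets that live on the same (simulated) machine.
  def localJoin[A, B](l: Seq[(String, A)], r: Seq[(String, B)]): Seq[(String, (A, B))] =
    for ((k, a) <- l; (k2, b) <- r if k == k2) yield (k, (a, b))

  // Sample data in the shape used by PageRank: (url, neighbors) and (url, rank).
  val links: Seq[(String, Seq[String])] =
    Seq("a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))
  val ranks: Seq[(String, Double)] = Seq("a" -> 1.0, "b" -> 1.0, "c" -> 1.0)

  val linkParts = partitionBy(links)
  val rankParts = partitionBy(ranks)

  // Bucket p of links only ever needs bucket p of ranks: no cross-bucket traffic.
  val joined: Seq[(String, (Seq[String], Double))] =
    (0 until numPartitions).flatMap(p => localJoin(linkParts(p), rankParts(p)))

  def main(args: Array[String]): Unit = joined.foreach(println)
}
```

In an iterative job the links dataset would be partitioned once and reused, so only the smaller ranks dataset needs to be placed to match it in each iteration.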
Links from a high-rank page → high rank
Step-3 Set each page’s rank to 0.15 + 0.85 x contributions
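// links: RDD of (url, neighbors) pairs; ranks: RDD of (url, rank) pairs
// The loop header was lost in extraction; ITERATIONS is an assumed constant.
for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      // Each page sends rank / (number of neighbors) to every neighbor
      links.map(dest => (dest, rank / links.size))
  }
  // Sum the contributions per page, then apply the rank formula
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)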
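To make the update rule concrete, here is a small sketch that runs the same steps on in-memory Scala collections instead of RDDs. The three-page graph and the names PageRankDemo and pageRank are assumptions for illustration. groupBy followed by a sum stands in for reduceByKey, and a direct map lookup stands in for the join.

```scala
// PageRank on plain Scala collections (a sketch, NOT the Spark API).
// The graph below and the names PageRankDemo/pageRank are hypothetical.
object PageRankDemo {
  // Each page maps to the pages it links to; every page has an incoming link.
  val links: Map[String, Seq[String]] = Map(
    "a" -> Seq("b", "c"),
    "b" -> Seq("c"),
    "c" -> Seq("a")
  )

  def pageRank(iterations: Int): Map[String, Double] = {
    // Start each page at rank 1.0
    var ranks: Map[String, Double] = links.keys.map(_ -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // Each page contributes rank / (number of neighbors) to its neighbors
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap {
        case (url, dests) =>
          val rank = ranks(url)
          dests.map(dest => (dest, rank / dests.size))
      }
      // Sum contributions per page, then set rank to 0.15 + 0.85 * contributions
      ranks = contribs.groupBy(_._1).map {
        case (url, cs) => url -> (0.15 + 0.85 * cs.map(_._2).sum)
      }
    }
    ranks
  }

  def main(args: Array[String]): Unit =
    pageRank(10).toSeq.sortBy(_._1).foreach(println)
}
```

After one iteration on this graph, page b receives half of a's rank, giving 0.15 + 0.85 × 0.5 = 0.575, while page c receives contributions from both a and b.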
Graph Represented In a Table: Triplets
Gather at A: Group-By A
Apply: Map
Scatter: Triplets Join
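The mapping of gather, apply, and scatter onto table operators can be sketched with plain Scala collections. This is not the GraphX API: Edge, Triplet, and the functions below are hypothetical names, and the "sum of incoming source values" rule is an arbitrary example of a vertex program.

```scala
// Gather-Apply-Scatter expressed as table operations (a sketch, NOT the GraphX API).
// Edge, Triplet, triplets, gather, applyStep, and step are hypothetical names.
object GasDemo {
  final case class Edge(src: String, dst: String)
  final case class Triplet(src: String, srcAttr: Double, dst: String, dstAttr: Double)

  val vertices: Map[String, Double] = Map("A" -> 1.0, "B" -> 2.0, "C" -> 3.0)
  val edges: Seq[Edge] = Seq(Edge("B", "A"), Edge("C", "A"), Edge("A", "B"))

  // Triplets: join each edge with the attributes of its two endpoints.
  def triplets(vs: Map[String, Double]): Seq[Triplet] =
    edges.map(e => Triplet(e.src, vs(e.src), e.dst, vs(e.dst)))

  // Gather: group triplets by destination vertex and sum the source values.
  def gather(ts: Seq[Triplet]): Map[String, Double] =
    ts.groupBy(_.dst).map { case (v, in) => v -> in.map(_.srcAttr).sum }

  // Apply: map over the vertices; a vertex with no messages keeps its old value.
  def applyStep(vs: Map[String, Double], msgs: Map[String, Double]): Map[String, Double] =
    vs.map { case (v, old) => v -> msgs.getOrElse(v, old) }

  // Scatter: re-join the updated vertices with the edges to form new triplets.
  def step(vs: Map[String, Double]): (Map[String, Double], Seq[Triplet]) = {
    val updated = applyStep(vs, gather(triplets(vs)))
    (updated, triplets(updated))
  }

  def main(args: Array[String]): Unit = {
    val (updated, next) = step(vertices)
    println(updated)
    next.foreach(println)
  }
}
```

Here vertex A gathers the values of B and C, so its new value is 2.0 + 3.0 = 5.0, and the next round's triplets already carry that updated value.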
2. More algorithms
a) LDA (topic modeling)
b) Correlation clustering
3. Research
a) Local graphs
b) Streaming/time-varying graphs
c) Graph database–like queries
iii. K-means clustering
iv. Alternating Least Squares matrix factorization
v. In-memory OLAP aggregation on Hive data
Matei Zaharia, Mosharaf Chowdhury et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”
https://spark.apache.org/
Generalized to a broad set of applications
Leverages the coarse-grained nature of parallel algorithms for failure recovery
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Cloud Computing and Distributed Systems
Introduction to Kafka
Preface
Content of this Lecture:
Define Kafka
Describe some use cases for Kafka
Describe the Kafka data model
[Figure: Batch and Streaming]
Introduction: Apache Kafka
Kafka is a high-performance, real-time messaging system. It is an open-source tool and is part of the Apache projects.
The characteristics of Kafka are:
1. It is a distributed and partitioned messaging system.
2. It is highly fault-tolerant.
3. It is highly scalable.
4. It can process and send millions of messages per second to several receivers.
It became a top-level Apache project in October 2012.
A stable Apache Kafka version 0.8.2.0 was released in February 2015.
A stable Apache Kafka version 0.8.2.1 was released in May 2015, which was the latest version at the time of writing.
• Web site: page views, clicks, searches, …
• IoT: sensor readings, …
and so on.
The processes that publish messages into a topic in Kafka are known as producers.
The processes that receive the messages from a topic in Kafka are known as consumers.
The processes or servers within Kafka that process the messages are known as brokers.
A Kafka cluster consists of a set of brokers that process the messages.
A partition is also known as a commit log.
Each partition contains an ordered set of messages.
Each message is identified by its offset in the partition.
Messages are added at one end of the partition and consumed at the other.
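The commit-log view of a partition can be sketched as an append-only buffer in which a message's position is its offset. This is an in-memory illustration in plain Scala, not the Kafka client API; PartitionLog, append, and read are hypothetical names.

```scala
// In-memory stand-in for one Kafka partition (a sketch, NOT the Kafka API).
// PartitionLog, append, and read are hypothetical names.
object PartitionLogDemo {
  final class PartitionLog {
    private val messages = scala.collection.mutable.ArrayBuffer[String]()

    // Append at the tail of the log and return the offset given to the message.
    def append(msg: String): Long = {
      messages += msg
      messages.size - 1L
    }

    // Read up to `max` messages starting at `offset`, paired with their offsets.
    def read(offset: Long, max: Int): List[(Long, String)] =
      messages
        .slice(offset.toInt, offset.toInt + max)
        .zipWithIndex
        .map { case (m, i) => (offset + i, m) }
        .toList
  }

  def main(args: Array[String]): Unit = {
    val log = new PartitionLog
    Seq("m0", "m1", "m2").foreach(log.append)
    // A consumer tracks its own position and asks for messages from there on.
    println(log.read(1L, 2))
  }
}
```

Because reading does not remove anything from the log, two consumers can hold different offsets into the same partition and proceed independently.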
Partitions allow the messages of a topic to be distributed across multiple servers.
A topic can have any number of partitions.
Each partition should fit in a single Kafka server.
The number of partitions determines the parallelism of the topic.
For each partition, one server acts as the leader and the others are marked as followers.
The leader controls the reads and writes for the partition, whereas the followers replicate the data.
If a leader fails, one of the followers automatically becomes the leader.
Zookeeper is used for the leader selection.
Topics should already exist before a message is placed by the producer.
Messages are added at one end of the partition.
The consumers specify what topics they want to listen to.
Each message is delivered to one consumer within each subscribing consumer group.
The consumer groups are used to control the messaging system.
the same order.
• Each partition acts as a message queue.
• Consumers are divided into consumer groups.
• Each message is delivered to one consumer in each consumer group.
• Zookeeper is used for coordination.
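The delivery rule for consumer groups can be sketched as follows. This is plain Scala, not the Kafka client API. The round-robin assignment is an assumption made for illustration (Kafka performs the actual assignment through its own group coordination), and assign and receivers are hypothetical names.

```scala
// Consumer-group delivery in miniature (a sketch, NOT the Kafka API).
// assign and receivers are hypothetical names; round-robin is an assumed policy.
object ConsumerGroupDemo {
  // Give each partition to exactly one consumer of a group, in round-robin order.
  def assign(partitions: Seq[Int], consumers: Seq[String]): Map[Int, String] =
    partitions.zipWithIndex.map { case (p, i) => p -> consumers(i % consumers.size) }.toMap

  // For a message in `partition`, find the single receiving consumer in every group.
  def receivers(partition: Int,
                partitions: Seq[Int],
                groups: Map[String, Seq[String]]): Map[String, String] =
    groups.map { case (group, members) => group -> assign(partitions, members)(partition) }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(0, 1, 2, 3)
    val groups = Map("g1" -> Seq("c1", "c2"), "g2" -> Seq("c3"))
    // A message in partition 2 reaches one consumer of g1 and one consumer of g2.
    println(receivers(2, partitions, groups))
  }
}
```

With a single group the topic behaves like a queue whose messages are divided among the group's consumers. With one consumer per group, every consumer sees every message, which gives publish-subscribe behaviour.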
• They coordinate with each other using Zookeeper.
• One broker acts as the leader for a partition and handles the delivery and persistence, whereas the others act as followers.
1. Messages sent by a producer to a topic and a partition are appended in the same order.
2. A consumer instance gets the messages in the same order as they are produced.
3. A topic with replication factor N tolerates up to N-1 server failures.
The remaining replicas are chosen as the followers and act as backups.
The leader propagates the writes to the followers.
The leader waits until the writes are completed on all the replicas.
and writes.
All the data is immediately written to a file in the file system.
Messages are grouped as message sets for more efficient writes.
2. Kafka Connect: A framework to import event streams from other source data systems into Kafka and export event streams from Kafka to destination data systems.
3. Kafka Streams: A Java library to process event streams live as they occur.
o Source code: https://github.com/apache/kafka/tree/trunk/streams
o Kafka Streams Java docs: http://docs.confluent.io/current/streams/javadocs/index.html
The Kafka data model consists of messages and topics.
The Kafka architecture consists of brokers that take messages from the producers and add them to a partition of a topic.
The Kafka architecture supports two types of messaging systems: publish-subscribe and queue.
Brokers are the Kafka processes that process the messages in Kafka.