Big Data Open Source Frameworks Lecture Slides

Big data comes from many sources like social media, e-commerce sites, weather stations, and telecom companies. It is characterized by its large volume, variety, velocity, veracity, and value. Hadoop is an open source framework that stores, processes, and analyzes big data using HDFS for storage and MapReduce for processing. HDFS stores data reliably in large files distributed across commodity hardware. Hadoop was created in 2006 based on earlier work by Google on distributed file systems and processing. It has since been widely adopted by tech companies to handle ever-increasing data volumes.


• What is Big Data

• Data that is very large in size is called Big Data. Normally we work with data on the order of megabytes (Word documents, Excel sheets) or at most gigabytes (movies, code), but data at the petabyte scale, i.e. 10^15 bytes, is called Big Data. It is often stated that almost 90% of today's data has been generated in the past three years.
• Sources of Big Data
• This data comes from many sources, such as:
• Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
• E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts of logs from which users' buying trends can be traced.
• Weather stations: Weather stations and satellites produce very large volumes of data that are stored and processed to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
• Big Data Characteristics
• Big Data refers to data volumes that cannot be handled by traditional data storage or processing systems. It is used by many multinational companies to process data and run the business of many organizations. The total data flow is estimated to exceed 150 exabytes per day before replication.
• There are five V's of Big Data that describe its characteristics.
• 5 V's of Big Data
• Volume
• Veracity
• Variety
• Value
• Velocity
• Volume
• The name Big Data itself relates to enormous size. Big Data refers to the vast volumes of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
• Facebook, for example, generates approximately a billion messages, records the "Like" button about 4.5 billion times, and receives more than 350 million new posts each day. Big data technologies are designed to handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in many forms: PDFs, emails, audio, social media posts, photos, videos, etc.
• The data is categorized as below:
• Structured data: Structured data has a defined schema with all the required columns and is in tabular form. Structured data is stored in relational database management systems.
• Semi-structured data: In semi-structured data, the schema is not rigidly defined, e.g., JSON, XML, CSV, TSV, and email. Unlike structured data, it is not stored in relational tables, although it still carries some organizational markers.
• Unstructured data: Log files, audio files, image files, and other free-form files are unstructured data. Some organizations have a lot of such data available but do not know how to derive value from it, since the data is raw.
• Quasi-structured data: Textual data with inconsistent formats that can be given structure only with effort, time, and specialized tools.
• Example: Web server logs, i.e., log files created and maintained by a server that contain a list of activities.
• Veracity
• Veracity refers to how reliable and trustworthy the data is, and to the ability to filter, translate, and manage such data efficiently. Handling data of uncertain quality is essential for using Big Data in business development.
• For example, Facebook posts with hashtags.
• Value
• Value is an essential characteristic of big data. What matters is not simply the data we process or store, but storing, processing, and analyzing data that is valuable and reliable.
• Velocity
• Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created in real time. It covers the rate at which incoming data sets arrive, their rate of change, and bursts of activity. A primary concern of Big Data is delivering the required data rapidly.
• Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
• What is Hadoop
• Hadoop is an open source framework from Apache used to store, process, and analyze data that is very large in volume. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
• Modules of Hadoop
• HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. Files are broken into blocks and stored on nodes across the distributed architecture.
• YARN: Yet Another Resource Negotiator is used for job scheduling and cluster management.
• MapReduce: A framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
• Hadoop Common: Java libraries used to start Hadoop and used by the other Hadoop modules.
History of Hadoop
Hadoop grew out of work started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.
• Let's trace the history of Hadoop in the following steps:
• In 2002, Doug Cutting and Mike Cafarella started to work on Apache Nutch, an open source web crawler project.
• While working on Apache Nutch, they were dealing with big data, and storing that data was prohibitively expensive for the project. This problem became one of the important reasons for the emergence of Hadoop.
• In 2003, Google introduced a file system known as GFS (Google File System), a proprietary distributed file system developed to provide efficient access to data.
• In 2004, Google released a white paper on MapReduce, a technique that simplifies data processing on large clusters.
• In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System), along with an implementation of MapReduce.
• In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released that year.
• Doug Cutting named the project Hadoop after his son's toy elephant.
• In 2007, Yahoo ran two clusters of 1000 machines.
• In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster in 209 seconds.
• In 2013, Hadoop 2.2 was released.
• In 2017, Hadoop 3.0 was released.
Year — Event
2003 — Google released the Google File System (GFS) paper.
2004 — Google released a white paper on MapReduce.
2006 — Hadoop introduced; Hadoop 0.1.0 released; Yahoo deploys 300 machines and reaches 600 machines within the year.
2007 — Yahoo runs 2 clusters of 1000 machines; Hadoop includes HBase.
2008 — YARN JIRA opened; Hadoop becomes the fastest system to sort 1 terabyte of data, on a 900-node cluster in 209 seconds; Yahoo clusters are loaded with 10 terabytes per day; Cloudera is founded as a Hadoop distributor.
2009 — Yahoo runs 17 clusters with 24,000 machines; Hadoop becomes capable of sorting a petabyte; MapReduce and HDFS become separate subprojects.
2010 — Hadoop adds support for Kerberos; Hadoop operates 4,000 nodes with 40 petabytes; Apache Hive and Pig released.
2011 — Apache ZooKeeper released; Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.
2012 — Apache Hadoop 1.0 released.
2013 — Apache Hadoop 2.2 released.
2014 — Apache Hadoop 2.6 released.
2015 — Apache Hadoop 2.7 released.
2017 — Apache Hadoop 3.0 released.
2018 — Apache Hadoop 3.1 released.

What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability to parallel applications.
It is cost effective because it uses commodity hardware. It involves the concepts of blocks, data nodes, and the name node.
Where to use HDFS
•Very Large Files: Files should be hundreds of megabytes, gigabytes, or larger.
•Streaming Data Access: The time to read the whole data set matters more than the latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
•Commodity Hardware: It works on low-cost hardware.
• Where not to use HDFS
• Low Latency data access: Applications that require very fast access to the first record should not use HDFS, since it prioritizes throughput over the whole data set rather than the time to fetch the first record.
• Lots of Small Files: The name node holds the metadata of files in memory, and if there are many small files, the metadata consumes a large amount of the name node's memory, which is not feasible.
• Multiple Writes: It should not be used when we have to write to a file multiple times.
• HDFS Concepts
• Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a traditional file system, a file in HDFS that is smaller than the block size does not occupy a full block's worth of space; e.g., a 5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
• Name Node: HDFS works in a master-worker pattern, where the name node acts as the master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all files in HDFS; the metadata includes file permissions, names, and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing fast access. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. File system operations like opening, closing, and renaming are executed by it.
• Data Node: Data nodes store and retrieve blocks when told to, by clients or the name node. They report back to the name node periodically with the list of blocks they are storing. The data nodes, being commodity hardware, also perform block creation, deletion, and replication as instructed by the name node.
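For readers who want to see the name node / data node interaction from the client side, here is a minimal sketch of reading an HDFS file with the Java FileSystem API; the file path is an illustrative assumption, not something from the slides.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // the client asks the name node for metadata
        Path file = new Path("/user/test/data.txt");   // hypothetical path
        try (FSDataInputStream in = fs.open(file);     // the actual bytes stream from the data nodes
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}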
[Figure: HDFS DataNode and NameNode architecture]
[Figure: HDFS read path]
[Figure: HDFS write path]
• Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present on the data nodes. To overcome this, the concept of the secondary name node arises.
• Secondary Name Node: A separate physical machine that acts as a helper to the name node. It performs periodic checkpoints: it communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and data loss.
• Starting HDFS
• The HDFS should be formatted initially and then started in the distributed mode.
Commands are given below.
• To Format $ hadoop namenode -format
• To Start $ start-dfs.sh
• HDFS Basic File Operations
• Putting data into HDFS from the local file system
• First create a folder in HDFS where data can be put from the local file system.
• $ hadoop fs -mkdir /user/test
• Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test
• $ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test
• Display the contents of the HDFS folder
• $ hadoop fs -ls /user/test
• Copying data from HDFS to the local file system
• $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt
• Compare the files and see that both are the same
• $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt
• Recursive deleting
• $ hadoop fs -rmr <arg>
• HDFS Features and Goals
• The Hadoop Distributed File System (HDFS) is a distributed file system. It is a core part of Hadoop used for data storage and is designed to run on commodity hardware.
• Unlike many other distributed file systems, HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It can easily handle applications with large data sets.
• Let's see some of the important features and goals of HDFS.
• Features of HDFS
• Highly Scalable - HDFS is highly scalable, as it can scale to hundreds of nodes in a single cluster.
• Replication - Due to unfavorable conditions, a node containing data may be lost. To overcome such problems, HDFS always maintains a copy of the data on a different machine.
• Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data automatically takes over.
• Distributed data storage - This is one of the most important features of HDFS and what makes Hadoop so powerful. Here, data is divided into multiple blocks and stored across nodes.
• Portable - HDFS is designed so that it can easily be ported from one platform to another.
• Goals of HDFS
• Handling hardware failure - HDFS spans multiple server machines. If any machine fails, the goal of HDFS is to recover from it quickly.
• Streaming data access - HDFS applications require streaming access to their data sets, rather than the general-purpose access of an ordinary file system.
• Coherence model - Applications that run on HDFS are expected to follow the write-once, read-many approach. So a file, once created, need not be changed; however, it can be appended to and truncated.
• What is MapReduce?
• MapReduce is a data processing tool used to process data in parallel in a distributed form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data Processing on Large Clusters," published by Google.
• MapReduce is a paradigm with two phases: the mapper phase and the reducer phase. In the mapper, the input is given in the form of key-value pairs. The output of the mapper is fed to the reducer as input. The reducer runs only after the mapper is finished. The reducer also takes input in key-value format, and the output of the reducer is the final output.
• Steps in Map Reduce
• The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys will not necessarily be unique in this case.
• Using the output of the map, sort and shuffle are applied by the Hadoop architecture. Sort and shuffle act on these lists of <key, value> pairs and send out each unique key together with the list of values associated with it, as <key, list(values)>.
• The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final output <key, value> is stored/displayed.
• Sort and Shuffle
• Sort and shuffle occur on the output of the mapper and before the reducer. When the mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input from each mapper <k2, v2>, we collect all the values for each unique key k2. This output from the shuffle phase, in the form of <k2, list(v2)>, is sent as input to the reducer phase. (A hedged word-count sketch of a mapper and reducer follows.)
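A minimal sketch of a word-count mapper and reducer in the Hadoop Java API, tying the <key, value> and <key, list(values)> steps above to code; the class names are illustrative assumptions.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the 1s for each word after sort and shuffle deliver <word, list(1, 1, ...)>.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}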
• Usage of MapReduce
• It can be used in various applications such as document clustering, distributed sorting, and web link-graph reversal.
• It can be used for distributed pattern-based searching.
• We can also use MapReduce in machine learning.
• It was used by Google to regenerate Google's index of the World Wide Web.
• It can be used in multiple computing environments, such as multi-cluster, multi-core, and mobile environments.
Data Flow in MapReduce
MapReduce is used to compute huge amounts of data. To handle the incoming data in a parallel and distributed form, the data flows through several phases.

Phases of MapReduce data flow

Input reader
The input reader reads the incoming data and splits it into data blocks of the appropriate size (64 MB to 128 MB). Each data block is associated with a Map function.
Once the input reader reads the data, it generates the corresponding key-value pairs. The input files reside in HDFS.
• Map function
• The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs. The map input and output types may differ from each other.
• Partition function
• The partition function assigns the output of each Map function to the appropriate reducer. Given a key and value, it returns the index of the target reducer. A sketch of a custom partitioner appears after this list.
• Shuffling and Sorting
• The data is shuffled between and within nodes so that it moves off the mappers and becomes ready for the Reduce function to process. The shuffling of data can sometimes take considerable computation time.
• The sorting operation is performed on the input data for the Reduce function. Here, the data is compared using a comparison function and arranged in sorted order.
• Reduce function
• The Reduce function is invoked for each unique key, and the keys arrive in sorted order. The Reduce function iterates over the values associated with each key and generates the corresponding output.
• Output writer
• Once the data has flowed through all the above phases, the output writer executes. The role of the output writer is to write the Reduce output to stable storage.
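As an illustration of the partition phase described above, here is a minimal sketch of a custom partitioner using Hadoop's Partitioner class; the key/value types and the hashing strategy are illustrative assumptions.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each map output key to a reducer index in [0, numPartitions).
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // The same key always maps to the same reducer, which is what sort/shuffle relies on.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}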
• MapReduce API
• In this section, we focus on MapReduce APIs. Here, we learn about the classes
and methods used in MapReduce programming.
• MapReduce Mapper Class
• In MapReduce, the role of the Mapper class is to map the input key-value pairs to a set of intermediate key-value pairs. It transforms the input records into intermediate records.
• These intermediate records are associated with a given output key and passed to the Reducer for the final output.
• Methods of Mapper Class
void cleanup(Context context) — Called only once, at the end of the task.
void map(KEYIN key, VALUEIN value, Context context) — Called once for each key-value pair in the input split.
void run(Context context) — Can be overridden to control the execution of the Mapper.
void setup(Context context) — Called only once, at the beginning of the task.
• MapReduce Reducer Class
• In MapReduce, the role of the Reducer class is to reduce the set of intermediate values that share a key to a smaller set of values.
• Methods of Reducer Class
void cleanup(Context context) — Called only once, at the end of the task.
void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) — Called once for each key.
void run(Context context) — Can be overridden to control the tasks of the Reducer.
void setup(Context context) — Called only once, at the beginning of the task.
• MapReduce Job Class
• The Job class is used to configure the job and submit it. It also allows controlling the execution and querying the state. Once the job is submitted, calling a set method throws an IllegalStateException.
• Methods of Job Class
Methods — Description
Counters getCounters() — Gets the counters for the job.
long getFinishTime() — Gets the finish time of the job.
Job getInstance() — Generates a new Job without any cluster.
Job getInstance(Configuration conf) — Generates a new Job without any cluster, using the provided configuration.
Job getInstance(Configuration conf, String jobName) — Generates a new Job without any cluster, using the provided configuration and job name.
String getJobFile() — Gets the path of the submitted job configuration.
String getJobName() — Gets the user-specified job name.
void setJobName(String name) — Sets the user-specified job name.
void setMapOutputKeyClass(Class<?> class) — Sets the key class for the map output data.
void setMapOutputValueClass(Class<?> class) — Sets the value class for the map output data.
void setMapperClass(Class<? extends Mapper> class) — Sets the Mapper for the job.
void setNumReduceTasks(int tasks) — Sets the number of reduce tasks for the job.
void setReducerClass(Class<? extends Reducer> class) — Sets the Reducer for the job.
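To tie the Job methods together, here is a minimal sketch of a driver class that configures and submits the word-count job sketched earlier; the class names and input/output paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");    // new Job with the provided configuration and name
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // hypothetical mapper from the earlier sketch
        job.setReducerClass(WordCountReducer.class);      // hypothetical reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);                         // number of reduce tasks is configurable
        FileInputFormat.addInputPath(job, new Path("/user/test/input"));     // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/user/test/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // submit and wait for completion
    }
}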
• What is Spark?
• Apache Spark is an open-source cluster computing framework. Its primary purpose is to handle data generated in real time.
• Spark was built on top of Hadoop MapReduce and is optimized to run in memory, whereas alternative approaches like Hadoop's MapReduce write data to and from disk. As a result, Spark processes data much more quickly than those alternatives.
• History of Apache Spark
• Spark was initiated by Matei Zaharia at UC Berkeley's AMPLab in 2009. It was open sourced in 2010 under a BSD license.
• In 2013, the project was donated to the Apache Software Foundation. In 2014, Spark became a Top-Level Apache Project.
• Features of Apache Spark
• Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
• Easy to Use - Applications can be written in Java, Scala, Python, R, and SQL. It also provides more than 80 high-level operators.
• Generality - It provides a collection of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
• Lightweight - It is a light, unified analytics engine used for large-scale data processing.
• Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
• Uses of Spark
• Data integration: The data generated by different systems is rarely consistent enough to combine for analysis. To obtain consistent data from these systems we use processes like extract, transform, and load (ETL). Spark is used to reduce the cost and time required for ETL.
• Stream processing: It is always difficult to handle real-time generated data such as log files. Spark is able to operate on streams of data and, for example, reject potentially fraudulent operations.
• Machine learning: Machine learning approaches become more feasible and increasingly accurate as the volume of data grows. Because Spark can keep data in memory and run repeated queries quickly, it makes it easy to work with machine learning algorithms.
• Interactive analytics: Spark is able to generate responses rapidly, so instead of running only pre-defined queries, we can explore the data interactively.
• Spark Architecture
• Spark follows a master-slave architecture. Its cluster consists of a single master and multiple slaves.
• The Spark architecture depends upon two abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
• Resilient Distributed Datasets (RDD)
• Resilient Distributed Datasets are groups of data items that can be stored in-memory on worker nodes. Here:
• Resilient: able to restore data on failure.
• Distributed: data is distributed among different nodes.
• Dataset: a group of data.
• We will learn about RDDs in detail later.
Directed Acyclic Graph (DAG)
A Directed Acyclic Graph is a finite directed graph that describes a sequence of computations on data. Each node is an RDD partition, and each edge is a transformation on top of the data. The graph captures the flow of computation, while "directed" and "acyclic" describe how it proceeds: edges have a direction and never form cycles.
Let's understand the Spark architecture.
• Driver Program
• The Driver Program is the process that runs the main() function of the application and creates the SparkContext object. The purpose of the SparkContext is to coordinate the Spark application, which runs as independent sets of processes on a cluster.
• To run on a cluster, the SparkContext connects to one of several types of cluster managers and then performs the following tasks:
• It acquires executors on nodes in the cluster.
• Then, it sends your application code to the executors. Here, the application code can be defined by JAR or Python files passed to the SparkContext.
• Finally, the SparkContext sends tasks to the executors to run.
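A minimal sketch of a driver program creating a SparkContext with Spark's Java API; the application name and master URL are illustrative assumptions.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DriverExample {
    public static void main(String[] args) {
        // The SparkConf describes the application; "local[*]" runs it locally on all cores (illustrative).
        SparkConf conf = new SparkConf()
                .setAppName("driver-example")
                .setMaster("local[*]");
        // The JavaSparkContext is the gateway through which the driver talks to the cluster manager and executors.
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build RDDs and run jobs here ...
        sc.stop();
    }
}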
• Cluster Manager
• The role of the cluster manager is to allocate resources across applications. Spark can run on a large number of cluster types.
• There are various types of cluster managers, such as Hadoop YARN, Apache Mesos, and the Standalone Scheduler.
• The Standalone Scheduler is Spark's own cluster manager, which makes it easy to install Spark on an empty set of machines.
• Worker Node
• The worker node is a slave node.
• Its role is to run the application code in the cluster.
• Executor
• An executor is a process launched for an application on a worker node.
• It runs tasks and keeps data in memory or disk storage across them.
• It reads and writes data to external sources.
• Every application has its own executors.
• Task
• A unit of work that is sent to one executor.
Spark Components
The Spark project consists of different types of tightly integrated components. At its core, Spark
is a computational engine that can schedule, distribute and monitor multiple applications.
Let's understand each Spark component in detail.
• Spark Core
• The Spark Core is the heart of Spark and performs the core functionality.
• It holds the components for task scheduling, fault recovery, interacting with
storage systems and memory management.
• Spark SQL
• Spark SQL is built on top of Spark Core. It provides support for structured data.
• It allows querying data via SQL (Structured Query Language) as well as the Apache Hive variant of SQL, called HQL (Hive Query Language).
• It supports JDBC and ODBC connections that establish a relation between Java objects and existing databases, data warehouses, and business intelligence tools.
• It also supports various sources of data, such as Hive tables, Parquet, and JSON.
• Spark Streaming
• Spark Streaming is a Spark component that supports scalable and fault-tolerant
processing of streaming data.
• It uses Spark Core's fast scheduling capability to perform streaming analytics.
• It accepts data in mini-batches and performs RDD transformations on that data.
• Its design ensures that the applications written for streaming data can be reused to
analyze batches of historical data with little modification.
• The log files generated by web servers can be considered as a real-time example of a
data stream.
• MLlib
• The MLlib is a Machine Learning library that contains various machine learning
algorithms.
• These include correlations and hypothesis testing, classification and regression,
clustering, and principal component analysis.
• It is nine times faster than the disk-based implementation used by Apache Mahout.
• GraphX
• GraphX is a library used to manipulate graphs and perform graph-parallel computations.
• It makes it possible to create a directed graph with arbitrary properties attached to each vertex and edge.
• To manipulate graphs, it supports fundamental operators such as subgraph, joinVertices, and aggregateMessages.
• What is RDD?
• The RDD (Resilient Distributed Dataset) is Spark's core abstraction. It is a collection of elements partitioned across the nodes of the cluster, so that we can execute various parallel operations on it.
• There are two ways to create RDDs (both are sketched below):
• Parallelizing an existing collection in the driver program
• Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
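A minimal sketch of both creation paths with Spark's Java API; the collection contents and the HDFS path are illustrative assumptions.

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCreationExample {
    public static void run(JavaSparkContext sc) {
        // 1. Parallelize an existing collection in the driver program.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        // 2. Reference a dataset in external storage (illustrative HDFS path).
        JavaRDD<String> lines = sc.textFile("hdfs:///user/test/data.txt");
        System.out.println(numbers.count());   // actions like count() trigger the actual computation
        System.out.println(lines.count());
    }
}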
Spark & its Features
Apache Spark is an open source cluster computing framework for real-time data processing. The
main feature of Apache Spark is its in-memory cluster computing that increases the processing
speed of an application. Spark provides an interface for programming entire clusters with
implicit data parallelism and fault tolerance. It is designed to cover a wide range of workloads
such as batch applications, iterative algorithms, interactive queries, and streaming.
Features of Apache Spark:
• Speed
Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. It is
also able to achieve this speed through controlled partitioning.
• Powerful Caching
Simple programming layer provides powerful caching and disk persistence capabilities.
• Deployment
It can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster manager.
• Real-Time
It offers Real-time computation & low latency because of in-memory computation.
•  Polyglot
Spark provides high-level APIs in Java, Scala, Python, and R. Spark code can be written in any of
these four languages. It also provides a shell in Scala and Python.
• Spark Architecture Overview
• Apache Spark has a well-defined layered architecture where all the spark components and layers
are loosely coupled. This architecture is further integrated with various extensions and
libraries. Apache Spark Architecture is based on two main abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
But before diving any deeper into the Spark architecture, let me explain a few fundamental concepts of Spark, such as the Spark ecosystem and RDDs. This will help you gain better insight.
Let me first explain what the Spark ecosystem is.
Spark Eco-System
As you can see from the below image, the spark ecosystem is composed of various components
like Spark SQL, Spark Streaming, MLlib, GraphX, and the Core API component.
• Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing. Additional libraries built on top of the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for memory management and fault recovery, scheduling, distributing, and monitoring jobs on a cluster, and interacting with storage systems.
• Spark Streaming
Spark Streaming is the component of Spark which is used to process real-time streaming data.
Thus, it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant
stream processing of live data streams.
• Spark SQL
Spark SQL is a new module in Spark which integrates relational processing with Spark’s
functional programming API. It supports querying data either via SQL or via the Hive Query
Language. For those of you familiar with RDBMS, Spark SQL will be an easy transition from
your earlier tools where you can extend the boundaries of traditional relational data processing.
• GraphX
GraphX is the Spark API for graphs and graph-parallel computation. Thus, it extends the Spark
RDD with a Resilient Distributed Property Graph. At a high-level, GraphX extends the Spark
RDD abstraction by introducing the Resilient Distributed Property Graph (a directed multigraph
with properties attached to each vertex and edge).
• MLlib (Machine Learning)
MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning
in Apache Spark.
• SparkR
It is an R package that provides a distributed data frame implementation. It supports operations like selection, filtering, and aggregation, but on large datasets.
• As you can see, Spark comes packed with high-level libraries, including support for R, SQL, Python, Scala, Java, etc. These standard libraries enable seamless integration into complex workflows. On top of this, various services like MLlib, GraphX, SQL + DataFrames, and streaming services integrate with it to increase its capabilities.
• Now, let’s discuss the fundamental Data Structure of Spark, i.e. RDD.
• Resilient Distributed Dataset(RDD)
• RDDs are the building blocks of any Spark application. RDD stands for:
• Resilient: fault tolerant and capable of rebuilding data on failure
• Distributed: data is distributed among the multiple nodes in a cluster
• Dataset: a collection of partitioned data with values
• An RDD is a layer of abstraction over a distributed collection of data. It is immutable in nature and follows lazy transformations.
• Now you might be wondering how it works. The data in an RDD is split into chunks based on a key. RDDs are highly resilient, i.e., they are able to recover quickly from problems because the same data chunks are replicated across multiple executor nodes. Thus, even if one executor node fails, another will still process the data. This allows you to perform functional computations against your dataset very quickly by harnessing the power of multiple nodes.
• Moreover, once you create an RDD it becomes immutable. By immutable I mean an object whose state cannot be modified after it is created, but which can be transformed into new RDDs.
• In a distributed environment, each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Due to this, you can perform transformations or actions on the complete data in parallel. Also, you don't have to worry about the distribution, because Spark takes care of that.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, etc.
With RDDs, you can perform two types of operations (a short sketch of both follows):
1.Transformations: operations that are applied to create a new RDD.
2.Actions: operations applied on an RDD that instruct Apache Spark to apply the computation and pass the result back to the driver.
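A minimal sketch of the two operation types with Spark's Java API; the file path and the filtering predicate are illustrative assumptions.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddOperationsExample {
    public static void run(JavaSparkContext sc) {
        JavaRDD<String> lines = sc.textFile("hdfs:///user/test/logs.txt");       // illustrative path
        // Transformations are lazy: they only describe new RDDs.
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));
        JavaRDD<Integer> lengths = errors.map(String::length);
        // Actions trigger execution and return results to the driver.
        long errorCount = errors.count();
        int totalLength = lengths.reduce(Integer::sum);
        System.out.println(errorCount + " error lines, " + totalLength + " characters in total");
    }
}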
I hope you got a thorough understanding of RDD concepts. Now let’s move further and see the
working of Spark Architecture.
• Working of Spark Architecture
• As you have already seen the basic architectural overview of Apache Spark, now let’s dive deeper into
its working.
• In your master node, you have the driver program, which drives your application. The code you are
writing behaves as a driver program or if you are using the interactive shell, the shell acts as the driver
program.
• Inside the driver program, the first thing you do is, you create a Spark Context. Assume that the Spark
context is a gateway to all the Spark functionalities. It is similar to your database connection. Any
command you execute in your database goes through the database connection. Likewise, anything you
do on Spark goes through Spark context.
• Now, this Spark context works with the cluster manager to manage various jobs. The driver program &
Spark context takes care of the job execution within the cluster. A job is split into multiple tasks
which are distributed over the worker node. Anytime an RDD is created in Spark context, it can be
distributed across various nodes and can be cached there.
• Worker nodes are the slave nodes whose job is basically to execute the tasks. The tasks are executed on the partitioned RDDs on the worker nodes, which then return the results to the Spark Context.
• The Spark Context takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs, perform operations, collect the results, and return them to the main Spark Context.
• If you increase the number of workers, you can divide jobs into more partitions and execute them in parallel over multiple systems, which is a lot faster.
• With more workers, the total memory also increases, and you can cache data so that jobs execute faster.
• To know about the workflow of Spark Architecture, you can have a look at
the infographic below:
• STEP 1: The client submits the Spark user application code. When the application code is submitted, the driver implicitly converts the user code containing transformations and actions into a logical directed acyclic graph (DAG). At this stage, it also performs optimizations such as pipelining transformations.
• STEP 2: After that, it converts the logical graph (DAG) into a physical execution plan with many stages. After converting into the physical execution plan, it creates physical execution units called tasks under each stage. The tasks are then bundled and sent to the cluster.
• STEP 3: Now the driver talks to the cluster manager and negotiates resources. The cluster manager launches executors on worker nodes on behalf of the driver. At this point, the driver sends tasks to the executors based on data placement. When executors start, they register themselves with the driver, so the driver has a complete view of the executors that are executing its tasks.
• STEP 4: During the course of task execution, the driver program monitors the set of executors that are running. The driver node also schedules future tasks based on data placement.
• This was all about the Spark architecture. Now, let's get hands-on with the Spark shell.
• Example using Scala in the Spark shell
• First, start the Spark shell, assuming that the Hadoop and Spark daemons are up and running. The Spark web UI is available at localhost:4040.
Apache Storm
• Storm was originally created by Nathan Marz and his team at BackType, a social analytics company. Later, Storm was acquired and open-sourced by Twitter. In a short time, Apache Storm became a standard for distributed real-time processing systems, allowing you to process large amounts of data, similar to Hadoop. Apache Storm is written in Java and Clojure, and it continues to be a leader in real-time analytics.
• What is Apache Storm?
• Apache Storm is a distributed real-time big-data processing system. Storm is designed to process vast amounts of data in a fault-tolerant and horizontally scalable manner. It is a streaming data framework that supports very high ingestion rates. Though Storm is stateless, it manages the distributed environment and cluster state via Apache ZooKeeper. It is simple, and you can execute all kinds of manipulations on real-time data in parallel.
• Apache Storm continues to be a leader in real-time data analytics. Storm is easy to set up and operate, and it guarantees that every message will be processed through the topology at least once.
• Apache Storm vs Hadoop
• Basically, the Hadoop and Storm frameworks are both used for analyzing big data. They complement each other but differ in some aspects. Apache Storm does all its operations except persistence, while Hadoop is good at everything but lags in real-time computation. The following table compares the attributes of Storm and Hadoop.

Storm — Hadoop
Real-time stream processing — Batch processing
Stateless — Stateful
Master/slave architecture with ZooKeeper-based coordination; the master node is called nimbus and the slaves are supervisors — Master/slave architecture with or without ZooKeeper-based coordination; the master node is the JobTracker and the slave node is the TaskTracker
A Storm streaming process can access tens of thousands of messages per second on a cluster — The Hadoop Distributed File System (HDFS) uses the MapReduce framework to process vast amounts of data, taking minutes or hours
A Storm topology runs until shut down by the user or an unexpected unrecoverable failure — MapReduce jobs are executed in sequential order and eventually complete
Both are distributed and fault-tolerant
If nimbus / supervisor dies, restarting makes it continue from where it stopped, so nothing is affected — If the JobTracker dies, all the running jobs are lost
• Why stateless is often the choice for cloud architects: stateful services keep track of sessions or transactions and react differently to the same inputs based on that history, whereas stateless services rely on clients to maintain sessions and center around operations that manipulate resources rather than state.
• A stateless process or application can be understood in isolation. There is no stored knowledge of, or reference to, past transactions; each transaction is handled as if from scratch, for the first time.
• Stateful services are either databases or based on a protocol that needs tight state handling on a single host. The server processes requests based on the information relayed with each request and information stored from earlier requests.
• Stateless is better than Stateful
• A stateful protocol design makes the server complex and heavy. Stateless protocols cope better with crashes: because there is no state that must be restored, a failed server can simply restart after a crash.
• A stateless app is an application program that does not save client data generated in one session for use in the next session with that client. In contrast, a stateful application saves data about each client session and uses that data the next time the client makes a request.
• Use-Cases of Apache Storm
• Apache Storm is very well known for real-time big data stream processing. For this reason, many companies use Storm as an integral part of their systems. Some notable examples are as follows:
• Twitter − Twitter uses Apache Storm for its range of "Publisher Analytics products", which process every tweet and click on the Twitter platform. Apache Storm is deeply integrated with Twitter's infrastructure.
• NaviSite − NaviSite uses Storm for its event log monitoring/auditing system. Every log generated in the system goes through Storm. Storm checks the message against the configured set of regular expressions, and if there is a match, that particular message is saved to the database.
• Wego − Wego is a travel metasearch engine located in Singapore. Travel-related data comes from many sources all over the world with different timing. Storm helps Wego search real-time data, resolve concurrency issues, and find the best match for the end user.
• Apache Storm Benefits
• Here is a list of the benefits that Apache Storm offers:
• Storm is open source, robust, and user friendly. It can be utilized in small companies as well as large corporations.
• Storm is fault tolerant, flexible, reliable, and supports any programming language.
• It allows real-time stream processing.
• Storm is unbelievably fast because of its enormous data-processing power.
• Storm can keep up its performance under increasing load by adding resources linearly. It is highly scalable.
• Storm performs data refresh and end-to-end delivery in seconds or minutes, depending on the problem. It has very low latency.
• Storm has operational intelligence.
• Storm provides guaranteed data processing even if any of the connected nodes in the cluster die or messages are lost.
Components — Description
Tuple — The tuple is the main data structure in Storm. It is a list of ordered elements. By default, a tuple supports all data types. Generally, it is modelled as a set of comma-separated values and passed to a Storm cluster.
Stream — A stream is an unordered sequence of tuples.
Spouts — The source of a stream. Generally, Storm accepts input data from raw data sources like the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc. Otherwise, you can write spouts to read data from other data sources. "ISpout" is the core interface for implementing spouts. Some of the specific interfaces are IRichSpout, BaseRichSpout, KafkaSpout, etc.
Bolts — Bolts are logical processing units. Spouts pass data to bolts, and bolts process it and produce a new output stream. Bolts can perform filtering, aggregation, joining, and interaction with data sources and databases. A bolt receives data and emits it to one or more bolts. "IBolt" is the core interface for implementing bolts. Some of the common interfaces are IRichBolt, IBasicBolt, etc. (A minimal bolt sketch follows below.)
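A minimal sketch of a bolt that splits each incoming sentence into words, assuming the Storm 2.x org.apache.storm Java packages; the field names are illustrative assumptions.

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitSentenceBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;                               // keep the collector used to emit tuples
    }

    @Override
    public void execute(Tuple input) {
        String sentence = input.getStringByField("sentence");     // "sentence" is an assumed field name
        for (String word : sentence.split("\\s+")) {
            collector.emit(new Values(word));                     // emit one tuple per word
        }
        collector.ack(input);                                     // acknowledge so Storm knows the tuple was processed
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));                     // downstream bolts can group on this field
    }
}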
• Apache Storm - Core Concepts
• Apache Storm reads a raw stream of real-time data at one end, passes it through a sequence of small processing units, and outputs the processed/useful information at the other end.
• The following diagram depicts the core concepts of Apache Storm.
Let's take a real-time example of "Twitter analysis" and see how it can be modelled in Apache Storm. The following diagram depicts the structure.
• The input for the "Twitter analysis" comes from the Twitter Streaming API. A spout reads the tweets of users using the Twitter Streaming API and outputs them as a stream of tuples. A single tuple from the spout will have a Twitter username and a single tweet as comma-separated values. This stream of tuples is then forwarded to a bolt, and the bolt splits the tweet into individual words, calculates the word count, and persists the information to a configured datasource. Now we can easily get the result by querying the datasource.
• Topology
• Spouts and bolts are connected together to form a topology. Real-time application logic is specified inside a Storm topology. In simple words, a topology is a directed graph where the vertices are computations and the edges are streams of data.
• A simple topology starts with spouts. A spout emits data to one or more bolts. A bolt represents a node in the topology with the smallest processing logic, and the output of a bolt can be emitted into another bolt as input. A minimal sketch of wiring such a topology follows.
• Storm keeps the topology running until you kill it. Apache Storm's main job is to run topologies, and it will run any number of topologies at a given time.
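A minimal sketch of wiring and submitting a topology of the kind described above, assuming the Storm 2.x org.apache.storm Java packages; the spout/bolt classes (TweetSpout, SplitSentenceBolt, WordCountBolt) and the topology name are illustrative assumptions.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout emits tuples read from an external source (hypothetical class).
        builder.setSpout("tweets", new TweetSpout(), 1);
        // Bolt splits tweets into words; shuffle grouping spreads tuples randomly across its tasks.
        builder.setBolt("split", new SplitSentenceBolt(), 2).shuffleGrouping("tweets");
        // Bolt counts words; fields grouping sends the same word to the same task (hypothetical class).
        builder.setBolt("count", new WordCountBolt(), 2).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}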
• Tasks
• Now you have a basic idea of spouts and bolts. They are the smallest logical units of a topology, and a topology is built using a single spout and an array of bolts. They should be executed in a particular order for the topology to run successfully. The execution of each spout and bolt by Storm is called a "task". In simple words, a task is the execution of a spout or a bolt. At a given time, each spout and bolt can have multiple instances running in multiple separate threads.
• Workers
• A topology runs in a distributed manner, on multiple worker nodes. Storm spreads the tasks evenly over all the worker nodes. A worker node's role is to listen for jobs and start or stop processes whenever a new job arrives.
• Stream Grouping
• A stream of data flows from spouts to bolts or from one bolt to another bolt. Stream grouping controls how the tuples are routed in the topology and helps us understand the flow of tuples through it. There are four built-in groupings, explained below (with a sketch of how each is declared after this list).
Shuffle Grouping
In shuffle grouping, an equal number of tuples is distributed randomly across all of the workers executing the bolt.
Fields Grouping
Tuples with the same values in the specified fields are grouped together, separate from the remaining tuples. Tuples with the same field values are then sent to the same worker executing the bolt. For example, if the stream is grouped by the field "word", then all tuples with the same string "Hello" move to the same worker.
Global Grouping
All the streams can be grouped and forwarded to one bolt. This grouping sends tuples generated by all instances of the source to a single target instance (specifically, the worker with the lowest ID).
All Grouping
All grouping sends a single copy of each tuple to all instances of the receiving bolt. This kind of grouping is used to send signals to bolts, and it is useful for join operations.
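A brief sketch of how each built-in grouping is declared on a bolt, again assuming the org.apache.storm Java packages; the component ids and the ReceiverBolt class are illustrative assumptions.

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingExamples {
    public static TopologyBuilder declareGroupings() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("tweets", new TweetSpout());                                        // hypothetical spout
        builder.setBolt("split", new SplitSentenceBolt()).shuffleGrouping("tweets");
        // Four ways to receive the "split" stream (ReceiverBolt is hypothetical):
        builder.setBolt("shuffled", new ReceiverBolt()).shuffleGrouping("split");            // random, even spread
        builder.setBolt("byWord", new ReceiverBolt()).fieldsGrouping("split", new Fields("word")); // same word -> same task
        builder.setBolt("global", new ReceiverBolt()).globalGrouping("split");               // every tuple to one task
        builder.setBolt("all", new ReceiverBolt()).allGrouping("split");                     // each tuple copied to all tasks
        return builder;
    }
}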
• Apache Storm - Cluster Architecture
One of the main highlights of Apache Storm is that it is a fault-tolerant, fast distributed application with no "single point of failure" (SPOF). We can install Apache Storm on as many systems as needed to increase the capacity of the application.
• Let's have a look at how an Apache Storm cluster is designed and its internal architecture. The following diagram depicts the cluster design.
• Apache Storm has two types of nodes, Nimbus (master node) and Supervisor (worker node). Nimbus is the central component of Apache Storm. The main job of Nimbus is to run the Storm topology. Nimbus analyzes the topology and gathers the tasks to be executed. Then it distributes the tasks to the available supervisors.
• A supervisor has one or more worker processes and delegates tasks to them. Each worker process spawns as many executors as needed and runs the tasks. Apache Storm uses an internal distributed messaging system for the communication between Nimbus and the supervisors.
Components — Description
Nimbus — Nimbus is the master node of a Storm cluster. All other nodes in the cluster are called worker nodes. The master node is responsible for distributing data among the worker nodes, assigning tasks to them, and monitoring failures.
Supervisor — The nodes that follow the instructions given by Nimbus are called supervisors. A supervisor has multiple worker processes and governs them to complete the tasks assigned by Nimbus.
Worker process — A worker process executes tasks related to a specific topology. A worker process does not run tasks by itself; instead it creates executors and asks them to perform particular tasks. A worker process has multiple executors.
Executor — An executor is a single thread spawned by a worker process. An executor runs one or more tasks, but only for a specific spout or bolt.
Task — A task performs the actual data processing. So it is either a spout or a bolt.
ZooKeeper framework — Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate among themselves and maintain shared data with robust synchronization techniques. Nimbus is stateless, so it depends on ZooKeeper to monitor the status of the working nodes. ZooKeeper helps the supervisors interact with Nimbus and is responsible for maintaining the state of Nimbus and the supervisors.
• Storm is stateless in nature. Even though statelessness has its own disadvantages, it actually helps Storm process real-time data in the best possible and quickest way.
• Storm is not entirely stateless, though. It stores its state in Apache ZooKeeper. Since the state is available in Apache ZooKeeper, a failed Nimbus can be restarted and made to work from where it left off. Usually, service monitoring tools like monit monitor Nimbus and restart it if there is any failure.
• Apache Storm also has an advanced topology called Trident Topology with state maintenance; it also provides a high-level API like Pig. We will discuss all these features in the coming chapters.
• Apache Storm - Workflow
• A working Storm cluster should have one Nimbus and one or more supervisors. Another important node is Apache ZooKeeper, which is used for coordination between Nimbus and the supervisors.
• Let us now take a close look at the workflow of Apache Storm:
• Initially, Nimbus waits for a "Storm topology" to be submitted to it.
• Once a topology is submitted, it processes the topology and gathers all the tasks to be carried out and the order in which they are to be executed.
• Then, Nimbus evenly distributes the tasks to all the available supervisors.
• At particular time intervals, all supervisors send heartbeats to Nimbus to inform it that they are still alive.
• When a supervisor dies and doesn't send a heartbeat to Nimbus, Nimbus assigns its tasks to another supervisor.
• When Nimbus itself dies, the supervisors keep working on their already assigned tasks without any issue.
• Once all the tasks are completed, a supervisor waits for new tasks to come in.
• In the meantime, the dead Nimbus is restarted automatically by service monitoring tools.
• The restarted Nimbus continues from where it stopped. Similarly, a dead supervisor can also be restarted automatically. Since both Nimbus and the supervisors can be restarted automatically and both continue as before, Storm is guaranteed to process every task at least once.
• Once all the topologies are processed, Nimbus waits for a new topology to arrive, and similarly the supervisors wait for new tasks.
• By default, there are two modes in a Storm cluster −
• Local mode − This mode is used for development, testing, and debugging because it is the easiest way to see all the topology components working together. In this mode, we can adjust parameters that let us see how our topology runs in different Storm configuration environments. In local mode, Storm topologies run on the local machine in a single JVM.
• Production mode − In this mode, we submit our topology to a working Storm cluster, which is composed of many processes, usually running on different machines. As discussed in the workflow of Storm, a working cluster runs indefinitely until it is shut down.
• Storm - Distributed Messaging System
• Apache Storm processes real-time data, and the input normally comes from a message queuing system. An external distributed messaging system provides the input necessary for the real-time computation. A spout reads the data from the messaging system, converts it into tuples, and feeds them into Apache Storm. Interestingly, Apache Storm uses its own distributed messaging system internally for the communication between Nimbus and the supervisors.
• What is a Distributed Messaging System?
• Distributed messaging is based on the concept of reliable message queuing. Messages are queued asynchronously between client applications and messaging systems. A distributed messaging system provides the benefits of reliability, scalability, and persistence.
• Most messaging patterns follow the publish-subscribe model (Pub-Sub for short), where the senders of the messages are called publishers and those who want to receive the messages are called subscribers.
• Once a message has been published by the sender, subscribers can receive selected messages with the help of filtering options. Usually there are two types of filtering: topic-based filtering and content-based filtering.
• Note that the pub-sub model can communicate only via messages. It is a very loosely coupled architecture; even the senders don't know who their subscribers are. Many messaging patterns use a message broker to exchange published messages for timely access by many subscribers. A real-life example is Dish TV, which publishes different channels like sports, movies, music, etc., and anyone can subscribe to their own set of channels and get them whenever their subscribed channels are available.
Distributed messaging system — Description
Apache Kafka — Kafka was developed at LinkedIn and later became a sub-project of Apache. Apache Kafka is based on a broker-enabled, persistent, distributed publish-subscribe model. Kafka is fast, scalable, and highly efficient.
RabbitMQ — RabbitMQ is an open source, robust distributed messaging application. It is easy to use and runs on all platforms.
JMS (Java Message Service) — JMS is an open source API that supports creating, reading, and sending messages from one application to another. It provides guaranteed message delivery and follows the publish-subscribe model.
ActiveMQ — The ActiveMQ messaging system is an open source implementation of the JMS API.
ZeroMQ — ZeroMQ is broker-less, peer-to-peer message processing. It provides push-pull and router-dealer message patterns.
Kestrel — Kestrel is a fast, reliable, and simple distributed message queue.

• Thrift Protocol
• Thrift was built at Facebook for cross-language services development and remote procedure calls (RPC). Later, it became an open source Apache project. Apache Thrift is an Interface Definition Language that lets you define new data types and implement services on top of those data types in a simple manner.
• Apache Thrift is also a communication framework that supports embedded systems, mobile applications, web applications, and many programming languages. Some of its key features are modularity, flexibility, and high performance. In addition, it can perform streaming, messaging, and RPC in distributed applications.
• Storm uses the Thrift protocol extensively for its internal communication and data definitions. A Storm topology is simply a set of Thrift structs, and Nimbus, which runs topologies in Apache Storm, is a Thrift service.
• What is Apache Kafka
• Apache Kafka is a software platform based on a distributed streaming process. It is a publish-subscribe messaging system that lets applications, servers, and processes exchange data. Apache Kafka was originally developed at LinkedIn and later donated to the Apache Software Foundation, where it is now maintained, with Confluent as a major contributor. Apache Kafka addresses the long-standing problem of slow, unreliable data communication between senders and receivers.
• What is a messaging system
• A messaging system is simply the exchange of messages between two or more parties, devices, etc. A publish-subscribe messaging system allows a sender to send/write a message and a receiver to read that message. In Apache Kafka, the sender is known as a producer, who publishes messages, and the receiver is known as a consumer, who consumes messages by subscribing to the topic.
• What is Streaming process
• A streaming process is the processing of data across many parallel, connected systems. It allows different applications to process data in parallel, where one record is handled without waiting for the output of the previous record. A distributed streaming platform therefore simplifies stream processing and parallel execution. A streaming platform such as Kafka has the following key capabilities:
• It processes streams of records as soon as they occur.
• It works like an enterprise messaging system, publishing and subscribing to streams of records.
• It stores streams of records in a fault-tolerant, durable way.
To learn and understand Apache Kafka, aspirants should know the following four core APIs (a hedged producer/consumer sketch follows this list):
• Producer API: This API allows an application to publish a stream of records to one or more topics (discussed in a later section).
• Consumer API: This API allows an application to subscribe to one or more topics and process the stream of records produced to them.
• Streams API: This API allows an application to transform input streams into output streams. It permits an application to act as a stream processor that consumes an input stream from one or more topics and produces an output stream to one or more output topics.
• Connector API: This API builds reusable producers and consumers that connect Kafka topics to existing data systems or applications.
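As a rough sketch of the Producer and Consumer APIs only (not the Streams or Connector APIs), the following Java snippet publishes one record and then reads it back. The broker address localhost:9092, the topic demo-topic, and the group id demo-group are assumptions for illustration, and a single poll() may return nothing on a slow cluster.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ProducerConsumerSketch {
    public static void main(String[] args) {
        // Producer API: publish a record to a topic (topic name is an assumption).
        Properties prodProps = new Properties();
        prodProps.put("bootstrap.servers", "localhost:9092");
        prodProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prodProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prodProps)) {
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello kafka"));
        }

        // Consumer API: subscribe to the same topic and poll once for records.
        Properties consProps = new Properties();
        consProps.put("bootstrap.servers", "localhost:9092");
        consProps.put("group.id", "demo-group");
        consProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consProps.put("auto.offset.reset", "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consProps)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```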
• Why Apache Kafka
• Apache Kafka is a software platform, and the following points best describe the need for it.
• Apache Kafka is capable of handling millions of messages per second.
• Apache Kafka works as a mediator between the source system and the target system: the source system (producer) sends its data to Apache Kafka, which decouples the two sides, and the target system (consumer) consumes the data from Kafka.
• Apache Kafka offers extremely high performance, with latency typically below 10 ms.
• Apache Kafka has a resilient architecture that handles many of the usual complications of data sharing.
• Organizations such as NETFLIX, UBER, Walmart, etc. and over thousands of such firms make
use of Apache Kafka.
• Apache Kafka is fault-tolerant. For example, a consumer may receive a message delivered by the producer but fail to process it because of a backend database failure or a bug in the consumer code. Because Kafka retains the data, the consumer can read the message again and reprocess it, so such failures do not cause permanent loss.
• Kafka skills are in high demand, so learning Kafka is also a good career investment for those who wish to grow their income in the IT sector.
• Kafka Topics
• In the previous section, we gave a brief introduction to Apache Kafka, messaging systems, and the streaming process. Here, we will discuss the basic concepts and the role of Kafka.
• Topics
• Generally, a topic refers to a particular heading or a name given to some specific inter-related ideas. In Kafka, the word topic refers to a category or a common name used to store and publish a particular stream of data. Topics in Kafka are similar to tables in a database, but without all the constraints. We can create as many topics as we want; each is identified by a name chosen by the user. A producer publishes data to topics, and a consumer reads that data from a topic by subscribing to it.
• Partitions
• A topic is split into several parts known as the partitions of the topic. The partitions are ordered, and the topic's data is stored within them. Therefore, while creating a topic, we need to specify the number of partitions (the number is arbitrary and can be changed later). Each message stored in a partition gets an incremental id known as its offset. The ordering given by offsets is guaranteed within a partition only, not across partitions, and offsets within a partition grow without bound.
• Note: Data once written to a partition can never be changed; it is immutable. Offset values only ever increase and are never reused. Also, data is kept in a partition only for a limited retention period.
• Let's see an example to understand a topic with its partitions.
Suppose a topic contains three partitions, 0, 1, and 2, and each partition has its own offset numbers. The data is distributed among the partitions: the record at offset 1 of Partition 0 has no relation to the record at offset 1 of Partition 1, but the records at offsets 1 and 2 of Partition 0 belong to the same ordered sequence. A hedged sketch showing how such a topic could be created programmatically follows.
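As a rough illustration (not part of the original slides), the Kafka AdminClient can create a topic with a chosen partition count. The topic name topic-x, the three partitions, the replication factor of 2, and the broker address are assumptions; a replication factor of 2 requires a cluster with at least two brokers.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic with 3 partitions and a replication factor of 2
            // (names and values are illustrative only).
            NewTopic topic = new NewTopic("topic-x", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```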
Brokers
This is where Kafka brokers come in.
A Kafka cluster consists of one or more servers known as brokers or Kafka brokers. A broker is a container that holds several topics with their multiple partitions. Brokers in a cluster are identified by an integer id. Kafka brokers are also known as bootstrap brokers, because a connection to any one broker is enough to connect to the entire cluster. Although a single broker does not contain all of the data, each broker in the cluster knows about all the other brokers, partitions, and topics.
(Figure: a broker containing a topic with n partitions.)
Example: Brokers and Topics
Suppose a Kafka cluster consists of three brokers, namely Broker 1, Broker 2, and Broker 3.
Each broker holds a topic, namely Topic-x, with three partitions 0, 1, and 2. Remember, the partitions of a topic do not all belong to one broker; they are distributed across the brokers (depending on how many there are). Broker 1 and Broker 2 also contain another topic, Topic-y, with two partitions 0 and 1, so Broker 3 does not hold any data from Topic-y. It also follows that there is no fixed relationship between broker numbers and partition numbers. A hedged sketch that lists the brokers of a cluster follows.
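To make the bootstrap-broker idea concrete, here is a hedged AdminClient sketch that connects through one assumed broker address and lists every broker in the cluster; it is an illustration, not part of the original material.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.Node;

import java.util.Collection;
import java.util.Properties;

public class ListBrokersSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Connecting through any one broker (a "bootstrap broker") is enough to
        // discover the whole cluster; the address below is an assumption.
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            Collection<Node> brokers = admin.describeCluster().nodes().get();
            for (Node broker : brokers) {
                System.out.printf("broker id=%d host=%s port=%d%n",
                        broker.id(), broker.host(), broker.port());
            }
        }
    }
}
```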
• Kafka Topic Replication
• Apache Kafka is a distributed software system in the Big Data world, so it needs copies of the data it stores. In Kafka, each broker holds some of the data, but what if a broker or its machine fails? That data would be lost. As a precaution, Apache Kafka provides replication to protect against data loss when a broker fails. To do so, a replication factor is set for the topics held by the brokers. The replication factor is the number of copies of the data kept across multiple brokers; it should always be greater than 1 (typically 2 or 3). This ensures that a replica of the data is stored on another broker, from where it can still be accessed.
• For example, suppose we have a cluster containing three brokers, say Broker 1, Broker 2, and Broker 3. A topic, namely Topic-x, is split into Partition 0 and Partition 1 with a replication factor of 2.
Thus, Partition 0 of Topic-x has its replicas on Broker 1 and Broker 2, and Partition 1 of Topic-x has its replicas on Broker 2 and Broker 3.
When both the actual data and its replicas exist, it could be unclear which broker should serve a client request. To remove this ambiguity, Kafka does the following:
• It chooses one broker's copy of each partition as the leader, and the rest become its followers.
• The followers (brokers) are allowed to synchronize the data, but while a leader is present, none of the followers serves client requests. These replicas are known as ISRs (in-sync replicas), and Apache Kafka maintains multiple ISRs for each partition.
• Therefore, only the leader serves client requests. The leader handles all read and write operations for its partitions. The leader and its followers are determined with the help of ZooKeeper (discussed later).
• If the broker holding the leader for a partition fails, one of its in-sync replicas takes over the leadership. Afterward, if the previous leader returns, it tries to reacquire its leadership.
• Let's see an example to understand the concept of a leader and its followers.
• Suppose a cluster has three brokers, 1, 2, and 3, and a topic x with two partitions and a replication factor of 2.
Partition 0 on Broker 1 is given the leadership, so it is the leader, and Partition 0 on Broker 2 is its replica (ISR). Similarly, Partition 1 on Broker 2 is the leader and Partition 1 on Broker 3 is its replica (ISR). If Broker 1 fails, Broker 2, which holds the replica of Partition 0, becomes the new leader for that partition. A hedged sketch for inspecting leaders and ISRs follows.
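For illustration only, the AdminClient can also report which broker is the leader and which replicas are in sync for each partition. The topic name topic-x and the broker address are assumptions, and the result accessor all() reflects older (2.x-style) clients; newer client versions expose allTopicNames() instead.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class LeaderIsrSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed bootstrap broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("topic-x"))
                                         .all().get()
                                         .get("topic-x");
            for (TopicPartitionInfo p : desc.partitions()) {
                // leader() is the broker currently serving reads/writes for this partition;
                // isr() lists the in-sync replicas that could take over on failure.
                System.out.printf("partition=%d leader=%d isr=%s%n",
                        p.partition(), p.leader().id(), p.isr());
            }
        }
    }
}
```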
Kafka Producers
A producer is the component that publishes or writes data to the topics within different partitions. Producers automatically know which broker and partition each record should be written to; the user does not have to specify them.
• How does the producer write data to the cluster?
• A producer uses the following strategies to write data to the cluster:
• Message Keys
• Acknowledgment
• Message Keys
• Apache Kafka uses the concept of a message key to send messages in a specific order. The key gives the producer two choices: either send data to every partition (automatically) or send data to one specific partition only. Sending data to a specific partition is possible with message keys. If the producer applies a key to the data, all records with that key will always be sent to the same partition. But if the producer writes the data without a key, the records are sent in a round-robin manner. This behaviour is called load balancing: when the producer writes to a Kafka topic without specifying a key, Kafka spreads the data a little at a time across all the partitions.
• Therefore, a message key can be a string, number, or anything as we wish.
• There are two ways to know that the data is sent with or without a key:
• If the value of key=NULL, it means that the data is sent without a key. Thus, it will
be distributed in a round-robin manner (i.e., distributed to each partition).
• If the value of the key != NULL, it means a key is attached to the data, and all messages with the same key will always be delivered to the same partition.
• Let's see an example
• Consider a scenario where a producer writes data to the Kafka cluster without specifying a key. The data gets distributed among the partitions of Topic-T across the brokers, i.e., Broker 1, Broker 2, and Broker 3.
Now consider another scenario where the producer specifies a key such as Prod_id. Data for Prod_id_1 (say) will always land in Partition 0 under Broker 1, and data for Prod_id_2 will always land in Partition 1 under Broker 2. Thus, after applying the key, the data is no longer spread across every partition (as seen in the previous scenario). A hedged producer sketch for both cases follows.
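The following hedged Java sketch shows both cases with the standard producer client: a null key (records spread across partitions) and a fixed key (records always in the same partition). The topic topic-t, the key Prod_id_1, and the broker address are illustrative assumptions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key (null): records are spread across partitions (round-robin / sticky batching).
            producer.send(new ProducerRecord<>("topic-t", null, "event without key"));

            // With a key: all records sharing the key "Prod_id_1" hash to the same partition,
            // so their relative order is preserved within that partition.
            producer.send(new ProducerRecord<>("topic-t", "Prod_id_1", "event for product 1"));
            producer.send(new ProducerRecord<>("topic-t", "Prod_id_1", "another event for product 1"));
        }
    }
}
```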
• Acknowledgment
• In order to write data to the Kafka cluster, the producer has another choice of
acknowledgment. It means the producer can get a confirmation of its data writes by
receiving the following acknowledgments:
• acks=0: The producer sends the data to the broker but does not wait for any acknowledgement. This can lead to data loss, because the producer moves on without confirming that the data actually reached the broker (the broker may even be down).
• acks=1: The producer waits for the leader's acknowledgement. The leader confirms once it has written the data and returns feedback to the producer; followers may not yet have copied it, so there is still a small chance of data loss.
• acks=all: Here, the acknowledgement comes from the leader and all of its in-sync followers. Only when they have all acknowledged the data is the write considered successful, so in this case there is effectively no data loss. A hedged configuration sketch follows.
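As a rough sketch of how the acknowledgement level is chosen in practice, the standard producer exposes it as the acks configuration. The broker address, topic name, and key/value are assumptions; blocking on send().get() simply makes the acknowledgement visible in the example.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.Properties;

public class AcksSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // "0"   -> fire and forget (possible data loss)
        // "1"   -> wait for the leader's acknowledgement only
        // "all" -> wait until the leader and all in-sync replicas have the record
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("topic-x", "key", "durable message"))
                    .get(); // blocks until the broker acknowledges per the acks setting
            System.out.printf("written to partition %d at offset %d%n",
                    meta.partition(), meta.offset());
        }
    }
}
```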
• Let's see an example
• Suppose a producer writes data to Broker 1, Broker 2, and Broker 3.
• Case 1: The producer sends data to the brokers without waiting for any acknowledgment (acks=0). If a write is lost, the correct data never reaches the consumers, so severe data loss is possible.
Case 2: The producer sends data to the brokers, and Broker 1 holds the leader for the partition. The leader confirms that it has successfully written the data and sends an acknowledgment back to the producer (acks=1).
Case 3: The producer sends data to the brokers, and the leader waits until its replicas/ISRs have also received the data before acknowledging the producer (acks=all).
Note: In the accompanying figure, Broker 1 and Broker 2 have successfully received the data, so both brokers respond 'Yes' for their respective partitions.