Big Data and Hadoop - Suzanne
Big data is commonly characterized by five dimensions, often called the five Vs:
1. Volume: Organizations collect data from a variety of sources, including business transactions,
smart (IoT) devices, industrial equipment, videos, social media and more.
2. Velocity: With the growth in the IoT, data streams into businesses at an unprecedented speed
and must be handled in a timely manner.
3. Variety: Data comes in all types of formats – from structured, numeric data in traditional
databases to unstructured text documents, emails, videos, audio, stock ticker data and
financial transactions.
4. Variability: Data flows are unpredictable – changing often and varying greatly.
5. Veracity: Veracity refers to the quality of data. Because data comes from so many different
sources, it’s difficult to link, match, cleanse and transform data across systems.
Big data analytics is the complex process of examining large and varied data sets, or big data, to
uncover information such as hidden patterns, unknown correlations, market trends and customer
preferences that can help organizations make informed business decisions.
Hadoop has a Master-Slave Architecture for distributed data storage and processing, using HDFS for
storage and MapReduce for processing.
Slave nodes are the additional machines in the Hadoop cluster that store data and carry out complex
computations. Each slave node runs a Task Tracker and a Data Node, which synchronize with the Job
Tracker and the Name Node respectively. In Hadoop, both master and slave nodes can be set up in the
cloud or on-premises.
HDFS is a distributed file system that handles large data sets running on commodity hardware. It is
used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes.
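As a concrete illustration of how a client interacts with HDFS, here is a minimal sketch in Java using the Hadoop FileSystem API; the NameNode address (hdfs://namenode:9000) and the file path are placeholder assumptions rather than values from this report.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");

            // Write a small file; HDFS splits larger files into blocks
            // and replicates them across DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back through the NameNode/DataNode pipeline.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}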
MapReduce refers to two distinct tasks that Hadoop programs perform. The first is the map job,
which takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and
combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce
job is always performed after the map job.
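To make the map and reduce phases concrete, below is the standard word-count example in Java, essentially the one shipped with the Hadoop documentation: the map job emits (word, 1) tuples and the reduce job combines them into per-word counts. The HDFS input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: break each input line into (word, 1) tuples.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: combine the tuples for each word into a single count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}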
YARN (Yet Another Resource Negotiator) is the cluster management system of Hadoop. It was
introduced with Hadoop 2.0 to support general distributed computing and to improve the
implementation of MapReduce. In YARN there are still data nodes, but there are no longer Task
Trackers or Job Trackers.
YARN Architecture
1) Resource Manager : It manages the resources used across the cluster. Two components of the
Resource Manager are:
Scheduler: Allocates resources to the running applications based on capacity and queue.
Application Manager: Manages the running of the Application Master in the cluster and, on failure of
the Application Master container, helps in restarting it.
2) Node Manager : The Node Manager launches and monitors the containers and is responsible for the execution of tasks on each data node.
3) Containers : A container is a bundle of resources, such as CPU and memory, on a single node;
containers are scheduled by the Resource Manager and monitored by the Node Manager.
4) Application Master : It monitors the execution of tasks and also manages the lifecycle of
applications running on the cluster.
To run an application through YARN, the steps below are performed (a client-side sketch follows the list).
1. The client contacts the Resource Manager (RM) and submits the YARN application.
2. The RM searches for a Node Manager that can launch the Application Master in a container.
3. The Application Master can either run the computation in the container in which it is currently
running and return the result to the client, or it can request more containers from the Resource
Manager; the latter is what enables distributed computing.
4. The client then contacts the Resource Manager to monitor the status of the application.
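The sketch below, loosely modelled on the distributed-shell example that ships with Hadoop, shows these steps from the client side using the YarnClient API; the application name, the echo command used as a stand-in Application Master, and the memory/vcore values are placeholder assumptions.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleYarnSubmit {
    public static void main(String[] args) throws Exception {
        // Step 1: the client contacts the Resource Manager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("hello-yarn");

        // Step 2: describe the container in which the RM should launch the
        // Application Master. Here the "AM" is just a placeholder shell command.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),          // local resources
                Collections.emptyMap(),          // environment
                Collections.singletonList("echo hello from the AM container"),
                null, null, null);
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(512, 1)); // 512 MB, 1 vcore
        appContext.setQueue("default");

        ApplicationId appId = yarnClient.submitApplication(appContext);

        // Step 4: poll the Resource Manager for the application's status.
        YarnApplicationState state;
        do {
            Thread.sleep(1000);
            ApplicationReport report = yarnClient.getApplicationReport(appId);
            state = report.getYarnApplicationState();
            System.out.println(appId + " is " + state);
        } while (state != YarnApplicationState.FINISHED
                && state != YarnApplicationState.FAILED
                && state != YarnApplicationState.KILLED);

        yarnClient.stop();
    }
}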
Apache Sqoop
There was a need for a tool that could import and export data between relational databases and
Hadoop. This is why Apache Sqoop was born. Sqoop integrates easily with Hadoop and dumps
structured data from relational databases onto HDFS, complementing the power of Hadoop.
When we submit a Sqoop command, the main task gets divided into subtasks, each of which is
handled internally by an individual map task. Each map task imports part of the data into the Hadoop
ecosystem, and collectively the map tasks import the whole data set. Export works in a similar
manner: the export tool exports a set of files from HDFS back to an RDBMS. The files given as input
to Sqoop contain records, which are called rows in the table.
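As a rough illustration, the sketch below simply shells out from Java to the sqoop command-line tool to run a typical import; the JDBC URL, credentials file, table name, target directory and mapper count are placeholder assumptions and must be adapted to a real environment where the sqoop binary is on the PATH.

import java.util.Arrays;
import java.util.List;

public class SqoopImportLauncher {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; replace with your own RDBMS settings.
        List<String> command = Arrays.asList(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost:3306/sales",
                "--username", "etl_user",
                "--password-file", "/user/etl_user/db.password",
                "--table", "orders",            // source table in the RDBMS
                "--target-dir", "/data/orders", // destination directory in HDFS
                "--num-mappers", "4");          // number of parallel map tasks

        // Each of the 4 map tasks imports one slice of the table into HDFS.
        Process process = new ProcessBuilder(command).inheritIO().start();
        System.exit(process.waitFor());
    }
}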
Apache Flume
A Flume agent moves streaming data into HDFS through three components:
Source: It accepts the data from the incoming stream and stores the data in the channel.
Channel: It acts as local, temporary storage between the data source and the persistent data in HDFS.
Sink: It collects the data from the channel and commits or writes the data to HDFS permanently.
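As a rough sketch of the source-channel-sink flow, the example below uses Flume's embedded agent, in which the application itself plays the role of the source. The embedded agent supports only a limited set of components, so a memory channel and an Avro sink are used here (forwarding to a downstream collector agent that would typically write to HDFS); the collector hostname and port are placeholder assumptions.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class FlumeEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Memory channel buffers events; the Avro sink drains them to a
        // downstream Flume agent (placeholder host and port).
        Map<String, String> properties = new HashMap<>();
        properties.put("channel.type", "memory");
        properties.put("channel.capacity", "200");
        properties.put("sinks", "sink1");
        properties.put("sink1.type", "avro");
        properties.put("sink1.hostname", "collector.example.com");
        properties.put("sink1.port", "5564");
        properties.put("processor.type", "default");

        EmbeddedAgent agent = new EmbeddedAgent("demoAgent");
        agent.configure(properties);
        agent.start();

        // The application acts as the source: it puts events into the channel.
        agent.put(EventBuilder.withBody("hello flume", StandardCharsets.UTF_8));

        agent.stop();
    }
}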
Apache Pig
Apache Hive
Apache HBase
HBase is an open-source, multidimensional, distributed, scalable NoSQL database written in Java.
HBase runs on top of HDFS and provides BigTable-like capabilities to Hadoop. It is designed to
provide a fault-tolerant way of storing large collections of sparse data.
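A minimal sketch of the HBase Java client API is shown below, assuming a table named users with a column family info already exists and that hbase-site.xml is available on the classpath; the row key and cell value are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster settings.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}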
A big data engineer builds what the big data solutions architect has designed. Big data engineers
develop, maintain, test and evaluate big data solutions within organisations. Most of the time they
are also involved in the design of big data solutions, because of the experience they have with
Hadoop-based technologies such as MapReduce and Hive, and with NoSQL databases such as
MongoDB or Cassandra. A big data engineer builds large-scale data processing systems, is an expert
in data warehousing solutions and should be able to work with the latest (NoSQL) database
technologies. The figure below depicts a learning path for a big data engineer.