
Cloud and Big Data Experiential Learning Report

Submitted by:

Suzanne Viju Cherian


HR C
Roll no. - 42302
Big Data and Hadoop

Introduction to Big Data


The term Big Data refers to all the data being generated across the globe at an unprecedented
rate. The concept of Big Data is commonly articulated through the 5 V's:

1. Volume: Organizations collect data from a variety of sources, including business transactions,
smart (IoT) devices, industrial equipment, videos, social media and more.
2. Velocity: With the growth in the IoT, data streams in to businesses at an unprecedented speed
and must be handled in a timely manner.
3. Variety: Data comes in all types of formats – from structured, numeric data in traditional
databases to unstructured text documents, emails, videos, audio, stock ticker data and
financial transactions.
4. Variability: Data flows are unpredictable – changing often and varying greatly.
5. Veracity: Veracity refers to the quality of data. Because data comes from so many different
sources, it’s difficult to link, match, cleanse and transform data across systems.

Big Data Analytics

Big data analytics is the complex process of examining large and varied data sets, or big data, to
uncover information such as hidden patterns, unknown correlations, market trends and customer
preferences that can help organizations make informed business decisions.

Big data analytics technologies and tools


Unstructured and semi-structured data types don't fit well in traditional data warehouses that are
based on relational databases oriented to structured data sets. Organizations that collect, process
and analyze big data use NoSQL databases, as well as Hadoop and its companion data analytics tools,
including:

YARN, MapReduce, Sqoop, HBase, Hive, Flume and Pig.

Apache Hadoop and Ecosystem

Apache Hadoop is an open-source software framework used to develop data processing
applications that are executed in a distributed computing environment.

Applications built using Hadoop run on large data sets distributed across clusters of
commodity computers. Commodity computers are cheap and widely available, and they are mainly
useful for achieving greater computational power at low cost. The Hadoop ecosystem is made up of
several components; for the sake of brevity, only some of them are explained in this report.
Hadoop Architecture

Hadoop has a Master-Slave Architecture for data storage and distributed data processing
using MapReduce and HDFS methods.

The Name Node keeps track of every file and directory used in the namespace.

The Data Node manages the state of an HDFS node and interacts with the data blocks.

The master node conducts parallel processing of the data using Hadoop MapReduce.

Slave nodes are the additional machines in the Hadoop cluster that store data and carry out
complex calculations. Each slave node also runs a Task Tracker and a Data Node, which
synchronize their work with the Job Tracker and the Name Node respectively. In Hadoop, the
master and slave systems can be set up in the cloud or on-premises.

HDFS is a distributed file system that handles large data sets and runs on commodity hardware. It is
used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes.
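
In practice, files are moved in and out of HDFS through the hdfs dfs command-line client. The following is a minimal sketch, driven from Python, that creates a directory in HDFS, copies a local file into it and lists its contents; the directory and file names are placeholder examples, and a running Hadoop installation is assumed.

import subprocess

def run(cmd):
    # Run an hdfs dfs command and print whatever it writes to standard output.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout)

# Create a directory in HDFS, copy a local file into it, then list the directory.
run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo/input"])
run(["hdfs", "dfs", "-put", "localdata.txt", "/user/demo/input/"])
run(["hdfs", "dfs", "-ls", "/user/demo/input"])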

MapReduce refers to two distinct tasks that Hadoop programs perform. The first is the map job,
which takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and
combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce
implies, the reduce job is always performed after the map job.
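
As an illustration, the following is a minimal word-count sketch written in the Hadoop Streaming style, where the mapper reads lines from standard input and emits key/value pairs and the reducer aggregates them; the shuffle/sort step is simulated locally, so the script runs on its own without a cluster.

import sys
from itertools import groupby

def mapper(lines):
    # Map job: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce job: sum the counts for each word (the pairs must arrive sorted by key).
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    mapped = sorted(mapper(sys.stdin))   # stands in for the shuffle/sort phase
    for word, total in reducer(mapped):
        print(word + "\t" + str(total))

For example, piping a text file into the script (python wordcount.py < input.txt) prints each distinct word together with its total count.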

YARN stands for Yet Another Resource Negotiator. It is the cluster management system of Hadoop,
introduced with Hadoop 2.0 to support general distributed computing and to improve the
implementation of MapReduce. In YARN, the Data Nodes remain, but there are no longer Task
Trackers or Job Trackers.

YARN Architecture
1) Resource Manager: It manages the resources used across the cluster. Its two components are:
Scheduler: Allocates resources to the running applications based on capacity and queue.
Application Manager: Manages the running of the Application Master in the cluster and, on
failure of the Application Master container, helps in restarting it.

2) Node Manager: The Node Manager launches and monitors the containers and is responsible for
the execution of tasks on each data node.

3) Containers: Containers are bundles of resources such as CPU and memory on a single node;
they are scheduled by the Resource Manager and monitored by the Node Manager.

4) Application Master: It monitors the execution of tasks and manages the lifecycle of
applications running on the cluster.

In order to run an application through YARN, the steps below are performed.

 The client submits the YARN application to the Resource Manager (RM).
 The RM finds a Node Manager that can launch the Application Master in a container.
 The Application Master can either run the task in the container in which it is running and
return the result to the client, or request more containers from the Resource Manager, which
is what makes the computation distributed.
 The client then contacts the Resource Manager to monitor the status of the application, as
sketched below.
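
As a small illustration of the last step, the status of submitted applications can be checked with the standard yarn application command-line client; the sketch below drives it from Python, and the application ID shown is only a placeholder.

import subprocess

# List the applications currently known to the Resource Manager.
print(subprocess.run(["yarn", "application", "-list"],
                     capture_output=True, text=True).stdout)

# Query the status of a single application by its ID (placeholder value).
app_id = "application_1700000000000_0001"
print(subprocess.run(["yarn", "application", "-status", app_id],
                     capture_output=True, text=True).stdout)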

Apache Sqoop

There was a need for a tool that could import and export data between relational databases and
Hadoop; this is why Apache Sqoop was born. Sqoop integrates easily with Hadoop and dumps
structured data from relational databases onto HDFS, complementing the power of Hadoop.

When we submit a Sqoop command, the main task gets divided into subtasks, each handled
internally by an individual map task. Each map task imports part of the data into the Hadoop
ecosystem; collectively, all the map tasks import the whole data set. Export works in a similar
manner: the export tool exports a set of files from HDFS back to an RDBMS. The files given as
input to Sqoop contain records, which are called rows in the table.
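
For illustration, a typical table import looks like the sketch below, launched from Python; the JDBC URL, credentials, table name and HDFS directory are placeholders, and -m sets the number of parallel map tasks described above.

import subprocess

# Sketch: import one relational table into HDFS with Sqoop.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",   # placeholder JDBC URL
    "--username", "report_user",                      # placeholder credentials
    "--password", "secret",
    "--table", "orders",                              # source table to import
    "--target-dir", "/user/demo/orders",              # HDFS destination directory
    "-m", "4",                                        # split the import across 4 map tasks
]
subprocess.run(cmd, check=True)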

Apache Flume

Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates and transports
large amounts of streaming data, such as log files and events from various sources like network
traffic, social media and email messages, to HDFS. Flume is highly reliable and distributed.

The Flume agent has three components: source, channel and sink; a minimal agent configuration
is sketched below.

Source: It accepts the data from the incoming stream and stores it in the channel.
Channel: It acts as local or temporary storage between the source of the data and the
persistent data in HDFS.
Sink: It collects the data from the channel and commits, or permanently writes, the data to HDFS.
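
A Flume agent is normally described in a properties file that wires a source, a channel and a sink together. The sketch below generates such a configuration from Python for a simple case: a netcat source that listens on a TCP port, a memory channel, and an HDFS sink; the host, port and HDFS path are placeholders.

# Sketch: write a minimal Flume agent configuration
# (netcat source -> memory channel -> HDFS sink) as a properties file.
config = """
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffer events in memory between the source and the sink
a1.channels.c1.type = memory

# Sink: write the buffered events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
"""

with open("example-agent.conf", "w") as f:
    f.write(config)

# The agent would then be started with something like:
#   flume-ng agent --name a1 --conf-file example-agent.conf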
Apache Pig

Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data
sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all
the data manipulation operations in Hadoop using Apache Pig. To analyze data using Apache Pig,
programmers write scripts in the Pig Latin language. All these scripts are internally converted
into Map and Reduce tasks: Apache Pig has a component known as the Pig Engine that accepts Pig
Latin scripts as input and converts them into MapReduce jobs.
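
For illustration, the sketch below writes a small word-count script in Pig Latin from Python and runs it in local mode with pig -x local; the input and output paths are placeholders.

import subprocess

# Sketch: a Pig Latin word-count script, generated from Python and run in local mode.
script = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO 'wordcount_out';
"""

with open("wordcount.pig", "w") as f:
    f.write(script)

# The Pig Engine turns the script into MapReduce jobs; "-x local" runs them on the local machine.
subprocess.run(["pig", "-x", "local", "-f", "wordcount.pig"], check=True)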

Apache Hive

Apache Hive is an open-source data warehouse system. It is used, often together with Apache Pig,
for loading and transforming unstructured, structured or semi-structured data for data analysis
and better business insights. Pig, a standard ETL scripting language, is used to export and
import data into Apache Hive and to process large numbers of datasets; it can be used for ETL
data pipelines and iterative processing.
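
Hive exposes its tables through HiveQL, a SQL-like query language. For illustration, the sketch below runs a few HiveQL statements through the hive -e command-line option; the table name, columns and HDFS file path are placeholders.

import subprocess

# Sketch: create a Hive table, load data into it from HDFS, and run an aggregate query.
hiveql = """
CREATE TABLE IF NOT EXISTS orders (id INT, product STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/demo/orders/part-m-00000' INTO TABLE orders;

SELECT product, SUM(amount) FROM orders GROUP BY product;
"""

subprocess.run(["hive", "-e", hiveql], check=True)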

Apache HBase
HBase is an open-source, multidimensional, distributed, scalable NoSQL database written in Java.
HBase runs on top of HDFS and provides BigTable-like capabilities to Hadoop. It is designed to
provide a fault-tolerant way of storing large collections of sparse data.
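
As an illustration, the sketch below uses the third-party happybase Python library (assumed to be available; HBase's native client API is in Java) to write and read back a single sparse row through HBase's Thrift gateway; the table name, column family and host are placeholders.

import happybase  # third-party client for HBase's Thrift gateway (assumed installed)

# Sketch: store one sparse row and read it back. Only the cells that are set take up space.
connection = happybase.Connection("localhost")
table = connection.table("user_events")

# HBase addresses values by (row key, column family:qualifier).
table.put(b"user42", {b"cf:last_login": b"2024-01-01", b"cf:country": b"IN"})
print(table.row(b"user42"))

connection.close()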

The Big Data Engineer

The big data engineer builds what the big data solutions architect has designed. Big data
engineers develop, maintain, test and evaluate big data solutions within organisations. Most of
the time they are also involved in the design of big data solutions, because of their experience
with Hadoop-based technologies such as MapReduce, Hive, MongoDB or Cassandra. A big data
engineer builds large-scale data processing systems, is an expert in data warehousing solutions
and should be able to work with the latest (NoSQL) database technologies.
