Big Data and Hadoop - Suzanne
Big data is commonly characterized by five dimensions, often called the five Vs:
1. Volume: Organizations collect data from a variety of sources, including business transactions,
smart (IoT) devices, industrial equipment, videos, social media and more.
2. Velocity: With the growth in the IoT, data streams into businesses at an unprecedented speed
and must be handled in a timely manner.
3. Variety: Data comes in all types of formats – from structured, numeric data in traditional
databases to unstructured text documents, emails, videos, audio, stock ticker data and
financial transactions.
4. Variability: Data flows are unpredictable – changing often and varying greatly.
5. Veracity: Veracity refers to the quality of data. Because data comes from so many different
sources, it’s difficult to link, match, cleanse and transform data across systems.
Big data analytics is the complex process of examining large and varied data sets, or big data, to
uncover information such as hidden patterns, unknown correlations, market trends and customer
preferences that can help organizations make informed business decisions.
Hadoop has a Master-Slave Architecture for distributed data storage and processing, using HDFS for
storage and MapReduce for processing.
Slave nodes are the additional machines in the Hadoop cluster that store data and carry out complex
computations. Each slave node runs a Task Tracker and a Data Node, which synchronize with the Job
Tracker and the Name Node respectively. In Hadoop, both master and slave nodes can be set up in the
cloud or on-premises.
HDFS is a distributed file system that handles large data sets running on commodity hardware. It is
used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes.
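As a concrete illustration of how a client interacts with HDFS, here is a minimal sketch in Java using the Hadoop FileSystem API; the NameNode address (hdfs://namenode:9000) and the file path are placeholder assumptions rather than values from this report.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");

            // Write a small file; HDFS splits larger files into blocks
            // and replicates them across DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back through the NameNode/DataNode pipeline.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}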
MapReduce refers to two distinct tasks that Hadoop programs perform. The first is the map job,
which takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and
combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce
job is always performed after the map job.
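To make the map and reduce phases concrete, below is the standard word-count example in Java, essentially the one shipped with the Hadoop documentation: the map job emits (word, 1) tuples and the reduce job combines them into per-word counts. The HDFS input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: break each input line into (word, 1) tuples.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: combine the tuples for each word into a single count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}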
YARN (Yet Another Resource Negotiator) is the cluster management system of Hadoop. It was
introduced with Hadoop 2.0 to support general distributed computing and to improve the
implementation of MapReduce. In YARN there are still data nodes, but there are no longer Task
Trackers or Job Trackers.
YARN Architecture
1) Resource Manager : It manages the resources used across the cluster. Two components of the
Resource Manager are:
Scheduler: Allocates resources to the running applications based on capacity and queue.
Application Manager: Manages the running of the Application Master in the cluster and, on failure of
the Application Master container, helps in restarting it.
2) Node Manager : The Node Manager launches and monitors the containers and is responsible for the execution of tasks on each data node.
3) Containers : A container is a bundle of resources, such as CPU and memory, on a single node;
containers are scheduled by the Resource Manager and monitored by the Node Manager.
4) Application Master : It monitors the execution of tasks and also manages the lifecycle of
applications running on the cluster.
To run an application through YARN, the steps below are performed (a client-side sketch follows the list).
1. The client contacts the Resource Manager (RM) and submits the YARN application.
2. The RM searches for a Node Manager that can launch the Application Master in a container.
3. The Application Master can either run the computation in the container in which it is currently
running and return the result to the client, or it can request more containers from the Resource
Manager; the latter is what enables distributed computing.
4. The client then contacts the Resource Manager to monitor the status of the application.
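The sketch below, loosely modelled on the distributed-shell example that ships with Hadoop, shows these steps from the client side using the YarnClient API; the application name, the echo command used as a stand-in Application Master, and the memory/vcore values are placeholder assumptions.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleYarnSubmit {
    public static void main(String[] args) throws Exception {
        // Step 1: the client contacts the Resource Manager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("hello-yarn");

        // Step 2: describe the container in which the RM should launch the
        // Application Master. Here the "AM" is just a placeholder shell command.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),          // local resources
                Collections.emptyMap(),          // environment
                Collections.singletonList("echo hello from the AM container"),
                null, null, null);
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(512, 1)); // 512 MB, 1 vcore
        appContext.setQueue("default");

        ApplicationId appId = yarnClient.submitApplication(appContext);

        // Step 4: poll the Resource Manager for the application's status.
        YarnApplicationState state;
        do {
            Thread.sleep(1000);
            ApplicationReport report = yarnClient.getApplicationReport(appId);
            state = report.getYarnApplicationState();
            System.out.println(appId + " is " + state);
        } while (state != YarnApplicationState.FINISHED
                && state != YarnApplicationState.FAILED
                && state != YarnApplicationState.KILLED);

        yarnClient.stop();
    }
}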
Apache Sqoop
There was a need for a tool that could import and export data between relational databases and
Hadoop. This is why Apache Sqoop was born. Sqoop integrates easily with Hadoop and dumps
structured data from relational databases onto HDFS, complementing the power of Hadoop.
When we submit a Sqoop command, the main task gets divided into subtasks, each of which is
handled internally by an individual map task. Each map task imports part of the data into the Hadoop
ecosystem, and collectively the map tasks import the whole data set. Export works in a similar
manner: the export tool exports a set of files from HDFS back to an RDBMS. The files given as input
to Sqoop contain records, which are called rows in the table.
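As a rough illustration, the sketch below simply shells out from Java to the sqoop command-line tool to run a typical import; the JDBC URL, credentials file, table name, target directory and mapper count are placeholder assumptions and must be adapted to a real environment where the sqoop binary is on the PATH.

import java.util.Arrays;
import java.util.List;

public class SqoopImportLauncher {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; replace with your own RDBMS settings.
        List<String> command = Arrays.asList(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost:3306/sales",
                "--username", "etl_user",
                "--password-file", "/user/etl_user/db.password",
                "--table", "orders",            // source table in the RDBMS
                "--target-dir", "/data/orders", // destination directory in HDFS
                "--num-mappers", "4");          // number of parallel map tasks

        // Each of the 4 map tasks imports one slice of the table into HDFS.
        Process process = new ProcessBuilder(command).inheritIO().start();
        System.exit(process.waitFor());
    }
}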
Apache Flume
A Flume agent moves streaming data into HDFS through three components:
Source: It accepts the data from the incoming stream and stores the data in the channel.
Channel: It acts as local, temporary storage between the data source and the persistent data in HDFS.
Sink: It collects the data from the channel and commits or writes the data to HDFS permanently.
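As a rough sketch of the source-channel-sink flow, the example below uses Flume's embedded agent, in which the application itself plays the role of the source. The embedded agent supports only a limited set of components, so a memory channel and an Avro sink are used here (forwarding to a downstream collector agent that would typically write to HDFS); the collector hostname and port are placeholder assumptions.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class FlumeEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Memory channel buffers events; the Avro sink drains them to a
        // downstream Flume agent (placeholder host and port).
        Map<String, String> properties = new HashMap<>();
        properties.put("channel.type", "memory");
        properties.put("channel.capacity", "200");
        properties.put("sinks", "sink1");
        properties.put("sink1.type", "avro");
        properties.put("sink1.hostname", "collector.example.com");
        properties.put("sink1.port", "5564");
        properties.put("processor.type", "default");

        EmbeddedAgent agent = new EmbeddedAgent("demoAgent");
        agent.configure(properties);
        agent.start();

        // The application acts as the source: it puts events into the channel.
        agent.put(EventBuilder.withBody("hello flume", StandardCharsets.UTF_8));

        agent.stop();
    }
}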
Apache Pig
Apache Hive
Apache HBase
HBase is an open-source, multidimensional, distributed, scalable NoSQL database written in Java.
HBase runs on top of HDFS and provides BigTable-like capabilities to Hadoop. It is designed to
provide a fault-tolerant way of storing large collections of sparse data.
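A minimal sketch of the HBase Java client API is shown below, assuming a table named users with a column family info already exists and that hbase-site.xml is available on the classpath; the row key and cell value are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster settings.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}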
A big data engineer builds what the big data solutions architect has designed. Big data engineers
develop, maintain, test and evaluate big data solutions within organisations. Most of the time they
are also involved in the design of big data solutions, because of the experience they have with
Hadoop-based technologies such as MapReduce and Hive, and with NoSQL databases such as
MongoDB or Cassandra. A big data engineer builds large-scale data processing systems, is an expert
in data warehousing solutions and should be able to work with the latest (NoSQL) database
technologies. The figure below depicts a learning path for a big data engineer.