
Big Data (Hadoop)

 Big Data refers to the growing challenge that organisations face as they deal with large and fast-growing sources of data or information.

 Big Data deals with petabytes of data (traditional systems deal with only terabytes).

 Big Data challenges include,


 Capturing data
 Data Storage
 Data Analysis
 Searching
 Data Transfer
 Visualization
 Attributes of Big Data:

 Velocity

 Volume

 Variety

 Big Data results in:

 Large and growing files

 Arriving at high speed

 In various formats
Hadoop:

 Hadoop is an Open Source Software Framework.

 Hadoop is used for distributed storage and processing of Big Data datasets.

 The objective of Hadoop is to support running applications on Big Data.

 Hadoop deals with:
 Storage
 Processing
Key Features Of Hadoop:

 Open Source

 Distributed Technology

 Batch Processing

 Fault tolerance

 Replication

 Scalability
Commodity Hardware for Hadoop:

 Low-cost hardware.

 Inexpensive software.

 Distributions of Hadoop:

 Cloudera

 MapR

 Hortonworks

 Apache
Hadoop Cluster Nodes:

 Hadoop cluster nodes handle both storage and processing.

[Diagram: on each node, HDFS provides the storage layer and MapReduce provides the processing layer.]

 Data is stored in blocks.

 The default block size is 64 MB.

 The block size is configured in:

/home/hadoop/conf/hdfs-site.xml
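A minimal sketch of the relevant property in hdfs-site.xml, assuming Hadoop 1.x naming (dfs.block.size; later releases use dfs.blocksize) and the 64 MB default mentioned above:

  <property>
    <name>dfs.block.size</name>
    <!-- 64 MB expressed in bytes; assumption: the cluster keeps the default -->
    <value>67108864</value>
  </property>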
Hadoop Architecture:
 Components of Hadoop Architecture,

 Name Node

 Secondary Name Node

 Data Node

 Job Tracker

 Task Tracker
Diagrammatic representation:

[Diagram: the master node runs the Name Node (HDFS) and the Job Tracker (MapReduce); each slave node runs a Data Node (HDFS) and a Task Tracker (MapReduce).]
Name Node:

The Name Node divides a file/application into blocks based on the configuration.

The Name Node gives the physical locations of blocks in the Hadoop cluster.

The Name Node deals with metadata only.


Data Node:

Each slave node is called a Data Node.

There is no threshold on the number of Data Nodes.

The number of Data Nodes increases with the volume of data.

It is the work-horse of the Hadoop file system.


Secondary Name Node:

The SNN performs functionality similar to the Name Node.

It gives the physical addresses/locations of blocks and combines the blocks.

The SNN is not a direct backup node for the primary Name Node.


Job Tracker:

The Job Tracker is always associated with the Name Node only.

The responsibilities of the Job Tracker are:

 Assigning tasks

 Scheduling tasks

 Re-scheduling tasks
Task Tracker:

The responsibility of the Task Tracker is to execute the tasks assigned by the Job Tracker.

The communication between the Job Tracker and the Task Tracker happens through MapReduce jobs (MR jobs).
Hadoop Ecosystem:

 HDFS
 MapReduce
 Hive
 Pig
 Sqoop
 HBase
 Oozie
 Flume
 Mahout
 Impala
 YARN
HDFS:

Each node contains a Local File System (LFS), with HDFS and MapReduce (MR) layered on top of it.

On node failure, the LFS still holds the node's data, but that information is no longer available to HDFS/MR.

The metadata files are FSImage and EditLog.


HDFS Features:

Support for very Large Files.

Commodity Hardware.

High Latency.

Streaming Access/Sequential file Access.

WRITE ONCE and READ many times.
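The write-once/read-many access pattern can be illustrated with the Hadoop FileSystem Java API. A minimal sketch, assuming a configured client and a hypothetical HDFS path /user/hadoop/demo.txt:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteOnceReadMany {
      public static void main(String[] args) throws Exception {
          // Picks up core-site.xml / hdfs-site.xml from the classpath.
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path file = new Path("/user/hadoop/demo.txt");   // hypothetical path

          // WRITE ONCE: create the file and stream data into it sequentially.
          try (FSDataOutputStream out = fs.create(file)) {
              out.writeBytes("hello hdfs\n");
          }

          // READ many times: reopen the file and stream through it sequentially.
          try (BufferedReader in =
                   new BufferedReader(new InputStreamReader(fs.open(file)))) {
              String line;
              while ((line = in.readLine()) != null) {
                  System.out.println(line);
              }
          }
      }
  }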


MapReduce:

MapReduce is built on top of HDFS.

It processes huge amounts of data in a highly parallel manner on commodity machines.

The MapReduce component works on a KEY-VALUE architecture.

It has two processing daemons:

 Job Tracker

 Task Tracker
Phases in MapReduce:

 There are three phases,

 MAPPER Phase

Sort & Shuffle Phase

REDUCER Phase

input → Mapper → (K,V) → Sort & Shuffle → (K,V) → Reducer → output
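A minimal word-count sketch in Java showing how key-value pairs flow through the Mapper and Reducer phases (the class names are illustrative; the Sort & Shuffle phase between them is handled by the framework):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Mapper phase: emits (word, 1) for every word in an input line.
  class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
              if (!token.isEmpty()) {
                  word.set(token);
                  context.write(word, ONE);   // (K,V) handed to Sort & Shuffle
              }
          }
      }
  }

  // Reducer phase: receives (word, [1,1,...]) after Sort & Shuffle and sums the counts.
  class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
              sum += v.get();
          }
          context.write(key, new IntWritable(sum));
      }
  }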
File Formats in MapReduce:

FileInputFormat (FIF)

FileOutputFormat (FOF)

TextInputFormat (TIF)

TextOutputFormat (TOF)

KeyValueTextInputFormat (KVTIF)

NLineInputFormat (NLINE)

DBInputFormat (DBIF)
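A hedged sketch of a job driver showing where these format classes plug in; it reuses the hypothetical WordCountMapper/WordCountReducer classes sketched earlier, and the input/output paths come from the command line:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  public class WordCountDriver {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCountDriver.class);
          job.setMapperClass(WordCountMapper.class);
          job.setReducerClass(WordCountReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          // TIF/TOF: plain-text input and output; KeyValueTextInputFormat,
          // NLineInputFormat or DBInputFormat could be set here instead.
          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }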
Combiner:

The Combiner is one of the predefined features of MapReduce.

It is applied to the output of the Mapper class.

It achieves network optimisation by reducing the data shuffled to the Reducers.

jobObj.setCombinerClass(<CombinerClassName>.class);
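For the word-count example above, the combining logic is the same as the reduce logic, so the reducer class itself can be registered as the combiner. A minimal sketch, using the hypothetical class names from the earlier driver:

  // In the job driver, alongside setMapperClass(...) and setReducerClass(...):
  job.setCombinerClass(WordCountReducer.class);  // local pre-aggregation on each mapper node cuts shuffle traffic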
PIG:

Pig is one of the components of Hadoop, built on top of HDFS.

It is an abstract, high-level language on top of the MapReduce programming model.

Pig is meant for querying, data summarisation and advanced querying.

Pig Latin is the language used to express Pig statements.


Different Modes of Pig Execution:

Local Mode: data is read from and written to the LFS.

HDFS Mode/MapReduce Mode: data is read from and written to HDFS.
Different flavours of PIG Execution:

Grunt Shell

Script Mode

Embedded Mode
HIVE:

Hive is one of the components of Hadoop, built on top of HDFS.

It is a warehouse-like system in Hadoop.

Hive is meant for data summarisation, querying and advanced querying.

The complete data in Hive is organised by means of two kinds of tables:

 Managed tables

 External tables
SQOOP:

Sqoop is one of the components of Hadoop, built on top of HDFS.

It is meant for interacting with RDBMSs.

It imports data from RDBMS tables into the Hadoop world (HDFS).

It exports processed data from the Hadoop world (HDFS) to RDBMS tables.
HBASE:

HBase is built on top of HDFS and is used for performing real-time random reads/writes.

HBase is an open-source, distributed, scalable, fault-tolerant, multi-dimensional, versioned and column-oriented database.

It does not have a query language.

It cannot be used for transaction processing.


OOZIE:

Oozie is meant for creating workflows and scheduling them, i.e. it is the job-scheduling tool in Hadoop.

Oozie is an open-source, distributed, scalable, fault-tolerant, Java-based web application accessed through a GUI.

Oozie works on a principle called the Directed Acyclic Graph (DAG).

It is a sequential way of executing jobs.


FLUME:

Flume is used for collecting live streaming data and distributing it over HDFS paths.

Flume Source: collects the data from events.

Interceptors: send the intercepted event data on to the collectors.

Collectors: convert the data into the required format and pass it to the Flume Sink.
