
Big Data (Hadoop)

 Big Data refers to the growing challenge that organisations face as they deal with large and fast-growing sources of data or information.

 Big Data deals with petabytes of data (traditional systems deal with only terabytes).

 Big Data challenges include,


 Capturing data
 Data Storage
 Data Analysis
 Searching
 Data Transfer
 Visualization
 Attributes of Big Data:

 Velocity

 Volume

 Variety

 Big Data results in:

 Large and growing files

 Arriving at high speed

 In various formats
Hadoop:

 Hadoop is an Open Source Software Framework.

 Hadoop is used for distributed storage and processing of Big Data datasets.

 The objective of Hadoop is to support running applications on Big Data.

 Hadoop deals with:
 Storage
 Processing
Key Features Of Hadoop:

 Open Source

 Distributed Technology

 Batch Processing

 Fault tolerance

 Replication

 Scalability
Commodity Hardware for Hadoop:

 Low-cost hardware.

 Inexpensive software.

 Distributions of Hadoop:

 Cloudera

 MapR

 Hortonworks

 Apache
Hadoop Cluster Nodes:

 Hadoop cluster nodes handle both storage and processing.

[Diagram: on each node, HDFS provides the storage layer and MapReduce provides the processing layer.]

 Data is stored in blocks.

 The default block size is 64 MB.

 The block size is configured in:

/home/hadoop/conf/hdfs-site.xml
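A minimal sketch of the relevant property in hdfs-site.xml, assuming Hadoop 1.x naming (dfs.block.size; later releases use dfs.blocksize) and the 64 MB default mentioned above:

  <property>
    <name>dfs.block.size</name>
    <!-- 64 MB expressed in bytes; assumption: the cluster keeps the default -->
    <value>67108864</value>
  </property>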
Hadoop Architecture:
 Components of Hadoop Architecture,

 Name Node

 Secondary Name Node

 Data Node

 Job Tracker

 Task Tracker
Diagrammatic representation:

[Diagram: the master node runs the Name Node (HDFS) and the Job Tracker (MapReduce); each slave node runs a Data Node (HDFS) and a Task Tracker (MapReduce).]
Name Node:

The Name Node divides a file/application into blocks based on the configuration.

The Name Node gives the physical locations of blocks in the Hadoop cluster.

The Name Node deals with metadata only.


Data Node:

Each slave node is called a Data Node.

There is no threshold on the number of Data Nodes.

The number of Data Nodes increases with the volume of data.

It is the work-horse of the Hadoop file system.


Secondary Name Node:

The SNN performs functionality similar to the Name Node.

It gives the physical addresses/locations of blocks and combines the blocks.

The SNN is not a direct backup node for the primary Name Node.


Job Tracker:

The Job Tracker is always associated with the Name Node only.

The responsibilities of the Job Tracker are:

 Assigning tasks

 Scheduling tasks

 Re-scheduling tasks
Task Tracker:

The responsibility of the Task Tracker is to execute the tasks assigned by the Job Tracker.

The communication between the Job Tracker and the Task Tracker happens through MapReduce jobs (MR jobs).
Hadoop Ecosystem:

 HDFS
 MapReduce
 Hive
 Pig
 Sqoop
 HBase
 Oozie
 Flume
 Mahout
 Impala
 YARN
HDFS:

Each node contains a Local File System (LFS), with HDFS and MapReduce (MR) layered on top of it.

On node failure, the LFS still holds the node's data, but that information is no longer available to HDFS/MR.

The metadata files are FSImage and EditLog.


HDFS Features:

Support for very Large Files.

Commodity Hardware.

High Latency.

Streaming Access/Sequential file Access.

WRITE ONCE and READ many times.
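The write-once/read-many access pattern can be illustrated with the Hadoop FileSystem Java API. A minimal sketch, assuming a configured client and a hypothetical HDFS path /user/hadoop/demo.txt:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteOnceReadMany {
      public static void main(String[] args) throws Exception {
          // Picks up core-site.xml / hdfs-site.xml from the classpath.
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path file = new Path("/user/hadoop/demo.txt");   // hypothetical path

          // WRITE ONCE: create the file and stream data into it sequentially.
          try (FSDataOutputStream out = fs.create(file)) {
              out.writeBytes("hello hdfs\n");
          }

          // READ many times: reopen the file and stream through it sequentially.
          try (BufferedReader in =
                   new BufferedReader(new InputStreamReader(fs.open(file)))) {
              String line;
              while ((line = in.readLine()) != null) {
                  System.out.println(line);
              }
          }
      }
  }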


MapReduce:

MapReduce is built on top of HDFS.

It processes huge amounts of data in a highly parallel manner on commodity machines.

The MapReduce component works on a KEY-VALUE architecture.

It has two processing daemons:

 Job Tracker

 Task Tracker
Phases in MapReduce:

 There are three phases,

 MAPPER Phase

Sort & Shuffle Phase

REDUCER Phase

input → Mapper → (K,V) → Sort & Shuffle → (K,V) → Reducer → output
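A minimal word-count sketch in Java showing how key-value pairs flow through the Mapper and Reducer phases (the class names are illustrative; the Sort & Shuffle phase between them is handled by the framework):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Mapper phase: emits (word, 1) for every word in an input line.
  class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
              if (!token.isEmpty()) {
                  word.set(token);
                  context.write(word, ONE);   // (K,V) handed to Sort & Shuffle
              }
          }
      }
  }

  // Reducer phase: receives (word, [1,1,...]) after Sort & Shuffle and sums the counts.
  class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
              sum += v.get();
          }
          context.write(key, new IntWritable(sum));
      }
  }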
File Formats in MapReduce:

FileInputFormat (FIF)

FileOutputFormat (FOF)

TextInputFormat (TIF)

TextOutputFormat (TOF)

KeyValueTextInputFormat (KVTIF)

NLineInputFormat (NLINE)

DBInputFormat (DBIF)
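A hedged sketch of a job driver showing where these format classes plug in; it reuses the hypothetical WordCountMapper/WordCountReducer classes sketched earlier, and the input/output paths come from the command line:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  public class WordCountDriver {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCountDriver.class);
          job.setMapperClass(WordCountMapper.class);
          job.setReducerClass(WordCountReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          // TIF/TOF: plain-text input and output; KeyValueTextInputFormat,
          // NLineInputFormat or DBInputFormat could be set here instead.
          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }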
Combiner:

The Combiner is one of the predefined features of MapReduce.

It is applied to the output of the Mapper class.

It achieves network optimisation by reducing the data shuffled to the Reducers.

jobObj.setCombinerClass(<CombinerClassName>.class);
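For the word-count example above, the combining logic is the same as the reduce logic, so the reducer class itself can be registered as the combiner. A minimal sketch, using the hypothetical class names from the earlier driver:

  // In the job driver, alongside setMapperClass(...) and setReducerClass(...):
  job.setCombinerClass(WordCountReducer.class);  // local pre-aggregation on each mapper node cuts shuffle traffic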
PIG:

Pig is one of the components of Hadoop, built on top of HDFS.

It is an abstract, high-level language on top of the MapReduce programming model.

Pig is meant for querying, data summarisation and advanced querying.

Pig Latin is the language used to express Pig statements.


Different Modes of Pig Execution:

Local Mode: data is read from and written to the LFS.

HDFS Mode/MapReduce Mode: data is read from and written to HDFS.
Different flavours of PIG Execution:

Grunt Shell

Script Mode

Embedded Mode
HIVE:

Hive is one of the components of Hadoop, built on top of HDFS.

It is a warehouse-like system in Hadoop.

Hive is meant for data summarisation, querying and advanced querying.

The complete data in Hive is organised by means of two kinds of tables:

 Managed tables

 External tables
SQOOP:

Sqoop is one of the components of Hadoop, built on top of HDFS.

It is meant for interacting with RDBMSs.

It imports data from RDBMS tables into the Hadoop world (HDFS).

It exports processed data from the Hadoop world (HDFS) to RDBMS tables.
HBASE:

HBase is built on top of HDFS and is used for performing real-time random reads/writes.

HBase is an open-source, distributed, scalable, fault-tolerant, multi-dimensional, versioned and column-oriented database.

It does not have a query language.

It cannot be used for transaction processing.


OOZIE:

Oozie is meant for creating workflows and scheduling them, i.e. it is the job-scheduling tool in Hadoop.

Oozie is an open-source, distributed, scalable, fault-tolerant, Java-based web application accessed through a GUI.

Oozie works on a principle called the Directed Acyclic Graph (DAG).

It is a sequential way of executing jobs.


FLUME:

Flume is used for collecting live streaming data and distributing it over HDFS paths.

Flume Source: collects the data from events.

Interceptors: send the intercepted event data on to the collectors.

Collectors: convert the data into the required format and pass it to the Flume Sink.
