BDA Module 2-2023


By

Dr. Jagadamba G
Dept. of ISE, SIT, Tumakuru
Introduction to Hadoop

• Hadoop is not big data itself; however, it plays an integral part in big data.
• There are two versions of Hadoop: Hadoop 1.0 and Hadoop 2.0.
The Hadoop ecosystem is a comprehensive collection of tools and technologies that can be effectively implemented and deployed to provide Big Data solutions in a cost-effective manner.
Basic/Simple Hadoop System
Hadoop ecosystem elements at various stages
of data processing
Hadoop Distributed File System (HDFS)
Architecture
• Master-slave architecture
• The NameNode manages HDFS cluster metadata; DataNodes store the data
• Scalable distributed file system
• Data is distributed as blocks on the local disks of several nodes
Hadoop Distributed File System (HDFS)
Basic function of HDFS
• Manage data storage on the DataNodes
• DataNodes serve read and write requests from clients
• Block creation, deletion and replication operations are performed by DataNodes

• The default block size is 64 megabytes
• HDFS is designed to work well with large blocks
• A 10 GB file is therefore split into 10*1024/64 = 160 blocks
Hadoop Distributed File System (HDFS)
Data Replication: A client application does not need to track all the block replicas; HDFS directs the client to the nearest replica to ensure high performance.
Data Pipeline: A client application writes a block to the first DataNode in the
pipeline. Then this DataNode takes over and forwards the data to the next
node in the pipeline. This process continues for all the data blocks, and
subsequently all the replicas are written to the disk.
Read Operation in Hadoop
Write Operation in Hadoop
Command line interface
• We will view HDFS by interfacing with it from the command line (the command line is easy and popular)
• To run HDFS on a single machine, Hadoop is set up in pseudo-distributed mode
• In this setup HDFS runs on the local host and on the default HDFS port 8020, with fs.default.name set to hdfs://localhost/
• This property is used to figure out where the NameNode is running so that clients can connect to it
• User applications access the HDFS file system with the help of the HDFS client (a programmatic equivalent is sketched after the shell examples below)

Objective: To create a directory (say, sample) in HDFS.
Act: hadoop fs -mkdir /sample

Objective: To copy a file from the local file system to HDFS.
Act: hadoop fs -put /root/sample/test.txt /sample/test.txt

Objective: To copy a file from HDFS to the local file system.
Act: hadoop fs -get /sample/test.txt /root/sample/testsample.txt
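The same three operations can also be performed through the HDFS Java client mentioned above. A minimal sketch, assuming the Hadoop client libraries are on the classpath and the NameNode runs on localhost (the paths are the ones used in the shell examples):

// Programmatic equivalents of mkdir / put / get through the HDFS Java client.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS is the newer name of the fs.default.name property.
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/sample"));                              // hadoop fs -mkdir /sample
        fs.copyFromLocalFile(new Path("/root/sample/test.txt"),
                             new Path("/sample/test.txt"));          // hadoop fs -put ...
        fs.copyToLocalFile(new Path("/sample/test.txt"),
                           new Path("/root/sample/testsample.txt")); // hadoop fs -get ...
        fs.close();
    }
}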
Key features of HDFS

• Data Replication
• Data Resilience
• Data Integrity: ensured by maintaining transaction logs, validating checksums and creating data blocks (see the sketch after this list)
To provide flexibility and fault tolerance, HDFS also performs:
• Monitoring - through heartbeats
• Rebalancing - blocks are shifted to nodes where free space is available
• Metadata Replication
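The replication and checksum features listed above can also be observed from a client. A minimal sketch, assuming a file /sample/test.txt already exists in HDFS:

// Inspect the stored checksum and the per-file replication factor.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIntegrityExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/sample/test.txt");       // assumed to exist

        FileChecksum checksum = fs.getFileChecksum(file);   // checksum maintained by HDFS
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Checksum   : " + checksum);
        System.out.println("Replication: " + status.getReplication());

        fs.setReplication(file, (short) 3);                  // request 3 replicas for this file
        fs.close();
    }
}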
Introduction to MapReduce

• The algorithms developed and maintained by the Apache Hadoop project are implemented in the form of Hadoop MapReduce.
• MapReduce can be thought of as an engine that takes input data, processes it, generates the output and returns the required answers.
• MapReduce is based on a parallel programming framework.
• MapReduce facilitates processing and analysing structured and unstructured data collected from different sources, which may not be analyzable by traditional tools.
• MapReduce enables computational processing of data stored in a file system without first requiring the data to be loaded into a database.
Working of MapReduce
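To make the map and reduce phases concrete, here is a minimal word-count sketch (not from the slides; the input and output paths are assumptions). The mapper emits (word, 1) pairs and the reducer sums the counts for each word:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);        // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/sample/input"));    // assumed path
        FileOutputFormat.setOutputPath(job, new Path("/sample/output")); // assumed path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The class is packaged into a jar and submitted to the cluster with the hadoop jar command.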
Hadoop YARN

• The scheduler in the old version of Hadoop could not manage non-MapReduce jobs and could not optimize cluster utilization; hence YARN was introduced.
• YARN supports two major services: global resource management (ResourceManager) and per-application management (ApplicationMaster).
Why and what is Hbase?
• HBase is the part of Hadoop we use for an effective data set structure,
i.e., data on different nodes is stored and fetched for big data analysis.
• HBase is a column-oriented distributed database built on top of HDFS.
• HBase is used when you need real-time random read/write access to huge data sets.
• The canonical HBase example is the webtable: a table of crawled web pages and their properties, keyed by each web page's URL.
• It is a non-relational database suitable for distributed environments.
• It does not support SQL.
HBase is an open-source, multidimensional, distributed, scalable NoSQL database written in Java.
HBase Storage Mechanism

• It stores data in rows and columns, as in an RDBMS.
• The intersection of a row and a column is called a cell.
• An HBase table is associated with "versions": a timestamp uniquely identifies each version of a cell.
• A cell's value is an uninterpreted array of bytes.
• Table row keys are also byte arrays, so in principle anything can act as a row key.
• A table schema defines column families; the columns in a family share a common prefix and are stored as key-value pairs. For example, java:android and java:servlets are both members of the java family.
HBase
• Tables are automatically partitioned horizontally into regions.
• Each region consists of a subset of the table's rows.
• Regions have a default size of 256 MB, which can be configured.
• The columns of a column family for a region's rows are stored together in that region.
• Regions are the units that get spread over an HBase cluster.
• A table too big for any single server can be carried by a cluster of servers, with each node hosting a subset of the table's regions.
HBase persists data via the Hadoop file system API.
HBase in Operation - Programming with HBase

• HBase internally keeps special catalog tables called -ROOT- and .META.
• The -ROOT- table contains the list of .META. regions
• The .META. table contains the list of regions
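A minimal sketch of programming against HBase through its Java client. The table name "webtable" and column family "contents" are illustrative assumptions (the table is assumed to already exist), not values from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");   // clients locate regions via ZooKeeper

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Row key, column family:qualifier and value are all byte arrays.
            Put put = new Put(Bytes.toBytes("com.example.www"));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>page body</html>"));
            table.put(put);

            // Random read of the same cell by row key.
            Get get = new Get(Bytes.toBytes("com.example.www"));
            Result result = table.get(get);
            byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        }
    }
}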
Architecture of HBase
HBase is a column-oriented database, whereas a relational database is row-oriented.

HBase architecture has three main components: HMaster, Region Server and ZooKeeper.

HMaster: HMaster is the implementation of the master server in HBase. It is the process that assigns regions to region servers.
Region Server: HBase tables are divided horizontally by row-key range into regions, which are served by region servers.
ZooKeeper: acts as a coordinator in HBase. It provides services such as maintaining configuration information, naming, distributed synchronization, server-failure notification, etc. Clients locate region servers via ZooKeeper.
REST and Thrift interfaces

• REST and Thrift interfaces are used when the application is written in a language other than Java.
• In both cases, a Java server hosts an HBase client instance and serves the REST or Thrift requests for HBase storage.
• This extra work makes these interfaces slower than the Java client.
Mechanism of writing data to HDFS

1. The client creates a file by calling the create() function on DistributedFileSystem.
2. DistributedFileSystem makes an RPC call to the NameNode to create a new file in the file system's namespace.
3. As the client writes data, DFSOutputStream splits it into packets, which are placed in an internal queue called the data queue.
4. The first DataNode stores each packet and forwards it to the second DataNode, which stores it and advances it to the third and last DataNode in the pipeline.
5. The DataNodes send acknowledgements back along the pipeline.
6. When the client has finished writing data, it calls the close() function on the stream.
7. The NameNode is notified that the write is complete.
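A minimal client-side sketch of these steps (the output path is an illustrative assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Steps 1-2: create() asks the NameNode (via RPC) to add the file
        // to the namespace and returns an output stream.
        try (FSDataOutputStream out = fs.create(new Path("/sample/written.txt"))) {
            // Steps 3-5: bytes written here are split into packets and pushed
            // through the DataNode pipeline, which acknowledges each packet.
            out.writeUTF("Hello HDFS");
        } // Steps 6-7: close() flushes the remaining packets and signals completion.

        fs.close();
    }
}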
Comparison between HBase and HDFS

• HBase is a database similar to MySQL (but not SQL-based), while HDFS is a file system.
• HBase provides low-latency access, while HDFS provides high-latency operations.
• HBase supports random reads and writes, while HDFS follows a write-once, read-many model.
• HBase is accessed through shell commands, the Java API, or the REST, Avro or Thrift APIs, while HDFS is accessed through MapReduce jobs.
• HBase processes real-time data, while HDFS stores large datasets in a distributed environment and leverages batch processing.
Combining HBase and HDFS
• File descriptor shortage
• Not enough DataNode threads
• Bad blocks
• UI
• Schema design
• Row key
Features of HBase

• Consistency - consistent reads/writes
• Sharding
• High availability - scalable and fault-tolerant
• Supports the Java API
• Support for IT operations
• Hadoop integration
• Data replication
Hive

• Hive is a data warehousing layer created on top of the core elements of Hadoop.
• It exposes a simple SQL-like language called HiveQL for easy integration, with access implemented via mappers and reducers (a small JDBC sketch follows).
• Hive looks similar to a traditional database.
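A minimal sketch of running HiveQL from Java over JDBC. It assumes a HiveServer2 instance on localhost:10000 and a hypothetical table named sample_table; neither is defined in the slides:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // HiveQL looks like SQL but is executed via mappers and reducers underneath.
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) FROM sample_table GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}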
Pig and Pig Latin

• Pig is a data flow system for Hadoop. It uses the Pig Latin language to specify the data flow.
• Pig is an alternative to MapReduce programming.
• It abstracts away some details and allows you to focus on data processing (see the sketch below).
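A minimal sketch of the same data-flow idea using Pig's embedded Java API (PigServer). The input and output paths and the single-field layout are illustrative assumptions:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlowExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registered statement is one Pig Latin step in the data flow.
        pig.registerQuery("lines = LOAD '/sample/input' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        pig.store("counts", "/sample/pig-output");   // storing the alias triggers execution
        pig.shutdown();
    }
}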
Sqoop

• Sqoop is a tool that helps transfer data between Hadoop and relational databases.
• With the help of Sqoop, you can import data from an RDBMS into HDFS and vice versa.
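Sqoop is normally driven from the command line (sqoop import with --connect, --table and --target-dir options). As a hedged sketch only, the same import can be triggered from Java through the Sqoop 1.x entry point; the JDBC URL, credentials, table name and target directory below are all illustrative assumptions:

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://localhost:3306/testdb",   // assumed RDBMS
            "--username", "root",
            "--password", "secret",
            "--table", "employees",                              // assumed table
            "--target-dir", "/sample/employees"                  // HDFS destination
        };
        // Runs the same import the equivalent command line would run.
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}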
ZooKeeper: coordinates all the elements of distributed applications.

Flume: aids in transferring large amounts of data from distributed sources to a single centralized repository.

Oozie: used to manage and process submitted jobs. It is a workflow service that coordinates dependencies among the different jobs executing on Hadoop platforms such as HDFS, Pig and MapReduce.
