BDA Module 2-2023


By

Dr. Jagadamba G
Dept. of ISE, SIT, Tumakuru
Introduction to Hadoop

• Hadoop is not big data itself; however, it plays an integral part in big data.
• There are two versions of Hadoop: Hadoop 1.0 and Hadoop 2.0.
The Hadoop ecosystem is a comprehensive collection of tools and technologies that can be effectively implemented and deployed to provide Big Data solutions in a cost-effective manner.
Basic/Simple Hadoop System
Hadoop ecosystem elements at various stages
of data processing
Hadoop Distributed File System (HDFS)
Architecture
• Master-slave architecture
• The NameNode manages HDFS cluster metadata; DataNodes store the data
• Scalable distributed file system
• Data is distributed as blocks on the local disks of several nodes
Hadoop Distributed File System (HDFS)
Basic function of HDFS
• Manage data storage on the DataNodes
• DataNodes serve read and write requests from clients
• Block creation, deletion and replication operations are performed by DataNodes

• The default block size is 64 megabytes
• HDFS is designed to work well with large blocks
• A 10 GB file is therefore split into 10*1024/64 = 160 blocks
Hadoop Distributed File System (HDFS)
Data Replication: A client application does not need to track all the block replicas; HDFS directs the client to the nearest replica to ensure high performance.
Data Pipeline: A client application writes a block to the first DataNode in the
pipeline. Then this DataNode takes over and forwards the data to the next
node in the pipeline. This process continues for all the data blocks, and
subsequently all the replicas are written to the disk.
Read Operation in Hadoop
Write Operation in Hadoop
Command line interface
• We will view HDFS by interfacing with it from the command line (the command line is easy and popular)
• To run HDFS on a single machine, Hadoop is set up in pseudo-distributed mode
• In this setup HDFS runs on the local host and on the default HDFS port 8020, with fs.default.name set to hdfs://localhost/
• This property is used to figure out where the NameNode is running so that clients can connect to it
• User applications access the HDFS file system with the help of the HDFS client (a programmatic equivalent is sketched after the shell examples below)

Objective: To create a directory (say, sample) in HDFS.
Act: hadoop fs -mkdir /sample

Objective: To copy a file from the local file system to HDFS.
Act: hadoop fs -put /root/sample/test.txt /sample/test.txt

Objective: To copy a file from HDFS to the local file system.
Act: hadoop fs -get /sample/test.txt /root/sample/testsample.txt
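The same three operations can also be performed through the HDFS Java client mentioned above. A minimal sketch, assuming the Hadoop client libraries are on the classpath and the NameNode runs on localhost (the paths are the ones used in the shell examples):

// Programmatic equivalents of mkdir / put / get through the HDFS Java client.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS is the newer name of the fs.default.name property.
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/sample"));                              // hadoop fs -mkdir /sample
        fs.copyFromLocalFile(new Path("/root/sample/test.txt"),
                             new Path("/sample/test.txt"));          // hadoop fs -put ...
        fs.copyToLocalFile(new Path("/sample/test.txt"),
                           new Path("/root/sample/testsample.txt")); // hadoop fs -get ...
        fs.close();
    }
}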
Key features of HDFS

• Data Replication
• Data Resilience
• Data Integrity: ensured by maintaining transaction logs, validating checksums and creating data blocks (see the sketch after this list)
To provide flexibility and fault tolerance, HDFS also performs:
• Monitoring - through heartbeats
• Rebalancing - blocks are shifted to nodes where free space is available
• Metadata Replication
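The replication and checksum features listed above can also be observed from a client. A minimal sketch, assuming a file /sample/test.txt already exists in HDFS:

// Inspect the stored checksum and the per-file replication factor.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIntegrityExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/sample/test.txt");       // assumed to exist

        FileChecksum checksum = fs.getFileChecksum(file);   // checksum maintained by HDFS
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Checksum   : " + checksum);
        System.out.println("Replication: " + status.getReplication());

        fs.setReplication(file, (short) 3);                  // request 3 replicas for this file
        fs.close();
    }
}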
Introduction to MapReduce

• The algorithms developed and maintained by the Apache Hadoop project are implemented in the form of Hadoop MapReduce.
• MapReduce can be thought of as an engine that takes input data, processes it, generates the output and returns the required answers.
• MapReduce is based on a parallel programming framework.
• MapReduce facilitates processing and analysing structured and unstructured data collected from different sources, which may not be analyzable by traditional tools.
• MapReduce enables computational processing of data stored in a file system without first requiring the data to be loaded into a database.
Working of MapReduce
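To make the map and reduce phases concrete, here is a minimal word-count sketch (not from the slides; the input and output paths are assumptions). The mapper emits (word, 1) pairs and the reducer sums the counts for each word:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);        // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/sample/input"));    // assumed path
        FileOutputFormat.setOutputPath(job, new Path("/sample/output")); // assumed path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The class is packaged into a jar and submitted to the cluster with the hadoop jar command.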
Hadoop YARN

• The scheduler in the old version of Hadoop could not manage non-MapReduce jobs and could not optimize cluster utilization; hence YARN was introduced.
• YARN supports two major services: global resource management (ResourceManager) and per-application management (ApplicationMaster).
Why and what is Hbase?
• HBase is the part of Hadoop we use for an effective data set structure,
i.e., data on different nodes is stored and fetched for big data analysis.
• HBase is a column-oriented distributed database built on top of HDFS.
• HBase is used when you need real-time random read/write access to huge data sets.
• The canonical HBase example is the webtable: a table of crawled web pages and their properties, keyed by each web page's URL.
• It is a non-relational database suitable for distributed environments.
• It does not support SQL.
HBase is an open-source, multidimensional, distributed, scalable NoSQL database written in Java.
HBase Storage Mechanism

• It stores data in rows and columns, as in an RDBMS.
• The intersection of a row and a column is called a cell.
• An HBase table is associated with "versions": a timestamp uniquely identifies each version of a cell.
• A cell's value is an uninterpreted array of bytes.
• Table row keys are also byte arrays, so in principle anything can act as a row key.
• A table schema defines column families; the columns in a family share a common prefix and are stored as key-value pairs. For example, java:android and java:servlets are both members of the java family.
HBase
• Tables are automatically partitioned horizontally into regions.
• Each region consists of a subset of the table's rows.
• Regions have a default size of 256 MB, which can be configured.
• The columns of a column family for a region's rows are stored together in that region.
• Regions are the units that get spread over an HBase cluster.
• A table too big for any single server can be carried by a cluster of servers, with each node hosting a subset of the table's regions.
HBase persists data via the Hadoop file system API.
HBase in Operation - Programming with HBase

• HBase internally keeps special catalog tables called -ROOT- and .META.
• The -ROOT- table contains the list of .META. regions
• The .META. table contains the list of regions
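A minimal sketch of programming against HBase through its Java client. The table name "webtable" and column family "contents" are illustrative assumptions (the table is assumed to already exist), not values from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");   // clients locate regions via ZooKeeper

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Row key, column family:qualifier and value are all byte arrays.
            Put put = new Put(Bytes.toBytes("com.example.www"));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>page body</html>"));
            table.put(put);

            // Random read of the same cell by row key.
            Get get = new Get(Bytes.toBytes("com.example.www"));
            Result result = table.get(get);
            byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        }
    }
}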
Architecture of HBase
HBase is a column-oriented database, whereas a relational database is row-oriented.

HBase architecture has three main components: HMaster, Region Server and ZooKeeper.

HMaster: HMaster is the implementation of the master server in HBase. It is the process that assigns regions to region servers.
Region Server: HBase tables are divided horizontally by row-key range into regions, which are served by region servers.
ZooKeeper: acts as a coordinator in HBase. It provides services such as maintaining configuration information, naming, distributed synchronization, server-failure notification, etc. Clients locate region servers via ZooKeeper.
REST and Thrift interfaces

• REST and Thrift interfaces are used when the application is written in a language other than Java.
• In both cases, a Java server hosts an HBase client instance and serves the REST or Thrift requests for HBase storage.
• This extra work makes these interfaces slower than the Java client.
Mechanism of writing data to HDFS

1. The client creates a file by calling the create() function on DistributedFileSystem.
2. DistributedFileSystem makes an RPC call to the NameNode to create a new file in the file system's namespace.
3. As the client writes data, DFSOutputStream splits it into packets, which are placed in an internal queue called the data queue.
4. The first DataNode stores each packet and forwards it to the second DataNode, which stores it and advances it to the third and last DataNode in the pipeline.
5. The DataNodes send acknowledgements back along the pipeline.
6. When the client has finished writing data, it calls the close() function on the stream.
7. The NameNode is notified that the write is complete.
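A minimal client-side sketch of these steps (the output path is an illustrative assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Steps 1-2: create() asks the NameNode (via RPC) to add the file
        // to the namespace and returns an output stream.
        try (FSDataOutputStream out = fs.create(new Path("/sample/written.txt"))) {
            // Steps 3-5: bytes written here are split into packets and pushed
            // through the DataNode pipeline, which acknowledges each packet.
            out.writeUTF("Hello HDFS");
        } // Steps 6-7: close() flushes the remaining packets and signals completion.

        fs.close();
    }
}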
Comparison between HBase and HDFS

• HBase is a database similar to MySQL (but not SQL-based), while HDFS is a file system.
• HBase provides low-latency access, while HDFS provides high-latency operations.
• HBase supports random reads and writes, while HDFS follows a write-once, read-many model.
• HBase is accessed through shell commands, the Java API, or the REST, Avro or Thrift APIs, while HDFS is accessed through MapReduce jobs.
• HBase processes real-time data, while HDFS stores large datasets in a distributed environment and leverages batch processing.
Combining HBase and HDFS
• File descriptor shortage
• Not enough DataNode threads
• Bad blocks
• UI
• Schema design
• Row key
Features of HBase

• Consistency - consistent reads/writes
• Sharding
• High availability - scalable and fault-tolerant
• Supports the Java API
• Support for IT operations
• Hadoop integration
• Data replication
Hive

• Hive is a data warehousing layer created on top of the core elements of Hadoop.
• It exposes a simple SQL-like language called HiveQL for easy integration, with access implemented via mappers and reducers (a small JDBC sketch follows).
• Hive looks similar to a traditional database.
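A minimal sketch of running HiveQL from Java over JDBC. It assumes a HiveServer2 instance on localhost:10000 and a hypothetical table named sample_table; neither is defined in the slides:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // HiveQL looks like SQL but is executed via mappers and reducers underneath.
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) FROM sample_table GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}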
Pig and Pig Latin

• Pig is a data flow system for Hadoop. It uses the Pig Latin language to specify the data flow.
• Pig is an alternative to MapReduce programming.
• It abstracts away some details and allows you to focus on data processing (see the sketch below).
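A minimal sketch of the same data-flow idea using Pig's embedded Java API (PigServer). The input and output paths and the single-field layout are illustrative assumptions:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlowExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registered statement is one Pig Latin step in the data flow.
        pig.registerQuery("lines = LOAD '/sample/input' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        pig.store("counts", "/sample/pig-output");   // storing the alias triggers execution
        pig.shutdown();
    }
}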
Sqoop

• Sqoop is a tool that helps transfer data between Hadoop and relational databases.
• With the help of Sqoop, you can import data from an RDBMS into HDFS and vice versa.
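Sqoop is normally driven from the command line (sqoop import with --connect, --table and --target-dir options). As a hedged sketch only, the same import can be triggered from Java through the Sqoop 1.x entry point; the JDBC URL, credentials, table name and target directory below are all illustrative assumptions:

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://localhost:3306/testdb",   // assumed RDBMS
            "--username", "root",
            "--password", "secret",
            "--table", "employees",                              // assumed table
            "--target-dir", "/sample/employees"                  // HDFS destination
        };
        // Runs the same import the equivalent command line would run.
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}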
ZooKeeper: coordinates all the elements of distributed applications.

Flume: aids in transferring large amounts of data from distributed sources to a single centralized repository.

Oozie: used to manage and process submitted jobs. It is a workflow service that coordinates dependencies among the different jobs executing on Hadoop platforms such as HDFS, Pig and MapReduce.
