BDA Module 2-2023
Dr. Jagadamba G
Dept. of ISE, SIT, Tumakuru
Introduction to Hadoop
• Data Replication
• Data Resilience
• Data Integrity: ensured by maintaining transaction logs, validating
checksums, and creating data blocks
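The checksum validation mentioned above can be sketched as follows. This is a minimal illustration, not HDFS code: `zlib.crc32` and the 512-byte chunk size are assumptions chosen for the example, and the function names are hypothetical.

```python
import zlib

CHUNK_SIZE = 512  # bytes covered by one checksum (assumed value for illustration)


def compute_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Return one CRC32 value per fixed-size chunk of the data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data: bytes, expected: list[int],
                        chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Return the indexes of chunks whose checksum no longer matches."""
    actual = compute_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


block = bytes(range(256)) * 8        # 2048 bytes, i.e. 4 chunks
stored = compute_checksums(block)    # saved when the block is written

damaged = bytearray(block)
damaged[600] ^= 0xFF                 # corrupt one byte inside chunk 1

print(find_corrupt_chunks(block, stored))           # [] (intact copy)
print(find_corrupt_chunks(bytes(damaged), stored))  # [1] (chunk 1 is corrupt)
```

When a reader detects a mismatch like this, it can fetch the block from another replica instead of returning corrupt data.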
To provide flexibility and fault tolerance in
HDFS
• Monitoring: DataNodes send periodic heartbeats to the NameNode
• Rebalancing: blocks are moved between nodes when free space is available
• Metadata Replication
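Heartbeat-based monitoring can be sketched as a toy model. The class name, node names, and the 30-second timeout are hypothetical values for illustration only, and time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Toy model of a master node tracking heartbeats from worker nodes."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> time of the last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a node reported in at time `now`."""
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> list[str]:
        """Return nodes that have been silent for longer than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30.0)
monitor.heartbeat("datanode-1", now=0.0)
monitor.heartbeat("datanode-2", now=0.0)
monitor.heartbeat("datanode-1", now=25.0)  # datanode-2 stays silent

print(monitor.dead_nodes(now=40.0))        # ['datanode-2']
```

Once a node is considered dead, the blocks it held have fewer live replicas than required, which is what triggers re-replication on the remaining nodes.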
Introduction to MapReduce
• The scheduler in the old version of Hadoop could not manage
non-MapReduce jobs and could not optimize cluster utilization, hence YARN
was introduced.
• YARN supports 2 major services: global resource management
(ResourceManager) and per-application management (ApplicationMaster)
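The split between the two services can be illustrated with a toy model. This is not the YARN API: the classes below only borrow the names of the two roles, and the job names and container counts are made up for the example.

```python
class ResourceManager:
    """Toy global manager: tracks free containers for the whole cluster."""

    def __init__(self, total_containers: int):
        self.free = total_containers

    def allocate(self, requested: int) -> int:
        """Grant as many containers as are free, up to the request."""
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count: int) -> None:
        self.free += count


class ApplicationMaster:
    """Toy per-application manager: negotiates containers for one job."""

    def __init__(self, name: str, needed: int):
        self.name = name
        self.needed = needed
        self.held = 0

    def request(self, rm: ResourceManager) -> int:
        """Ask for the containers still missing; return how many are held."""
        self.held += rm.allocate(self.needed - self.held)
        return self.held

    def finish(self, rm: ResourceManager) -> None:
        """Give all held containers back to the cluster."""
        rm.release(self.held)
        self.held = 0


rm = ResourceManager(total_containers=10)
mapreduce_job = ApplicationMaster("mapreduce-job", needed=6)
graph_job = ApplicationMaster("graph-job", needed=6)  # a non-MapReduce job

print(mapreduce_job.request(rm))  # 6
print(graph_job.request(rm))      # 4 (only 4 containers were left)
mapreduce_job.finish(rm)
print(graph_job.request(rm))      # 6 (the remaining 2 are now granted)
```

The point of the design is that the ResourceManager knows nothing about the kind of job: any framework that supplies its own ApplicationMaster can share the same cluster.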
Why and what is HBase?
• HBase is part of the Hadoop ecosystem and is used to give large
data sets an effective structure.
• i.e., data is stored across different nodes and fetched from them for
big data analysis.
• HBase is a column-oriented distributed database built on top
of HDFS.
• HBase is used when you need real-time continuous read/write
access to huge data.
• The standard HBase use case is the webtable: a table of crawled web
pages and their properties, keyed by the web page URL.
• It is a non-relational database suitable for a distributed environment.
• Does not support SQL.
HBase is an open source, multidimensional, distributed, scalable
NoSQL database written in Java.
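The "multidimensional" data model can be sketched as a nested map: row key, then column (family:qualifier), then timestamped versions. The class below is a plain-Python toy, not the HBase client API, and the row keys and column names are illustrative webtable-style values.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of a table: row key -> column -> timestamped versions."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row: str, column: str, value: str, timestamp: int) -> None:
        self.rows[row][column].append((timestamp, value))

    def get(self, row: str, column: str):
        """Return the newest value of a cell, or None if it does not exist."""
        versions = self.rows.get(row, {}).get(column, [])
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self) -> list[str]:
        """Return row keys in sorted order, as a range scan would see them."""
        return sorted(self.rows)


table = ToyTable()
table.put("com.example.www/index.html", "contents:html", "<html>v1</html>", 1)
table.put("com.example.www/index.html", "contents:html", "<html>v2</html>", 2)
table.put("com.example.www/index.html", "anchor:example.org", "Example", 1)
table.put("com.example.blog/post", "contents:html", "<html>post</html>", 1)

print(table.get("com.example.www/index.html", "contents:html"))  # <html>v2</html>
print(table.get("com.example.www/index.html", "contents:css"))   # None
print(table.scan())
```

Reads and writes go through the row key rather than through SQL queries, which is why the choice of row key (here a URL) matters so much for access patterns.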
HBase Storage Mechanism
• Consistency: consistent reads/writes
• Sharding
• High availability: scalable and fault tolerant
• Supports Java API
• Support for IT operations
• Hadoop integration
• Data Replication
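Sharding by row-key range can be sketched as a lookup over sorted start keys. The start keys and server names below are hypothetical, and the lookup is a simplified stand-in for how a client would find the shard that serves a given row.

```python
import bisect

# Each shard serves row keys from its start key (inclusive) up to the
# start key of the next shard (exclusive). Values are made up.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["server-a", "server-b", "server-a", "server-c"]


def locate_region(row_key: str) -> tuple[int, str]:
    """Return (shard index, server name) responsible for the row key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]


print(locate_region("apple"))  # (0, 'server-a')
print(locate_region("hbase"))  # (1, 'server-b')
print(locate_region("n"))      # (2, 'server-a')
print(locate_region("zebra"))  # (3, 'server-c')
```

Because rows are kept sorted by key, a shard that grows too large can be split at a middle key by adding one more start key to the list.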
Hive