
Module 4

Overview of Big data


Structuring Big Data, Elements of Big Data, Big Data Analytics –
Introducing Technologies for Handling Big Data : Hadoop, Cloud
computing and Big data, In-memory computing
Answers from computers - Recommendation systems
• Recommender systems process all the information related to users' online activity: their
preferences, their interests, the things they purchase, the content they consume… in order
to show them personalized advertising or recommendations on specific news or products.
• A recommender system is a system which predicts ratings a user might give to a specific
item.
• They are used by well-known companies such as Google, Instagram, Spotify, Amazon, and
Netflix.
• A recommender system, or a recommendation system, is a subclass of information filtering
systems that provides suggestions for the items most pertinent to a particular user.
• Netflix awarded a $1 million prize to a developer team in 2009 for an algorithm that
increased the accuracy of the company’s recommendation engine by 10 percent.
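To make the rating-prediction idea concrete, below is a minimal, illustrative sketch in plain Java of user-based collaborative filtering: the predicted rating is a similarity-weighted average of other users' ratings for the item. The users, items, and ratings are invented sample data, and real recommenders are far more sophisticated than this.

import java.util.HashMap;
import java.util.Map;

public class SimpleRecommender {

    // Cosine similarity between two users' rating vectors, over shared items only.
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb != null) {
                dot   += e.getValue() * rb;
                normA += e.getValue() * e.getValue();
                normB += rb * rb;
            }
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Predict the rating `user` would give `item` as a similarity-weighted
    // average of the ratings that other users gave to that item.
    static double predict(Map<String, Map<String, Double>> ratings, String user, String item) {
        double weighted = 0, totalSim = 0;
        for (Map.Entry<String, Map<String, Double>> other : ratings.entrySet()) {
            if (other.getKey().equals(user) || !other.getValue().containsKey(item)) continue;
            double sim = similarity(ratings.get(user), other.getValue());
            weighted += sim * other.getValue().get(item);
            totalSim += Math.abs(sim);
        }
        return totalSim == 0 ? 0 : weighted / totalSim;
    }

    public static void main(String[] args) {
        // Hypothetical ratings: user -> (item -> rating on a 1-5 scale).
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("alice", Map.of("movieA", 5.0, "movieB", 3.0));
        ratings.put("bob",   Map.of("movieA", 4.0, "movieB", 2.0, "movieC", 4.5));
        ratings.put("carol", Map.of("movieA", 1.0, "movieC", 2.0));

        System.out.printf("Predicted rating of movieC for alice: %.2f%n",
                predict(ratings, "alice", "movieC"));
    }
}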
Structuring of Big data
• The purpose of structuring is to partially or fully automate the process of
data classification and categorization in order to save time and effort
and to make it easier to sift through large volumes of data without
the need for extensive human intervention.
• A recommendation system can analyze and structure a large amount of
data specifically for you on the basis of what you searched for, what you
looked at, and for how long.
• Scanning and presenting customized information according to your
behaviour and habits is, in effect, structuring data.
• Eg: an online shopping site's customized recommendation set for
customers.
• Data comes from multiple sources: databases, ERP systems, weblogs,
chat history, GPS maps, etc.
• It varies in format and needs to be made consistent and clear before
analysis.
Sources of data:
• Internal sources (organizational/enterprise data)
• External sources (social data)
Whatever the source may be, it usually provides structured, unstructured,
and semi-structured data; unstructured data is the biggest in volume.
Structured data vs. unstructured data

Data from:
• Files
• Social media
• Websites
• Satellites

Data of different formats:
• Emails
• Text
• Audio
• Video
• Images

Data from sources like:
• Documents, logs, feedback
• Emails
• Social media
• Mobile data – text data and location information
Semi-structured data (schema-less or self-describing structure):
• Does not follow the proper structure of data models.
• Is stored inconsistently in the rows and columns of a database.
Elements of Big data
• Volume
• Velocity
• Variety
• Veracity
Big data Analytics
• Businesses use analytics to explore and examine their data and
then transform their findings into insights that ultimately help
executives, managers and operational employees make better,
more informed business decisions.
Three key types of analytics businesses use are
• Descriptive analytics, what has happened in a business;
• Predictive analytics, what could happen;
• Prescriptive analytics, what should happen.
• Descriptive analytics is a commonly used form of data analysis whereby
historical data is collected, organized and then presented in a way that is
easily understood.
• It uses simple mathematical and statistical tools, such as arithmetic, averages, and
percent changes, rather than complex calculations (a small worked sketch follows the
examples below).
• Visual tools such as line graphs and pie and bar charts are used to present
findings, which can then be easily understood by a wide business audience.
• Descriptive analytics helps organizations measure performance to ensure
goals and targets are being met, and it can identify areas that require
improvement or change.
• Examples
Summarizing past events such as sales and operations data or marketing
campaigns
Social media usage and engagement data such as Instagram or Facebook
likes
Reporting general trends
Collating survey results
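As a small, self-contained illustration of the arithmetic behind descriptive analytics (not taken from the slides), the Java sketch below summarizes invented monthly sales figures using an average and a percent change.

public class DescriptiveStats {
    public static void main(String[] args) {
        // Hypothetical monthly sales figures for January to April.
        double[] monthlySales = {12000, 15000, 13500, 18000};

        // Average of all months.
        double total = 0;
        for (double s : monthlySales) total += s;
        double average = total / monthlySales.length;

        // Percent change from the first month to the last month.
        double first = monthlySales[0];
        double last  = monthlySales[monthlySales.length - 1];
        double percentChange = (last - first) / first * 100;

        System.out.printf("Average monthly sales: %.2f%n", average);
        System.out.printf("Percent change (first to last month): %.1f%%%n", percentChange);
    }
}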
• Predictive analytics is focused on predicting and understanding
what could happen in the future.
• Analyzing past data patterns and trends by looking at historical data
and customer insights can predict what might happen going
forward.
• Predictive analytics is based on probabilities and uses a variety of
techniques, such as data mining, statistical modelling and machine
learning algorithms; these are more complex than the calculations used
in descriptive analytics.

• E-commerce – predicting customer preferences and recommending
products to customers based on past purchases and search history
• Sales – predicting the likelihood that customers will purchase
another product or leave the store
• Human resources – detecting if employees are thinking of quitting
and then persuading them to stay
• IT security – identifying possible security breaches that require
further investigation
• Healthcare – predicting staff and resource needs
• Prescriptive analytics takes what has been learned through
descriptive and predictive analysis and goes a step further by
recommending the best possible courses of action for a
business.
• If descriptive analytics tells you what has happened and
predictive analytics tells you what could happen, then
prescriptive analytics tells you what should be done.
• It helps businesses make the best possible decisions based on the data
available to them.
• It also helps to avoid risks.
Advantages of Big data Analytics(Areas)
• Transportation
• Education
• Travel
• Government
• Healthcare
• Telecom
• Industry
• Aviation
CLOUD COMPUTING AND BIG DATA

Cloud computing is the delivery of computing services – servers, storage, databases,
networking, software, analytics and more – over the Internet ("the cloud").
Companies offering these computing services are called cloud providers and typically charge
for cloud computing services based on usage, similar to how you are billed for water or
electricity at home.
CLOUD COMPUTING AND BIG DATA
There are three main cloud service models: Software as a Service (SaaS), Platform as a Service (PaaS),
and Infrastructure as a Service (IaaS).

[Figure: laptops, desktops, and mobiles/PDAs connect through an Internet provider to the cloud, which delivers SaaS, PaaS, and IaaS.]
FEATURES OF CLOUD COMPUTING

1. Scalability –
Addition of new resources to an existing infrastructure.
• Data storage capacity, processing power and networking can all be
scaled using existing cloud computing infrastructure. Better yet, scaling
can be done quickly and easily
• A system’s scalability refers to its ability to increase workload with
existing hardware resources.
2. Elasticity –
• Elasticity refers to a system’s ability to grow or shrink dynamically
in response to changing workload demands
- no extra payment is required for acquiring specific cloud services.
- A cloud does not require customers to declare their resource
requirements in advance.
- When demand unexpectedly surges, properly configured cloud
applications and services instantly and automatically add
resources to handle the load. When the demand abates,
services return to original resource levels.
3. Resource Pooling

Resource pooling means that a cloud service provider can share resources
among several clients, providing everyone with a different set of services as
per their requirements.
Multiple organizations, which use similar kinds of resources to carry out
computing practices, have no need to individually hire all the resources.
Resources provided by the providers are shared by multiple unrelated
customers.
Pooling resources on the software level means that a consumer is not the
only one using the software.
4. Self Service – Cloud computing involves a simple user interface that
helps customers to directly access the cloud services they want.
It makes getting the resources you need very quick and easy.
In on-demand self service, the user accesses cloud services through an
online control panel.

5. Low Cost
Cloud offers customized solutions, especially to organizations that cannot
afford too much initial investment. Cloud provides a pay-as-you-use option,
in which organizations need to sign up only for those resources that are
essential.
6. Fault Tolerance & Data security – offering uninterrupted services to
customers.
• Data security is one of the best characteristics of Cloud Computing. Cloud
services create a copy of the data that is stored to prevent any form of data
loss. If one server loses the data by any chance, the copy version is restored
from the other server.

7. Resilience
Resilience in cloud computing means the ability of the service to quickly
recover from any disruption. A cloud's resilience is measured by how fast its
servers, databases, and network systems restart and recover from any kind
of harm or damage.
Amazon CloudWatch, for example, provides a monitoring system that can also
estimate Amazon Web Services charges.
CLOUD DEPLOYMENT MODELS

• A cloud deployment model represents a specific type of cloud
environment, primarily distinguished by ownership, size, and access.
• Cloud deployment models indicate how the cloud services are made
available to users.
• There are four common cloud deployment models:

▪ Public Cloud
▪ Private Cloud
▪ Community Cloud
▪ Hybrid Cloud
Public Cloud (End-User Level Cloud)
• As the name suggests, this type of cloud deployment model supports all
users who want to make use of a computing resource, such as hardware (OS,
CPU, memory, storage) or software (application server, database) on a
subscription basis. Most common uses of public clouds are for application
development and testing, non-mission-critical tasks such as file-sharing, and
e-mail service.
- Eg: Verizon, Amazon Web Services, and Rackspace.
[Fig: Level of accessibility in a public cloud – companies X, Y, and Z all access shared public cloud services (IaaS/PaaS/SaaS).]
Private Cloud (Enterprise Level Cloud)
A private cloud is typically infrastructure used by a single organization. Such
infrastructure may be managed by the organization itself to support various
user groups, or it could be managed by a service provider that takes care of it
either on-site or off-site.
Private clouds are more expensive than public clouds due to the capital
expenditure involved in acquiring and maintaining them. However, private
clouds are better able to address the security and privacy concerns of
organizations today.
- Remains entirely in the ownership of the organization using it.
[Fig: Level of accessibility in a private cloud – cloud services are accessible only within the owning organization.]


Community Cloud
Refers to a shared cloud computing service environment that is
targeted to a limited set of organizations or employees (such as
banks or heads of trading firms).
Examples include universities cooperating in certain areas of
research, or police departments within a county or Government
offices sharing computing resources.
Access to a community cloud environment is typically restricted to
the members of the community.
Managed by third party cloud services.
-Available on or off premises.
Hybrid Cloud
• In a hybrid cloud, an organization makes use of interconnected private and
public cloud infrastructure.
• For example, if an online retailer needs more computing resources to run its
Web applications during the holiday season it may attain those resources via
public clouds.
• In hybrid clouds, an organization can use both types of cloud, i.e. public
and private, together – for example in cloud-bursting situations.
CLOUD SERVICES FOR BIG DATA

• Infrastructure as a Service – IaaS provides infrastructure such as servers,
virtual machines, networks, operating systems, storage, and much more on a
pay-as-you-use basis. IaaS providers offer VMs from small to extra-large machines.
• Platform As a Service. – PaaS . The PaaS model is similar to IaaS, but it also provides
the additional tools such as database management system, business intelligence
services, and so on.
PaaS provides a platform for software developers to build their applications
• Software As a Service. SaaS. SaaS platforms make software available to users over
the internet, usually for a monthly subscription fee. With SaaS, you don’t need to
install and run software applications on your computer
IN-MEMORY COMPUTING TECHNOLOGY

• Another way to improve the speed and processing power of data handling.


• In-memory computing (IMC) stores data in RAM rather than in databases hosted on disks.
• In-memory computing means using a type of middleware software that allows one to store data in RAM,
across a cluster of computers, and process it in parallel.
• In-memory computing involves an architecture where the data is kept inside the memory of the computer rather than on their
hard disks.
• By keeping the detailed data in the main memory, this model speeds up data crunching and meets diverse information and
analytics requirements faster.
• In-memory computing is the storage of information in the main random access memory (RAM) of
dedicated servers rather than in complicated relational databases operating on comparatively
slow disk drives.
• In-memory computing helps business customers, including retailers, banks and utilities, to
quickly detect patterns, analyze massive data volumes on the fly, and perform their operations
quickly.
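Apache Spark, which appears later in the Hadoop ecosystem list as the in-memory data processing component, is one common way to apply this idea. The sketch below uses Spark's Java API to cache a dataset in RAM so that the second pass over the data is served from memory rather than disk; the application name, master setting, HDFS path, and the "ERROR" filter are placeholders chosen for this example.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class InMemoryDemo {
    public static void main(String[] args) {
        // Local master and input path are placeholder values for this sketch.
        SparkConf conf = new SparkConf().setAppName("in-memory-demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> logs = sc.textFile("hdfs:///data/weblogs.txt");

        // Ask Spark to keep this dataset in RAM after it is first computed.
        logs.cache();

        long total  = logs.count();                                        // first action: reads from storage, then caches
        long errors = logs.filter(line -> line.contains("ERROR")).count(); // second action: served from memory

        System.out.println("total lines = " + total + ", error lines = " + errors);
        sc.stop();
    }
}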
Module 4
Hadoop Ecosystem
• Hadoop is a framework of tools.
• Open source software framework of tools designed for storage and processing of
large scale data on clusters of commodity hardware.
• Created by Doug Cutting and Mike Cafarella in 2005.
• Hadoop handles a variety of workloads, including search, log processing,
recommendation systems, data warehousing, and video/image analysis.
Hadoop Approach
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different CPU nodes.
The Hadoop framework makes it possible to develop applications that run on
clusters of computers and perform complete statistical analysis of huge
amounts of data.
Hadoop Ecosystem
• Hadoop Ecosystem is neither a programming language nor a service, it is a
platform or framework which solves big data problems.
• You can consider it as a suite which encompasses a number of services
(ingesting, storing, analyzing and maintaining) inside it.
• Hadoop Ecosystem is a platform or a suite which provides various services to
solve the big data problems.
• The core components of Hadoop are
HDFS & MapReduce
Hadoop also provides various tools or solutions to supplement or support
these major elements.
All these tools work collectively to provide services such as absorption,
analysis, storage and maintenance of data etc.
Following are some of the components that collectively form a Hadoop
ecosystem:

• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
HDFS - Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is the underlying file system of a
Hadoop cluster. It provides scalable, fault-tolerant, rack-aware data storage
designed to be deployed on commodity hardware.
• HDFS is the primary or major component of Hadoop ecosystem and is responsible
for storing large data sets of structured, semi structured or unstructured data
across various nodes.
Blocks
• Data in a Hadoop cluster is broken down into blocks of predefined size and
distributed throughout the cluster.
• Each block is duplicated twice (for a total of three copies), with the replicas stored
on different nodes in a rack somewhere else in the cluster.
• Since the data has a default replication factor of three, it is highly available and
fault-tolerant. If a copy is lost (because of machine failure, for example), HDFS
will automatically re-replicate it elsewhere in the cluster, ensuring that the
threefold replication factor is maintained.
• HDFS is a distributed filesystem that runs on large clusters of commodity
machines.
❖ Files are split into blocks
❖ Blocks are split across many machines at load time
❖ Different blocks from the same file will be stored on different
machines
❖ Blocks are replicated across multiple machines
❖ The NameNode keeps track of which blocks make up a file and where they
are stored.
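The sketch below (not part of the original slides) shows the client side of this flow using the Hadoop FileSystem Java API: the client simply writes and reads a byte stream, while block splitting, placement, and replication are handled by HDFS itself. The NameNode address and the file path are placeholder values.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; block size and replication factor come from the cluster configuration.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");

        // Write: HDFS transparently splits the stream into blocks and replicates them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back; the client is directed to DataNodes holding the blocks.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[64];
            int n = in.read(buffer);
            System.out.println(new String(buffer, 0, n, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}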
Name nodes & Slave nodes
• Apache Hadoop HDFS architecture follows a Master/Slave
architecture, where a cluster comprises a single NameNode
(Master node) and all the other nodes are DataNodes (Slave nodes).
• The NameNode is the prime node which contains metadata (data about
data), requiring comparatively fewer resources than the DataNodes
that store the actual data.
• These DataNodes are commodity hardware in the distributed
environment, which undoubtedly makes Hadoop cost effective. The data
resides on DataNodes only.
Functions of Name Nodes
• NameNode is the centerpiece of the Hadoop Distributed File System. It maintains and manages the file system
namespace and provides the right access permission to the clients.
• The NameNode stores information about block locations, permissions, etc. on its local disk in the form of two
files:
• Fsimage: Fsimage stands for File System image. It contains the complete namespace of the Hadoop file system
since the NameNode creation.
• Edit log: It contains all the recent changes performed to the file system namespace to the most recent Fsimage.
• Functions of HDFS NameNode
• It executes the file system namespace operations like opening, renaming, and closing files and directories.
• NameNode manages and maintains the DataNodes.
• It determines the mapping of blocks of a file to DataNodes.
• NameNode records each change made to the file system namespace.
• It keeps the locations of each block of a file.
• NameNode takes care of the replication factor of all the blocks.
• NameNode receives heartbeat and block reports from all DataNodes that ensure DataNode is alive.
• If the DataNode fails, the NameNode chooses new DataNodes for new replicas.
Functions of Data Nodes
• These are slave daemons or processes which run on each slave machine.
• The actual business data is stored on DataNodes.
• The DataNodes perform the low-level read and write requests from the file
system’s clients.
• They send heartbeats to the NameNode periodically to report the overall
health of HDFS, by default, this frequency is set to 3 seconds.
• Data in HDFS is scattered across the DataNodes as blocks.
• This is the actual worker node where read/write/data processing is handled.
• Upon instruction from the Master, it performs creation/replication/deletion of
data blocks.
• As all the business data is stored on DataNodes, a huge amount of
storage is required for their operation. Commodity hardware can be used for
hosting DataNodes.
Secondary Name node
• The Secondary NameNode works concurrently with the primary Name Node as a helper
daemon.
• It is responsible for combining the EditLogs with FsImage from the NameNode.
• It downloads the EditLogs from the NameNode at regular intervals and applies to FsImage.
The new FsImage is copied back to the NameNode.
• Secondary NameNode performs regular checkpoints in HDFS. Therefore, it is also called
CheckpointNode.
• When the NameNode starts, the NameNode merges the Fsimage and edit logs file.
• Since the NameNode runs continuously for a long time without any restart, the size of
edit logs becomes too large. This will result in a long restart time for NameNode.
• Secondary NameNode solves this issue.
• Secondary NameNode downloads the Fsimage file and edit logs file from NameNode.
• It periodically applies edit logs to Fsimage and refreshes the edit logs.
• The updated Fsimage is then sent to the NameNode so that NameNode doesn’t have to
re-apply the edit log records during its restart.
• This keeps the edit log size small and reduces the NameNode restart time.
• If the NameNode fails, the last saved Fsimage on the secondary NameNode can be used to
recover the file system metadata.
• The secondary NameNode performs regular checkpoints in HDFS.
Hadoop High Availability & NameNode High Availability
architecture
• The High availability feature makes the files in HDFS accessible even in
unfavorable conditions such as NameNode failure or DataNode failure.
• Hadoop HA: In the HDFS cluster, after a definite interval of time, all the DataNodes send
heartbeat messages to the NameNode. If the NameNode stops receiving heartbeat
messages from any of these DataNodes, it assumes that node to be dead.
• After that, it checks the data present on that node and instructs other
DataNodes to create replicas of that data. Therefore the data is always
available.
• When a client asks for data access in HDFS, the NameNode first searches for the
DataNodes in which the data is most quickly available.
• The HDFS NameNode HA:

• This feature enables running redundant NameNodes (normally 2) in the same cluster in
an Active/Passive configuration. This eliminates the NameNode as a potential single
point of failure (SPOF) in an HDFS cluster.
• Active NameNode – It handles all client operations in the cluster.
• Passive NameNode – It is a standby namenode, which has similar data as active
NameNode. It acts as a slave, maintains enough state to provide a fast failover, if
necessary.
• If Active NameNode fails, then passive NameNode takes all the responsibility of
active node and cluster continues to work.
Rack Awareness of HDFS
• Concept to choose a nearby data node (closest to the client which has
raised the Read/Write request), thereby reducing the network traffic.
• The Rack is the collection of around 40-50 DataNodes connected using the
same network switch. If the network goes down, the whole rack will be
unavailable. A large Hadoop cluster is deployed in multiple racks.
• In a large Hadoop cluster, there are multiple racks. Each rack consists of
DataNodes.
• Communication between the DataNodes on the same rack is more efficient
as compared to the communication between DataNodes residing on
different racks.
• NameNode maintains rack ids of each DataNode to achieve this rack
information. This concept of choosing the closest DataNode based on the
rack information is known as Rack Awareness.
Key features of HDFS
Data Replication
Distributed Storage
High Availability
Data Integrity – as files are divided into blocks and replicated on many
machines, there is a chance of data corruption arising from even a
minute variation.
Validating checksums – an error detection technique.
(The same content appears in Module 5 also.)
MapReduce
• MapReduce is a Programming model/Programming paradigm/Processing
technique/Computational component within the Hadoop framework based on
Java.
• Used to access and process the big data stored in the Hadoop File System (HDFS)
• It is a software framework for easily writing applications that process the vast
amount of structured and unstructured data stored in the HDFS.
• Initially used by Google for analyzing its search results, MapReduce gained massive
popularity due to its ability to split and process terabytes of data in parallel,
achieving quicker results.
• With MapReduce, rather than sending data to where the application or logic
resides, the logic is executed on the server where the data already resides.
• A MapReduce job usually splits the input dataset and then processes each split
independently, using Map tasks, in a completely parallel manner.
Challenges associated with the traditional approach (manually splitting data across machines):
• Critical path problem: It is the amount of time taken to finish the job without delaying the
next milestone or actual completion date. So, if, any of the machines delay the job, the
whole work gets delayed.
• Reliability problem: What if, any of the machines which are working with a part of data
fails? The management of this failover becomes a challenge.
• Equal split issue: How will I divide the data into smaller chunks so that each machine gets
an even part of the data to work with? In other words, how do we divide the data equally so
that no individual machine is overloaded or underutilized?
• Single split may fail: If any of the machines fail to provide the output, I will not be able to
calculate the result. So, there should be a mechanism to ensure this fault tolerance
capability of the system.
• Aggregation of the result: There should be a mechanism to aggregate the result generated
by each of the machines to produce the final output.
MAP REDUCE
• MapReduce is a programming framework that allows us to perform distributed and
parallel processing on large data sets in a distributed environment.
• MapReduce consists of two distinct tasks — Map and Reduce.
• As the name MapReduce suggests, reducer phase takes place after the mapper
phase has been completed.
• So, the first is the map job, where a block of data is read and processed to produce
key-value pairs as intermediate outputs.
• The output of a Mapper or map job (key-value pairs) is input to the Reducer.
• The reducer receives the key-value pair from multiple map jobs.
• Then, the reducer aggregates those intermediate data tuples (intermediate key-value
pair) into a smaller set of tuples or key-value pairs which is the final output.
A Word Count Example of MapReduce
• First, we divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.
• Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to
each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is
that every word, in itself, will occur once.
• Now, a list of key-value pair will be created where the key is nothing but the individual
words and value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs —
Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
• After the mapper phase, a partition process takes place where sorting and shuffling
happen so that all the tuples with the same key are sent to the corresponding reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key and a list of
values corresponding to that key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
• Now, each Reducer counts the values present in its list. As shown in
the figure, the reducer gets the list of values [1,1] for the key Bear. It then counts the
number of ones in the list and gives the final output as Bear, 2.
• Finally, all the output key/value pairs are then collected and written in the output file.
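The program described above is the standard introductory MapReduce example; a Java version in the style of the official Hadoop WordCount tutorial is sketched below. Input and output HDFS paths are passed as command-line arguments, and the reducer class is also registered as a combiner for local aggregation.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the list of counts received for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}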
Logical Flow of Data in MapReduce
Input File → Input Split → Map → Combine → Shuffle & Sort → Reduce → Output
1. Input File
• The data for a MapReduce task is stored in input files, which typically live in HDFS.
• These files contain both structured and unstructured data.
2. Input Split
• Hadoop framework divides the huge input file into smaller chunks/blocks, these chunks
are referred as input splits.
• For each input split Hadoop creates one map task to process records in that input split.
That is how parallelism is achieved in Hadoop framework.
3. Map
• The Mapper class contains the map function's processing logic.
• The conditional logic is applied to the ‘n’ number of input blocks/splits spread across
various data nodes.
• The mapper receives input in key-value format (k, v), where the key represents
the offset address of each record and the value represents the entire record content;
from these it emits intermediate key-value pairs as output.
4. Combine
• Combiner is Mini-reducer/Semi reducer which performs local aggregation on the mapper's output.
• It is an optional phase.
• The job of the combiner is to optimize the output of the mapper before it is fed to the reducer, in
order to reduce the data size that is moved to the reducer.
• In this phase, various outputs of the mappers are locally reduced at the node level.
5. Shuffle & Sort
• The key-value pair outputs of the various mappers (k, v) go into the Shuffle and Sort phase.
• The values are grouped together on the basis of their keys, so that all values belonging to
the same key are brought together.
• The output of the Shuffle and Sort phase will be key-value pairs again as key and array of values (k,
v[]).
6. Reduce
• The output of the Shuffle and Sort phase (k, v[]) will be the input of the Reducer phase.
• In this phase reducer function’s logic is executed and all the values are aggregated against their
corresponding keys.
• Reducer consolidates outputs of various mappers and computes the final job output.
7. Output
The final output is then written into a single file in an output directory of HDFS
Features of MapReduce
• Scheduling (of tasks among nodes based on the availability)
Mapping divides big tasks into subtasks, assigns them to individual nodes in the cluster, and
executes them in parallel; therefore the MapReduce model requires scheduling.
• Synchronization (among running subtasks)
Accomplished by a barrier between the map and reduce phases of processing.
Synchronization refers to the mechanisms that allow multiple concurrently running processes
to "join up" – for example, to share intermediate results or exchange state information.
• Co-location of Code/Data
In order to achieve data locality, the scheduler starts tasks on the node that holds a particular
block of data needed by the task.
• Handling Errors/Faults
High chances for failure of running nodes. MapReduce engine has the capability to recognize
and rectify the faults effectively. It also identifies the incomplete tasks and reassigns it to other
available nodes.
Benefits of MapReduce
1. Fault-tolerance
• If a machine carrying a few data blocks fails in the middle of a
MapReduce job, the architecture handles the failure.
• It considers replicated copies of the blocks on alternate machines for
further processing.
2. Resilience
• Each node periodically updates its status to the master node.
• If a slave node doesn’t send its notification, the master node
reassigns the currently running task of that slave node to other
available nodes in the cluster.
3. Quick
• Data processing is quick as MapReduce uses HDFS as the storage
system.
4. Parallel Processing
• MapReduce tasks process multiple chunks of the same dataset in
parallel by dividing the work.
• This gives the advantage of task completion in less time.
5. Availability
• Multiple replicas of the same data are sent to numerous nodes in the
network.
• Thus, in case of any failure, other copies are readily available for
processing without any loss.
6. Scalability
• MapReduce lets you run applications on a huge number of nodes,
using terabytes and petabytes of data, by accommodating new nodes
when needed.
HBase
• HBase is a column-oriented NoSQL database management system that runs on
top of HDFS (Hadoop Distributed File System) efficient for structured data
storage and processing.
• HBase is modeled after Google's Bigtable and is primarily written in Java.
• Apache HBase is needed for real-time Big Data applications.
• HBase is used extensively for random read and write operations
• HBase stores a large amount of data in terms of tables
• HBase stores data in the form of key/value pairs in a columnar model. In this
model, all the columns are grouped together as Column families.
• Running HBase on top of Hadoop increases the throughput and performance of a
distributed cluster setup; in turn, it provides faster random read and write
operations.
• One can store the data in HDFS either directly or through HBase.
• Good for Structured as well as Semi structured data.
HBase Table: Columns & Rows
• HBase is a column-oriented, non-relational database. This means that data is
stored in individual columns, and indexed by a unique row key.
• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
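A minimal sketch of the column-family model using the HBase Java client API is shown below: one cell is written with a Put and read back with a Get. The table name "customers", the column family "info", and the sample row are assumptions made for this illustration, and the table is assumed to already exist on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster connection settings.
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            // Write one cell: row key "cust001", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("cust001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Random read of the same row by its row key.
            Result result = table.get(new Get(Bytes.toBytes("cust001")));
            String name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
            System.out.println("name = " + name);
        }
    }
}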
