
Big Data

Big Data refers to the vast amounts of data generated at high speed and in various formats, which traditional data management systems struggle to analyze. It encompasses structured, unstructured, and semi-structured data, and requires specialized tools like Hadoop for storage and analysis. The four key characteristics of Big Data are volume, velocity, variety, and veracity, which together define its complexity and the need for advanced analytics techniques.


What is Big Data?

>>We live in a digital world where data is increasing rapidly because of the
ever-increasing use of the Internet, sensors, and heavy machines.
>>The sheer volume, variety, velocity, and veracity of such data is
signified by the term 'Big Data'.
>>Big Data is structured, unstructured, semi-structured, or
heterogeneous (varied, mixed) in nature.
>>It becomes difficult for computing systems to manage Big Data because
of the immense speed and volume at which it is generated.
>>Traditional data management, warehousing, and analysis systems fail to
analyze this type of data.
>>Due to its complexity, Big Data is stored in distributed file system
architectures.
What is Big Data? cntd…
>>Hadoop by Apache is widely used for storing and managing Big Data.
>>Analyzing Big Data is a challenging task, as it involves large
distributed file systems, which should be fault tolerant (fault tolerance is the property of
a system that maintains continuous service even during
faults, or a process that enables an operating system to respond to a
failure in hardware or software), flexible, and scalable.
>>According to IBM, "Every day, we create 2.5 quintillion bytes of data
– so much that 90% of the data in the world today has been created in
the last two years alone."
>>This data comes from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals, to name a few.
This data is Big Data.
What is Big Data? cntd…
>>The process of capturing or collecting Big Data is known as
'datafication'.
>>Big Data is 'datafied' so that it can be used productively.
>>Big Data cannot be made useful by simply organizing it; rather, the
data's usefulness lies in determining what we can do with it.
Note: By large or huge datasets or Big Data, we mean anything from a
petabyte (1 PB = 1000 TB) to an exabyte (1 EB = 1000 PB) of data.
Some real-world examples of Big Data (figure)
Types and sources of data (figure)
Overview of Big Data techniques (figure)
Structuring Big Data:
>>Structuring of data is arranging the available data in a manner
such that it becomes easy to study, analyze, and derive conclusions
from it. But why is structuring required?
>>In daily life, you may have come across questions like:
1. How do I use the vast amount of data and information I come across to
my advantage?
2. Which news articles should I read of the thousands I come across?
3. How do I choose a book of the millions available on my favorite sites or
stores?
4. How do I keep myself updated about new events, sports, inventions, and
discoveries taking place across the globe?
>>Solutions to such questions can be found by information
processing systems (IPS).
>>These systems can analyze and structure a large amount of data
specifically for you on the basis of what you searched, what you looked
at, and for how long you remained at a particular page or website, thus
scanning and presenting you with customized information as per
your behavior and habits.
>>In other words, structuring data helps in understanding user
behaviors, requirements, and preferences to make personalized
recommendations for every individual.
>>When a user regularly visits or purchases from an online shopping
site, say eBay, each time he/she logs in, the system can present a
recommended list of products that may interest the user on the basis of
his/her earlier purchases or searches, thus presenting a specially
customized recommendation set for every user.
>>This is the power of Big Data analytics.
>>Today, various sources generate a variety of data, such as images,
text, audio, etc.
>>All such different types of data can be structured only if they are sorted
and organized in some logical pattern.
>>Thus, the process of structuring data requires one to first
understand the various types of data available today.
Types of data
1. Internal [provides structured or organized data that originates within the
enterprise and helps run the business]
2. External [provides unstructured or unorganized data that originates
from the external environment of an organization]
On the basis of the data received from these sources, Big Data
comprises:
• Structured data
• Unstructured data
• Semi-structured data
Structured data
>>Structured data can be defined as data that has a defined, repeating
pattern.
>>This pattern makes it easier for any program to sort, read, and process the
data.
>>Processing structured data is much easier and faster than processing data
without any specific repeating pattern.
Structured data:
..is organized data in a predefined format.
..is stored in tabular form.
..is data that resides in fixed fields within a record or file.
..is formatted data that has entities and their attributes mapped.
..is used to query and report against predetermined data types.
Some sources of structured data include:
>>Relational databases (in the form of tables)
>>Flat files in the form of records (like comma-separated values (CSV) and
tab-separated files)
>>Multidimensional databases (majorly used in data warehouse technology)
>>Legacy databases.
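As a small illustration of structured data, the sketch below reads a CSV flat file whose fixed, repeating fields make sorting and querying straightforward; the file contents and field names are hypothetical.

```python
import csv
import io

# A hypothetical CSV flat file: every record repeats the same fixed fields.
raw = io.StringIO(
    "order_id,customer,amount\n"
    "101,Asha,250.00\n"
    "102,Ravi,99.50\n"
    "103,Meena,410.75\n"
)

# Because the pattern is fixed, a generic reader can parse every record.
orders = list(csv.DictReader(raw))

# Structured data is easy to sort and query against predetermined fields.
for row in sorted(orders, key=lambda r: float(r["amount"]), reverse=True):
    print(row["order_id"], row["customer"], row["amount"])
```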
Unstructured data
>>Unstructured data is a set of data that might or might not
have any logical or repeating pattern.
Unstructured data:
..typically includes metadata, i.e., additional information
related to the data.
..comprises inconsistent data, such as data obtained from files,
social media websites, satellites, etc.
..consists of data in different formats such as e-mails, text,
audio, video, or images.
Some sources of unstructured data include:
>>Text both internal and external to an organization –
documents, logs, survey results, feedback, and e-mails from
both within and across the organization.
>>Social media – data obtained from social networking
platforms, including YouTube, Facebook, Twitter, LinkedIn,
and Flickr.
>>Mobile data – data such as text messages and location
information.
About 80 percent of enterprise data consists of unstructured
content.
>>Unstructured data examples: there is a wide array of forms
that make up unstructured data, such as e-mail, text files, social
media posts, video, images, audio, sensor data, and so on.
>>The travel agency Facebook post: an example of
unstructured data. (figure)
Semi-structured data:
>>Semi-structured data, also described as schema-less
or self-describing, refers to a form of structured data
that contains tags or markup elements in order to separate
elements and generate hierarchies of records and fields in the
given data.
>>This type of data does not follow the proper structure of
data models as in relational databases.
>>In other words, the data is not stored consistently in the rows and
columns of a database.
>>Some sources of semi-structured data include:
..File systems such as Web data in the form of cookies.
..Data exchange formats such as JavaScript Object
Notation (JSON) data.
..As another example, an XML document might contain tags
that indicate the structure of the document, but may also
contain additional tags that provide metadata about the content,
such as author, date, or keywords.
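For illustration, the sketch below parses a small JSON document (the field names and values are made up): the self-describing tags name each element, the nesting creates a hierarchy of records and fields, and fields may vary from record to record even though no fixed table schema exists.

```python
import json

# A hypothetical semi-structured record: self-describing tags, nested
# hierarchy, and fields that vary from record to record.
doc = """
{
  "author": "J. Doe",
  "date": "2024-01-15",
  "keywords": ["big data", "semi-structured"],
  "posts": [
    {"text": "Hello", "likes": 4},
    {"text": "Travel photos", "image": "beach.jpg"}
  ]
}
"""

record = json.loads(doc)

# The metadata tags (author, date, keywords) travel with the content.
print(record["author"], record["date"])

# Fields are optional: one post has "likes", the other has "image".
for post in record["posts"]:
    print(post.get("text"), post.get("likes", "-"), post.get("image", "-"))
```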
Elements of Big Data
>>According to Gartner, data is growing at the rate of
59% every year.
>>This growth can be depicted in terms of the following
four Vs:
* Volume
* Velocity
* Variety
* Veracity
1. Volume
>>Volume is the amount of data generated by organizations or
individuals.
>>Today, the volume of data in most organizations is approaching
exabytes [1 EB = 1000 PB].
>>Some experts predict the volume of data to reach zettabytes in the
coming years.
>>Organizations are doing their best to handle this ever-increasing
volume of data.
>>For example, according to IBM, over 2.7 zettabytes of data is
present in the digital universe today.
>>Every minute, over 571 new websites are created.
>>IDC (International Data Corporation) estimates that by
2020, online business transactions will reach up to 450 billion per day.
2. Velocity
>>Velocity describes the rate at which data is generated,
captured, and shared.
>>Enterprises can capitalize on data only if it is captured
and shared in real time.
>>Information processing systems such as CRM and ERP
face problems associated with data, which keeps adding up
but cannot be processed quickly.
The sources of high-velocity data include the following:
>>IT devices, including routers, switches, firewalls, etc., which constantly
generate valuable data.
>>Social media, including Facebook posts, tweets, and other social
media activities.
>>Portable devices, including mobile phones, PDAs, etc., which also generate
data at a high speed.
3. Variety
>>We all know that data is being generated at a very fast
pace.
>>Now, this data is generated from different types of sources,
such as internal, external, social, and behavioral, and comes in
different formats such as images, text, videos, etc.
>>Even a single source can generate data in varied formats;
for example, GPS and social networking sites, such as
Facebook, produce data of all types, including text, images,
videos, etc.
>>The various types of data are shown in the following figure.
4. Veracity
>>Veracity generally refers to the uncertainty of data, i.e.,
whether the obtained data is correct or consistent.
>>Out of the huge amount of data that is generated in almost
every process, only the data that is correct and consistent can be
used for further analysis.
>>Data, when processed, becomes information; however, a lot of effort
goes into processing the data.
>>Big Data, especially in its unstructured and semi-structured forms,
is messy in nature, and it takes a good amount of time and expertise to
clean that data and make it suitable for analysis.
Big Data Analytics
>>Big Data analytics has changed the way business is conducted in
many ways; for example, it improves decision making, business
process management, etc.
>>Business analytics uses data and various other
techniques, such as information technology, statistics,
quantitative methods, predictive analytics, and
prescriptive analytics.
>>There are three main types of business analytics:
descriptive analytics, predictive analytics, and prescriptive
analytics.
Big Data Analytics cntd..
>>Conventional database systems are not in a position to process
Big Data defined by the four Vs: volume, variety, velocity, and veracity.
>>Big Data also affects the analytical process and the technologies used for
analytics.
>>There are mainly three types of analytics:
1. Descriptive Analytics: DA is the most prevalent form of
analytics, and it serves as a base for advanced analytics.
>>It answers the question 'What happened in the business?'
>>DA analyzes databases to provide information on the trends of past or
current business events that can help managers, planners, leaders, etc.
to develop a road map for future actions.
>>DA performs an in-depth analysis of data to reveal details such as
the frequency of events, operation costs, and the underlying reasons for
failures.
>>It helps in identifying the root cause of a problem.
2. Predictive Analytics –
>>PA is about understanding and predicting the future and
answers the question 'What could happen?' by using statistical
models and different forecasting techniques.
>>It predicts near-future probabilities and trends and helps
in what-if analysis.
>>In PA, we use statistics, data mining techniques, and
machine learning to analyze the future.
>>The below figure shows the steps involved in predictive analytics:
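As a toy illustration of the idea (not the figure's full pipeline), the sketch below fits a linear trend to made-up monthly sales and extrapolates one step ahead; it assumes Python 3.10+, where statistics.linear_regression is available.

```python
from statistics import linear_regression

# Hypothetical historical data: month index -> sales.
months = [1, 2, 3, 4, 5, 6]
sales = [110.0, 125.0, 122.0, 140.0, 151.0, 160.0]

# Fit a simple linear trend to the past observations.
slope, intercept = linear_regression(months, sales)

# "What could happen?" – extrapolate the trend to the next month.
next_month = 7
forecast = slope * next_month + intercept
print(f"Forecast for month {next_month}: {forecast:.1f}")
```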
3. Prescriptive Analytics –
>>Prescriptive analytics answers 'What should we do?' on the
basis of complex data obtained from descriptive and predictive
analyses.
>>By using optimization techniques, prescriptive analytics
determines the best alternative to minimize or maximize some
objective in finance, marketing, and many other areas.
>>For example, if we have to find the best way of shipping goods
from a factory to a destination so as to minimize costs, we will use
prescriptive analytics, as sketched below.
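A minimal sketch of that shipping example, assuming made-up routes and per-unit costs; prescriptive analytics here reduces to choosing the alternative that minimizes the cost objective.

```python
# Hypothetical per-unit shipping costs for each alternative route
# from the factory to the destination.
route_costs = {
    "road":        4.20,
    "rail":        3.10,
    "rail + road": 3.55,
    "air":         9.80,
}

units = 1200  # hypothetical shipment size

# Prescription: choose the alternative that minimizes the total cost.
best_route = min(route_costs, key=route_costs.get)
total = route_costs[best_route] * units
print(f"Ship via {best_route}: total cost {total:.2f}")
```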
3. Prescriptive Analytics – cntd….
>>The below figure shows a diagrammatic representation of the
stages involved in prescriptive analytics:
3. Prescriptive Analytics –
>>Data, which is available in abundance, can be streamlined for
growth and expansion in technology as well as business.
>>When data is analysed successfully, it becomes the answer to
one of the most important questions: how can businesses acquire more
customers and gain business insight? The key to this problem
lies in being able to source, link, understand, and analyse data.
Advantages of Big Data Analytics…
1. Procurement
2. Product Development
3. Manufacturing
4. Distribution
5. Marketing
6. Price management
7. Merchandising
8. Sales
9. Store operations
10. Human Resources
Application of Big Data
1. Transportation
2. Education
3. Travel
4. Government
5. Healthcare
6. Telecom
7. Consumer Goods Industry
8. Aviation Industry
Application of Big Data cntd..
1. Transportation
>>BD has greatly improved transportation services.
>>The data containing traffic information is analyzed to identify
traffic jam areas.
>>Suitable steps can then be taken, on the basis of this analysis, to keep
the traffic moving in such areas.
>>Distributed sensors are installed in handheld devices, on roads, and
on vehicles to provide real-time traffic information. This information is
analyzed and disseminated to commuters and also to the traffic control
authority.
Application of Big Data cntd..
2. Education
>>BD has transformed modern-day education
processes through innovative approaches, such as
e-learning, that allow teachers to analyze students' ability
to comprehend and thus impart education effectively
in accordance with each student's needs.
>>The analysis is done by studying the responses to
questions, recording the time consumed in attempting
those questions, and analysing other behavioral signals of
the students.
Application of Big Data cntd..
3. Travel
>>The travel industry also uses Big Data to conduct
business.
>>It maintains complete details of all the customer records
that are then analyzed to determine certain behavioral
patterns in customers.
>>For example, in the airline industry, Big Data is analyzed to
identify personal preferences or spot which passengers
like to have window seats for short-haul flights and aisle seats
for long-haul flights.
>>This helps airlines offer similar seats to customers
when they make a fresh booking with the airline.
Application of Big Data cntd..
4. Government
>>According to a UK free-market think tank, "the UK government
could save up to £33 billion a year by using public Big Data
more effectively."
>>Analysis of Big Data promotes clarity and transparency in
various government processes and helps in :
* taking timely and informed decisions about various issues.
* Identifying flaws and loopholes in processes and taking
preventive or corrective measures on time.
*Preventing fraudulent practices in various sectors etc…
Application of Big Data cntd..
5. Healthcare
>>In healthcare, pharmacy and medical device
companies use Big Data to improve their research and
development practices, while health insurance companies
use it to determine patient-specific treatment therapy modes
that promise the best results.
6. Telecom
>>The mobile revolution and the Internet usage on mobile
phones have led to a tremendous increase in the amount of
data generated in the telecom sector.
>>Managing this huge pool of data has become a challenge for
the telecom industry.
>>For example, in Europe, telecom companies are required to
keep data about their customers for at least six months and at most
two years. All this collection, storage, and maintenance of data
would just be a waste of time and resources unless significant
benefits could be derived from it.
>>Big Data analytics allows telecom companies to utilize this data to extract
information and gain crucial insights that help them enhance
their performance, improve customer services, maintain their hold on the
market, and generate more business opportunities.
7. Consumer Goods Industry
>>Consumer goods companies generate huge volumes of data in
varied formats from different sources, such as transactions, billing
details, feedback forms, etc.
>>This data needs to be organized and analysed in a systematic
manner in order to derive any meaningful information from it.
>>For example, the data generated from Point-of-Sale (POS) systems
provides significant real-time information about customer preferences,
current market trends, and the increase and decrease in demand for different
products in different regions.
>>This helps to predict any fluctuations in the prices of goods and make
purchases accordingly.
8. Aviation Industry
>>Like other industries, the aviation industry also maintains
a detailed record of all its customers, including their
personal information, flying preferences, and other trends
and patterns.
>>Organizations analyze this data to improve their customer
services, and thus their brand image.
>>In addition, every aircraft generates a significant amount of
data during operation.
>>This data is analysed to enhance operational efficiency,
identify parts that require repair, and take necessary
corrective measures on time.
Distributed and Parallel Computing for Big Data
** Distributed Computing:
>>In distributed computing (DC), multiple computing resources are connected in a
network, and computing tasks are distributed across these
resources.
>>This sharing of tasks increases the speed as well as the
efficiency of the system. For this reason, DC is
considered faster and much more efficient than traditional
methods of computing.
>>It is also more suitable for processing huge amounts of data in a
limited time.
** Distributed Computing (figure)
** Parallel Computing:
>>Another way to improve the processing capability of a
computer system is to add additional computational
resources to it.
>>This helps in dividing complex computations into
subtasks, which can be handled individually by processing
units running in parallel.
>>We call such systems parallel systems, in which
multiple parallel computing resources are involved in carrying out
calculations simultaneously.
>>The concept behind involving multiple parallel resources is
that the processing capability will increase with the increase
in the level of parallelism, as the sketch below illustrates.
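A minimal sketch of this idea in Python, assuming a toy CPU-bound task: the input is divided into subtasks that a pool of worker processes handles in parallel, and the partial results are then combined.

```python
from multiprocessing import Pool

def subtask(chunk):
    """A toy CPU-bound computation handled by one processing unit."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))

    # Divide the complex computation into four subtasks.
    n = len(data) // 4
    chunks = [data[i:i + n] for i in range(0, len(data), n)]

    # Worker processes run the subtasks in parallel.
    with Pool(processes=4) as pool:
        partials = pool.map(subtask, chunks)

    # Combine the partial results.
    print(sum(partials))
```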
** Parallel Computing (figure)
** Organizations use both parallel and distributed
computing techniques to process Big Data.
** An important constraint for businesses is time. If there were
no restriction on time, every organization would hire outside (or
third-party) sources to perform the analysis of its complex data.
>>The direct benefit of adopting this method is that the
organization would not require its own resources and data sources
to process and analyze complex data.
>>Third parties are usually agencies specialized in the field of
data manipulation, processing, and analysis.
>>Hiring third-party agencies reduces the storage and processing
costs of handling large amounts of data.
The following elaborates the processing of a large dataset
in a distributed computing environment:

Fig: Distributed computing technique for processing large data

>>In the above figure, the nodes are arranged within a
system along with the elements that form the core of the
computing resources.
>>These resources include CPU, memory, disks, etc. Big
Data systems usually have higher scaling requirements.
>>So, these nodes are beneficial for adding
scalability to the Big Data environment, as and when
required.
>>A system with added scalability can accommodate
the growing amounts of data more efficiently and
flexibly.
>>The DC technique also makes use of virtualization and
load balancing features.
>>The sharing of workload across various systems
throughout the network to manage the load is known as
load balancing.
>>The virtualization feature creates a virtual
environment in which hardware platform, storage device,
and operating system (OS) are included.
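A toy sketch of the load-balancing idea, assuming a simple round-robin policy across hypothetical worker nodes; real balancers also weigh node load and health.

```python
from itertools import cycle

# Hypothetical worker nodes in the network.
nodes = ["node-1", "node-2", "node-3"]

# Round-robin load balancing: spread incoming tasks across the nodes.
assignment = {}
for task, node in zip(range(9), cycle(nodes)):
    assignment.setdefault(node, []).append(task)

for node, tasks in assignment.items():
    print(node, tasks)   # each node receives an equal share of the workload
```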
>>Parallel computing technology uses a number of
techniques to process and manage huge amounts of data
produced at a high velocity; some of these techniques
are shown below. (figure)
The following comparison shows how parallel systems differ
from distributed systems.

Distributed System:
- An independent, autonomous system connected in a network for
accomplishing specific tasks.
- Coordination is possible between connected computers that have
their own memory and CPU.
- Loose coupling of computers connected in a network, providing
access to data and remotely located resources.

Parallel System:
- A computer system with several processing units attached to it.
- A common shared memory can be directly accessed by every
processing unit in the network.
- Tight coupling of processing resources that are used for solving
a single, complex problem.
How are data models and computing models different?
>>There are several key differences between the two
infrastructures with respect to the computing model and data model
in a distributed architecture.
** Distributed databases:
Deal with tables and relations
Must have a schema for data
Implement data fragmentation and partitioning
** Hadoop:
Deals with flat files in any format
Operates with no schema for data
Divides files automatically into blocks
The following figures show distributed databases and
Hadoop.
Hadoop
>>Note: Hadoop is an open-source software framework
that is used for storing and processing large amounts of
data in a distributed computing environment. It is
designed to handle Big Data and is based on the
MapReduce programming model, which allows for the
parallel processing of large datasets.
Hadoop
>>The technology designed to process Big Data (which
is a combination of both structured and unstructured
data available in huge volumes) is known as Hadoop.
>>Hadoop is an open-source platform that provides the
analytical technologies and computational power
required to work with such large volumes of data.
>>The Hadoop platform provides an improved programming
model, which is used to create and run distributed systems
quickly and efficiently.
Hadoop cntd…
>>A Hadoop cluster consists of a single master node and
multiple worker nodes. [A node is a point of connection
within a network, i.e., a server.]
>>The master node contains a NameNode and a JobTracker, and
a slave or worker node acts as both a DataNode and a
TaskTracker.
>>Hadoop requires Java Runtime
Environment (JRE) 1.6 or a higher version of JRE.
Hadoop cntd…
>>In a large cluster, HDFS is managed through a
NameNode server that hosts the file system index, and a
secondary NameNode that keeps snapshots of the
NameNode's metadata; at the time of failure of the NameNode, the
secondary NameNode replaces the primary NameNode,
thus preventing the file system from getting corrupted and
reducing data loss.
Hadoop cntd…
>>The following figure shows the Hadoop multinode cluster
architecture. [A checkpoint is the merge of the last changes made on the file system
with the most recent FSImage.]
Hadoop cntd…
E.g.: Apache Hadoop is an open-source framework that
is used to efficiently store and process large datasets
ranging in size from gigabytes to petabytes of data.
Instead of using one large computer to store and
process the data, Hadoop allows clustering multiple
computers to analyze massive datasets in parallel more
quickly.
HDFS and MapReduce
>>The two main components of Apache Hadoop are the Hadoop
Distributed File System (HDFS) and the MapReduce parallel
processing framework.
>>Both are open-source projects: HDFS is used for
storage and MapReduce is used for processing.
>>Hadoop includes a fault-tolerant storage system called HDFS.
It stores large files, from terabytes to petabytes, across different
terminals.
>>HDFS attains reliability by replicating the data over multiple
hosts.
HDFS and MapReduce
>>MapReduce is a framework that helps developers write
programs to process large volumes of unstructured data in parallel over
a distributed architecture, producing results in an aggregated form.
>>MapReduce consists of several components, as follows:
1. JobTracker: the master node that manages all jobs and
resources in a cluster of commodity computers.
2. TaskTrackers: agents deployed on each machine in the
cluster to run the map and reduce tasks on that terminal.
3. JobHistory Server: a component that tracks completed jobs.
Cloud Computing and Big Data
>>In cloud-based platforms, applications can easily
obtain the resources to perform computing tasks. The
costs of acquiring these resources need to be paid as per
the acquired resources and their use.
>>In cloud computing, this feature of acquiring resources
in accordance with requirements, and paying their cost
accordingly, is known as elasticity.
>>Cloud computing makes it possible for organizations to
dynamically regulate the use of computing resources and access
them as per need, while paying only for those resources that
are used.
Features of cloud computing:
1. Scalability: Scalability means the addition of new resources to
an existing infrastructure.
2. Elasticity: Elasticity in the cloud means hiring certain
resources, as and when required, and paying for the
resources that have been used.
3. Resource Pooling: An important aspect of cloud
services for Big Data analytics. In resource pooling,
multiple organizations, which use similar kinds of
resources to carry out computing practices, have no need to
individually hire all the resources.
Features of cloud computing:
4. Self Service:
Cloud computing involves a simple user interface that helps
customers directly access the cloud services they want.
5. Low Cost:
Careful planning, use, management, and control of resources
help organizations reduce the cost of acquiring hardware
significantly.
6. Fault Tolerance:
Cloud computing provides fault tolerance by offering
uninterrupted services to customers, especially in cases of
component failure.
Cloud Services for Big Data
>>In Big Data, the IaaS, PaaS, and SaaS clouds are used in the
following manner:
1. IaaS: The huge storage and computational power
requirements of Big Data are fulfilled by the limitless
storage space and computing ability offered by the IaaS
cloud.
2. PaaS: PaaS offerings of various vendors have started
adding popular Big Data platforms, including
MapReduce and Hadoop. These offerings save
organizations from a lot of hassle (problems) that may
occur in managing individual hardware components and
software applications.
3. SaaS: Various organizations need to identify and analyse
the voice of customers, particularly on social media platforms.
>>The social media data and the platform for analysing the data are
provided by SaaS vendors.
In-Memory Computing Technology for Big Data
>>One way to increase the computational speed and power of data
processing is to use in-memory computing (IMC). The representation of data in the
form of rows and columns makes data processing easier and faster.
>>IMC is used to facilitate high-speed data processing. For example, IMC
can help in tracking and monitoring consumers' activities and
behaviors, which allows organizations to take timely actions to
improve customer services and thus customer satisfaction.
>>In IMC technology, the RAM (primary storage)
is used for analysing data. Keeping data in RAM increases the
computing speed.
>>Simultaneously, the reduction in primary storage cost has
made it feasible to store data in primary memory.
>>The application finds the data in the same location where it
resides. Therefore, the analysis of data can be carried out
more quickly and efficiently.
Hadoop Ecosystem
>>The Hadoop ecosystem is a framework of various types of complex and
evolving tools and components. Some of these elements may be very
different from each other in terms of their architecture.
>>The Hadoop ecosystem can be defined as a comprehensive
collection of tools and technologies that can be effectively
implemented and deployed to provide Big Data solutions in a
cost-effective manner.
>>MapReduce and HDFS are two core components of the
Hadoop ecosystem that provide a great starting point for managing
Big Data.
Hadoop Ecosystem
>>Along with these two, the Hadoop ecosystem provides a
collection of various elements to support the complete
development and deployment of Big Data solutions.
Figure: elements of the Hadoop ecosystem
Hadoop Ecosystem
>>All these elements enable users to process large datasets in
real time and provide tools to support various types of Hadoop
projects, schedule jobs, and manage cluster resources.
HDFS
>>File systems like HDFS are designed to manage the
challenges of Big Data.
>>Being core components, Hadoop MapReduce and HDFS
are always being enhanced, and hence they provide greater
stability.
>>In the case of Hadoop, the base consists of HDFS and
MapReduce; both give the fundamental structure and
integration services required at the core of Big
Data systems. The rest of the ecosystem provides the
components you need to build and oversee goal-driven,
real-time Big Data applications.
Some terms or concepts related to HDFS:
>>Huge files
>>Streaming data access
>>Commodity hardware
>>Low-latency data access
>>Lots of small files
HDFS Architecture
>>HDFS has a master–slave architecture.
>>It comprises a NameNode and a number of DataNodes.
>>The NameNode is the master that manages the various
DataNodes (as shown in the figure below).
HDFS Architecture
>>The NameNode manages the HDFS cluster metadata, whereas the
DataNodes store the data.
>>Records and directories are presented by clients to the
NameNode.
>>These records and directories are managed on the
NameNode.
>>It performs operations like the modification, opening,
and closing of files.
HDFS Architecture
>>Internally, a file is divided into one or more blocks, which are
stored in a group of DataNodes.
>>DataNodes serve read and write requests from the clients.
>>DataNodes execute operations like the creation, deletion,
and replication of blocks, depending on the instructions from
the NameNode.
NameNodes and DataNodes
>>The NameNode deals with the file system namespace.
>>It stores the metadata for all the documents and indexes in the file
system.
>>This metadata is stored on the local disk as two files: the file system
image and the edit log.
>>DataNodes are the workhorses of the file system.
>>They store and retrieve blocks when they are asked to (by clients
or the NameNode), and they report back to the NameNode
periodically with a list of the blocks that they store.
HDFS Commands
Some commonly used HDFS commands are shown in the following
table; a few common ones are sketched below.
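The table itself is not reproduced here. As a hedged sketch, the snippet below invokes a few standard HDFS shell commands (hdfs dfs -mkdir, -put, -ls, -cat, -rm) from Python; it assumes a configured Hadoop client on the PATH, and the paths and file names are hypothetical.

```python
import subprocess

def hdfs(*args):
    """Run an HDFS shell command and return its output."""
    result = subprocess.run(
        ["hdfs", "dfs", *args], capture_output=True, text=True, check=True
    )
    return result.stdout

# Hypothetical paths; each call mirrors a row of the commands table.
hdfs("-mkdir", "-p", "/user/demo")           # create a directory
hdfs("-put", "local.txt", "/user/demo/")     # copy a local file into HDFS
print(hdfs("-ls", "/user/demo"))             # list directory contents
print(hdfs("-cat", "/user/demo/local.txt"))  # print a file's contents
hdfs("-rm", "/user/demo/local.txt")          # delete a file
```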
Features of HDFS
>>Data replication, data resilience, and data integrity are the three
key features of HDFS.
>>HDFS ensures data integrity throughout the cluster with the
help of the following features:
1. Maintaining Transaction Logs: HDFS maintains transaction logs
in order to monitor every operation and carry out effective auditing
and recovery of data in case something goes wrong. (A transaction log
records database modifications – the history of actions executed by a
database management system.)
2. Validating Checksums: A checksum is an effective error-detection
technique wherein a numerical value is computed for a transmitted
message on the basis of the bits contained in the message (see the
sketch after this list).
3. Creating Data Blocks: HDFS maintains replicated copies of data
blocks to avoid corruption of a file due to the failure of a server.
>>Data blocks are sometimes also called block servers. A block server
primarily stores data in a file system and maintains the metadata of a
block.
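A minimal illustration of checksum validation, assuming a CRC-32 checksum (HDFS itself uses CRC-based checksums over blocks of data): the reader recomputes the checksum and compares it with the stored value to detect corruption.

```python
import zlib

def checksum(data: bytes) -> int:
    """Compute a CRC-32 checksum over the data."""
    return zlib.crc32(data)

# Writer side: store the data together with its checksum.
block = b"some block of file data"
stored_crc = checksum(block)

# Reader side: recompute and compare to detect corruption.
assert checksum(block) == stored_crc            # intact block passes

corrupted = b"some block of f1le data"          # one flipped byte
print(checksum(corrupted) == stored_crc)        # False -> corruption detected
```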
MapReduce
>>The algorithms developed and maintained by the Apache
Hadoop project are implemented in the form of Hadoop
MapReduce, which can be thought of as analogous (equivalent) to
an engine that takes data as input, processes it, generates the
output, and returns the required answers.
>>MapReduce is based on a parallel programming
framework to process large amounts of data dispersed across
different systems.
>>MapReduce facilitates the processing and analysis of
both unstructured and semi-structured data collected from
different sources, which may not be analysed effectively by
other traditional tools.
MapReduce
>>MapReduce enables computational processing of data stored
in a file system without the requirement of loading the data initially
into a database.
>>It supports two operations: map and reduce.
>>These operations execute in parallel on a set of worker nodes.
>>MapReduce works on a master/worker approach in which
the master process controls and directs the entire activity, such as
collecting, segregating, and delegating data among the different
workers, as the word-count sketch below illustrates.
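A minimal, self-contained sketch of the map and reduce operations (a local simulation of the model, not the Hadoop API): the map step emits (word, 1) pairs from each input split, a shuffle step groups the pairs by key, and the reduce step aggregates each group.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input split."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: aggregate all counts emitted for one key."""
    return word, sum(counts)

# Hypothetical input splits, as if read from blocks of a file.
splits = ["big data needs big tools", "hadoop processes big data"]

# Shuffle: group the mapped pairs by key.
grouped = defaultdict(list)
for line in splits:
    for word, count in map_phase(line):
        grouped[word].append(count)

# Reduce each group and print the aggregated result.
for word in sorted(grouped):
    print(reduce_phase(word, grouped[word]))   # e.g. ('big', 3)
```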
MapReduce (figure)
Hadoop YARN
>>YARN stands for Yet Another Resource Negotiator.
>>It is a core Hadoop service that supports two major
functions: global resource management
(ResourceManager) and per-application management
(ApplicationMaster).
>>YARN is an improvement of the second-generation
Hadoop ecosystem (Hadoop 2.0).
>>YARN is sometimes called MapReduce version 2,
and is part of the core Hadoop project in the Apache
Software Foundation's distributed processing framework.
Hadoop YARN
>>YARN is a key element of the Hadoop data processing
architecture that provides different data handling mechanisms,
including interactive SQL and batch processing.
>>It improves the performance of data processing in Hadoop by
separating the resource management and scheduling capabilities
of MapReduce from its data processing component.
>>As background, in 2012 the Hadoop project was
upgraded to introduce a new architecture, known as YARN,
which provides a more general-purpose data processing
framework.
>>This framework supports the MapReduce model as well as
some other data processing models based on the distributed
processing of data.
Hadoop YARN advantages (figure)
YARN Architecture
>>In Hadoop 2, the ResourceManager takes the overall
responsibility of controlling and managing resources, and the
ApplicationMaster component allows a cluster to handle
multiple applications at a time.
>>Each application in the cluster has its own
ApplicationMaster instance.
>>The ResourceManager and the per-node slave (also called the
NodeManager) together manage applications in a distributed
manner by forming the data-computation framework.
>>The two primary components of YARN are the
ResourceManager and the ApplicationMaster.
ResourceManager:
>>The ResourceManager, in the YARN architecture, is the supreme
authority that controls all the decisions related to resource
management and allocation.
>>It has a Scheduler Application Programming Interface (API)
that negotiates and schedules resources.
>>The role of the ResourceManager in YARN is to optimize the
utilization of resources at all times while managing all the
restrictions, which involve capacity guarantees, fairness in the
allocation of resources, etc.
>>The ResourceManager performs all its tasks in integration with the
NodeManager and ApplicationMaster components.
>>Resources on each node are allocated and managed by its
respective NodeManager.
>>The ResourceManager gives instructions to the NodeManager,
which is responsible for managing the resources available on the
node it manages.
>>Similarly, for each application there is an ApplicationMaster instance
that negotiates resources with the ResourceManager and, in
association with NodeManager instances, starts the
containers (physical resources).
ApplicationMaster
>>Every instance of an application running within YARN is
managed by an ApplicationMaster, which is responsible for
the negotiation of resources with the ResourceManager.
>>It keeps track of the availability and consumption of container
resources, i.e., CPU, memory, etc., and provides fault tolerance for
resources.
>>For the integration of components, the ResourceManager needs
information about the resources needed by each application. This
information helps the ResourceManager negotiate resources
with the ApplicationMaster and provide optimum resource
utilization.
>>This request for the appropriate resources is known as a
ResourceRequest.
Containers:
>>A container is nothing but a set of physical resources on a single
node.
>>A container consists of memory (RAM), CPU cores, and disks.
>>Depending upon the resources on a node, a node can have multiple
containers that are assigned to a specific ApplicationMaster.
>>A container thus represents a resource on a single node in a given
cluster.
>>A container is supervised by the NodeManager and scheduled by
the ResourceManager.
>>The ApplicationMaster itself is launched in a container, which is
referred to as Container 0.
NodeManager:
>>The NodeManager is a per-machine slave, which is responsible for
launching the applications' containers, monitoring their resource
usage (CPU, memory, disk, network, etc.), and reporting the
status of the resource usage to the services within the cluster.
>>The NodeManager manages each node within a YARN cluster.
Working of YARN
>>The overall process flow within YARN begins with a
request from a client that consists of an application.
>>The ResourceManager negotiates the necessary resources for a
container and launches an ApplicationMaster.
>>The ApplicationMaster negotiates resource containers for the
application at each node by sending ResourceRequests.
The following steps explain the working of YARN:
• A client program submits the application to the
ResourceManager.
• Now, a container is needed in which to launch the
ApplicationMaster.
Working of YARN
• The ApplicationMaster then registers itself with the
ResourceManager.
• As the ApplicationMaster needs resources to complete its work,
the ResourceManager allocates the appropriate containers that are
specific to the application.
• Now that a container is allocated, the ApplicationMaster launches
the container by providing the container launch specification to
the NodeManager.
• The application code executing within the container provides the
necessary information, including progress, status, resource
availability, etc., to its ApplicationMaster.
• On completion of the submitted application, the ApplicationMaster
deregisters itself with the ResourceManager.
Working of YARN
• YARN permits the simultaneous execution of a variety of
programming models, including iterative processing, graph
processing, machine learning, and general cluster computing,
with the help of application-specific ApplicationMasters.
Working of YARN
• The below figure shows the working of YARN
YARN SCHEDULERS
>>The two types of schedulers are:
1. Capacity Scheduler
2. Fair Scheduler

Capacity Scheduler
>>The Capacity Scheduler is the default scheduler used in Hadoop 2.
>>Its purpose is to allow multi-tenancy and to share resources
between multiple organizations and applications on the same
cluster.
Capacity Scheduler cntd..
>>It supports the following features:
1. Hierarchical queues
2. Capacity guarantees
3. Security
4. Elasticity
5. Multi-tenancy
6. Resource-based scheduling
Fair Scheduler
>>The Fair Scheduler is a method of assigning resources to
applications such that all applications get an
equal share of resources, on average, over their course of running.
YARN Commands
Administration commands: these are used by the cluster
administrator.
User commands: these types of commands are used by
the cluster user. One of each kind is sketched below.
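As a hedged sketch (assuming a configured YARN client on the PATH), the snippet below runs one common user command and one administration command from Python.

```python
import subprocess

def yarn(*args):
    """Run a YARN CLI command and return its output."""
    result = subprocess.run(
        ["yarn", *args], capture_output=True, text=True, check=True
    )
    return result.stdout

# User command: list the applications currently known to the ResourceManager.
print(yarn("application", "-list"))

# Administration command: ask the ResourceManager to reload its queue
# configuration (e.g., after editing the scheduler settings).
print(yarn("rmadmin", "-refreshQueues"))
```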