UNIT I INTRODUCTION TO BIG DATA
YEAR/SEM: IV / VIII
Regulation: 2017
Evolution of Big data - Best Practices for Big data Analytics - Big data characteristics - Validating - The Promotion of the Value of Big Data - Big Data Use Cases - Characteristics of Big Data Applications - Perception and Quantification of Value - Understanding Big Data Storage - A General Overview of High Performance Architecture - HDFS - Map Reduce and YARN - Map Reduce Programming Model
Part - A
Many IT tools are available for big data projects. Organizations whose data workloads are
constant and predictable are better served by a traditional database, whereas organizations
challenged by increasing data demands will need to take advantage of Hadoop's scalable
infrastructure.
2. What is analysis?
It is the process of exploring data and reports in order to extract meaningful insights which can
be used to better understand and improve business performance.
Map Reduce provides a data-parallel programming model for clusters of commodity machines. It
was pioneered by Google, which processes 20 PB of data per day with it. Map Reduce was popularized
by the Apache Hadoop project and is used by Yahoo, Facebook, Amazon and others.
• Marketing
• Finance
• Government
• Healthcare
• Insurance
• Retail
6. Define Data analytics.
Data analytics (DA) is the science of examining raw data with the purpose of drawing
conclusions about that information. Data analytics is used in many industries to allow companies
and organizations to make better business decisions and in the sciences to verify or disprove
existing models or theories.
• Knowledge mining
• Knowledge extraction
• Data/ pattern analysis.
• Data Archaeology
• Data dredging
• Data cleaning
• Data Mining
• Pattern Evaluation
• Knowledge Presentation
• Data Integration
• Data Selection
• Data Transformation
Concept of combining the predictions made from multiple models of data mining and
analyzing those predictions to formulate a new and previously unknown prediction.
• GUI
• Pattern Evaluation
• Database or data warehouse server
10. Define descriptive model
It is used to determine the patterns and relationships in a sample of data. Data mining tasks
that belong to the descriptive model:
• Clustering
• Summarization
• Association rules
• Sequence discovery
A pattern represents knowledge if it is easily understood by humans; valid on test data with
some degree of certainty; and potentially useful, novel, or validates a hunch about which the
user was curious. Measures of pattern interestingness, either objective or subjective, can be used
to guide the discovery process.
In Hadoop 1.0, the batch processing framework Map Reduce was closely paired with
HDFS (Hadoop Distributed File System). With the addition of YARN to these two components,
giving birth to Hadoop 2.0, came a lot of differences in the ways in which Hadoop worked. Let’s
go through these differences.
Namespace: with YARN, Hadoop supports multiple namespaces, whereas in Hadoop 1.0 only one
namespace, i.e., HDFS, could be supported.
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models. It
is designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the library
itself is designed to detect and handle failures at the application layer, so delivering a highly-
available service on top of a cluster of computers, each of which may be prone to failures.
Big Data analytics is indeed a revolution in the field of Information Technology. The
use of data analytics by companies is increasing every year. The primary focus of the
companies is on customers. Hence the field is flourishing in Business to Consumer (B2C)
applications. We divide the analytics into different types as per the nature of the environment.
We have three divisions of Big Data analytics: Prescriptive Analytics, Predictive Analytics, and
Descriptive Analytics.
Part - B
1. What are the Characteristics of Big Data? List the different types of big data
applications. 13 Marks
(i) Volume
• The name 'Big Data' itself is related to a size which is enormous. The size of the data
plays a very crucial role in determining its value.
• Also, whether particular data can actually be considered Big Data or not depends
upon the volume of the data. Hence, 'Volume' is one characteristic which needs to be
considered while dealing with 'Big Data'.
(ii) Variety
• Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured.
• In earlier days, spreadsheets and databases were the only sources of data
considered by most applications.
• Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs,
audio, etc. is also being considered in analysis applications.
• This variety of unstructured data poses certain issues for storage, mining and
analyzing data.
(iii) Velocity
• Velocity refers to the speed at which data is generated and processed to meet demands.
Big data flows in at high speed from sources such as business processes, machines,
networks, social media and mobile devices, and this flow is massive and continuous.
(iv) Variability
• This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
Real Time Big data Applications:
Demand can be forecasted properly as per different conditions available with Big Data.
Big data can be used to identify machinery and process variations that may be indicators
of quality problems.
Based on the data available, analysis can be done to ensure proper distribution in the right
markets.
Big data helps in identifying better marketing strategies that could increase sales.
6) Price Management using big data
To maintain position in the market, price management plays a key role, and Big data helps
businesses in knowing the market trends relevant to it.
7) Merchandising
Big Data plays a major role in sales for the retail market also.
It helps in increasing sales for the business. It also helps in optimizing the assignment of sales
resources and accounts, product mix and other operations.
• Different tools can be used to monitor store operations which reduce manual
work.
• Big data helps in adjusting inventory levels on the basis of predicted buying
patterns, study of demographics, weather, key events, and other factors.
• Big Data has provided the biggest opportunity to companies like Citibank to see the
big picture, by balancing the sensitive nature of the data used to deliver value to
clients with the need to prioritize the privacy and protection of information.
• It has been fully adopted by many companies to drive business growth and
enhance the services they provide to customers.
• Income tax departments have also benefited from big data.
12) Big data in Finance sector
• Financial services have widely adopted big data analytics to inform better
investment decisions with consistent returns.
• The big data pendulum for financial services has swung from passing fad to large
deployments last year.
• A recent report, “Global Big Data Analytics Market in Telecom Industry 2014-
2018,” found that use of data analytics tools in telecom sector is expected to grow
at a compound annual growth rate of 28.28 percent over the next four years.
• Mobile telecom operators harness Big Data with a combined Actuate and Hadoop solution.
• Big data is used for analyzing data in the electronic medical record (EMR) system
with the goal of reducing costs and improving patient care.
• This Data includes the unstructured data from physician notes, pathology reports
etc. Big Data and healthcare analytics have the power to predict, prevent & cure
diseases.
• Big data is changing the media and entertainment industry, giving users and
viewers a much more personalized and enriched experience.
• Big data is used for increasing revenues, understanding real-time customer
sentiment, increasing marketing effectiveness and ratings and viewership.
17) Big Data in tourism
• Big data is transforming the global tourism industry. People know more about the
world than ever before.
• People have much more detailed itineraries these days with the help of Big data.
Big data is a driving factor behind every marketing decision made by social media
companies and it is driving personalization to the extreme.
2. Differentiate between traditional data and big data.
The major differences between traditional data and big data are discussed below.
Data architecture
• Traditional data use centralized database architecture in which large and complex
problems are solved by a single computer system.
• A centralized architecture is costly and ineffective for processing large amounts of
data. Big data is based on a distributed database architecture, where a large problem
is solved by dividing the data into several smaller blocks.
• Then the solution to a problem is computed by several different computers present
in a given computer network. The computers communicate to each other in order
to find the solution to a problem.
• The distributed database provides better computing and lower cost, and also improves
performance as compared to the centralized database system.
• This is because centralized architecture is based on the mainframes which are not
as economic as microprocessors in distributed database system.
• Also the distributed database has more computational power as compared to the
centralized database system which is used to manage traditional data.
Types of data
• Traditional database systems are based on the structured data i.e. traditional data
is stored in fixed format or fields in a file.
• Examples of structured data include relational database systems (RDBMS) and
spreadsheets, which only answer questions about what happened.
• A traditional database only provides insight into a problem at a small scale.
• However, in order to enhance the ability of an organization to gain more insight
into the data, and also to learn about the metadata, unstructured data is used.
• Big data uses the semi-structured and unstructured data and improves the variety
of the data gathered from different sources like customers, audience or
subscribers.
• After collection, big data transforms it into knowledge-based information.
Volume of data
• The traditional database system can store only a small amount of data, ranging from
gigabytes to terabytes.
• However, big data helps to store and process large amounts of data consisting of
hundreds of terabytes or petabytes of data and beyond.
• The storage of massive amounts of data reduces the overall cost of storing data and
helps in providing business intelligence.
Data schema
• Traditional database systems are based on a fixed, static schema that is defined
before the data is loaded (schema on write).
• Big data, in contrast, uses a dynamic schema: raw data is stored as it arrives and the
schema is applied only when the data is read (schema on read).
Data relationship
• In the traditional database system, relationships between the data items can be
explored easily as the amount of information stored is small.
• However, big data contains massive or voluminous data, which increases the
difficulty of figuring out the relationships between the data items.
Scaling
• Scaling refers to the demand for the resources and servers required to carry out the
computation.
• Big data is based on a scale-out architecture, under which distributed approaches to
computing are employed across more than one server.
• So, the load of the computation is shared across many servers rather than handled by
a single system.
• However, achieving scalability in the traditional database is very difficult
because the traditional database runs on a single server and requires expensive
servers to scale up.
• Under the traditional database system it is very expensive to store massive amounts
of data, so all the data cannot be stored.
• This decreases the amount of data that can be analyzed, which in turn decreases the
accuracy and confidence of the results.
• In big data systems, by contrast, the cost of storing voluminous data is lower.
• Therefore all the data is stored in big data systems and the points of correlation are
identified, which provides highly accurate results.
3. Explain the prominent big data analytics tools and techniques.
Following are some of the prominent big data analytics tools and techniques that are used by
analytics developers.
Cassandra
• This is the most applauded and widely used big data tool because it offers an
effective management of large and intricate amounts of data.
• This is a database which offers high availability and scalability without affecting
the performance of commodity hardware and cloud infrastructure.
• Cassandra has many advantages and some of those are fault tolerance,
decentralization, durability, performance, professional support, elasticity, and
scalability.
• Since this tool has so many qualities, it is loved by analytics developers.
Companies using the Cassandra big data analytics tool include eBay and Netflix.
Hadoop
• This is a striking product from Apache which has been used by many eminent
companies.
• Hadoop is basically an open-source software framework written in Java so that it
can work with huge data sets.
• It is designed in such a way so that it can scale up from a single server to
hundreds of machines.
• The most prominent feature of this advanced software library is superior
processing of voluminous data sets.
• Many companies choose the big data tool Hadoop because of its great processing
capabilities. The developers provide regular updates and improvements to the
product.
Knime:
• This is a big data analytics open source data tool. Knime is a leading analytics
platform which provides an open solution for data-driven innovation.
• With the help of this tool, you can discover the hidden potential of your data,
mine for fresh insights, and can predict new futures by analyzing the data.
• With nearly 1000 modules, hundreds of ready-to-run examples, a complete range
of integrated tools, and a wealth of advanced algorithms available, the Knime
analytics platform is certainly the best toolbox for any data scientist who wants to
accomplish the job in a hassle-free way.
• This tool can support any type of data like XML, JSON, Images, documents, and
more. This tool also possesses advanced predictive and machine learning
algorithms.
Open Refine:
• Are you stuck with large and voluminous data sets? Then this tool is ideal for
you; it helps you explore huge and messy data sets easily.
• Basically, Open Refine helps to organize data that would otherwise be nothing
but a mess and muddle.
• This tool helps you in cleaning and transforming data from one format into
another.
• This data tool can also be used to link and extend your datasets with web services
and other peripheral data.
• Earlier, Open Refine was known as Google Refine, but since 2012 Google has no
longer supported the project and it has been rebranded as Open Refine.
R language:
• R is a free, open-source language and environment for statistical computing and
graphics. It is widely used by data analysts and statisticians for data manipulation,
statistical modeling and visualization of large data sets.
Plotly:
• As a successful big data analytics tool, Plotly is used to create great dynamic
visualizations even when the organization has inadequate time or skills for
meeting big data needs.
• With the help of this tool, you can create stunning and informative graphics very
effortlessly.
• Basically, Plotly is used for composing, editing, and sharing interactive data
visualization via web.
Bokeh:
• This tool has many resemblances with Plotly. This tool is very effective and
useful if you want to create easy and informative visualizations.
• Bokeh is a Python interactive visualization library which helps you in creating
astounding and meaningful visual presentation of data in the web browsers.
• Thus, this tool is widely used by experienced big data analytics practitioners to create
interactive data applications, dashboards, and plots quickly and easily.
• Many data analytics experts claimed that Bokeh is the most progressive and
effective visual data representation tool.
Neo4j:
• Neo4j is one of the leading big data analytics tools as it takes the big data business
to the next level.
• Neo4j is a graph database management system developed by Neo4j Inc. This tool
helps you work with data and, above all, with the connections between data items.
• The connections between the data drive modern intelligent applications, and
Neo4j is the tool that transforms these connections to gain competitive advantage.
• As per DB-Engines ranking, Neo4j is the most popular graph database.
Rapid miner:
• This is certainly one of the favourite tools for all the data specialists. Like Knime,
this is also an open source data science platform which operates through visual
programming.
• This tool has the capability of manipulating, analysing, modeling and integrating
the data into business processes.
• RapidMiner helps data science teams to become more productive by giving an
open source platform for data preparation, model deployment, and machine
learning.
• Its unified data science platform accelerates the building of complete analytical
workflows.
• From data preparation to machine learning to model deployment, everything can
be done under a single environment.
• This actually enhances the efficiency and lessens the time for various data science
projects.
Wolfram Alpha:
• If you want to do something new from your data, then this could be an ideal tool
for you. This will give you every minute detail of your data.
• This famous tool was developed by Wolfram alpha LLC which is a subsidiary of
Wolfram Research.
• If you want to do advanced research on financial, historical, social, and other
professional areas, then you must use this platform.
• For example, if you type Microsoft, then you will receive miscellaneous information
including input interpretation, fundamentals, financials, new trade, price,
performance comparisons, data return analysis, and much more relevant
information.
Orange:
• Orange is an open source data visualization and data analysis tool which can be
used by both novice and sagacious persons in the field of data analytics.
• This tool provides interactive workflows with a large toolbox. With the help of
this toolbox, you can create interactive workflows to analyse and visualize data.
• Orange is crammed with many different visualizations: from scatter plots, bar
charts and trees to dendrograms, networks and heat maps, you can find everything
in this tool.
Node XL:
• This is a data visualization and analysis software tool for relationships and
networks. This tool offers exact calculations to the users.
• You will be glad to know that it is a free and open-source network analysis and
visualization software tool which has a wide range of applications.
• This tool is considered as one of the best and latest statistical tools for data
analysis which gives advanced network metrics, automation, access to social
media network data importers, and many more things.
Storm:
• Storm has inscribed its name as one of the popular data analytics tools because of
its superior streaming data processing capabilities in real time.
• You can even integrate this tool with many other tools like Apache Slider in order
to manage and secure your data.
• Storm can be used by an organization in many cases like data monetization, cyber
security analytics, detection of the threat, operational dashboards, real-time
customer management, etc.
• All these functions can enhance your business growth and will give you many
opportunities for the betterment of your business.
• Hopefully, from the above-mentioned list, you have got enough information regarding
some of the best data analytics tools which will be ruling in the upcoming years.
• If you want to establish your business firmly, then enhance your knowledge of
these data analytics tools.
4. Draw the Structure of Big Data and illustrate the challenges of conventional systems.
Figure: Big Data structures, models and their linkage at different processing stages.
In the past, the term ‘Analytics' has been used in the business intelligence world to
provide tools and intelligence to gain insight into the data through fast, consistent, interactive
access to a wide variety of possible views of information.
Data mining has been used in enterprises to keep pace with the critical monitoring and
analysis of mountains of data. The main challenge in the traditional approach is how to unearth
all the hidden information within the vast amount of data.
• Traditional analytics analyzes only known data terrain, and only data that is well
understood. It cannot work on unstructured data efficiently.
• Traditional analytics is built on top of the relational data model: relationships between
the subjects of interest are created inside the system and the analysis is done
based on them. This approach is not adequate for big data analytics.
• Traditional analytics is batch oriented and we need to wait for nightly ETL (extract,
transform and load) and transformation jobs to complete before the required insight is
obtained.
• Parallelism in a traditional analytics system is achieved through costly hardware like
MPP (Massively Parallel Processing) systems
• Inadequate support of aggregated summaries of data
5. Briefly explain the Architecture of Hadoop Distributed File Systems (HDFS)
• Most computing is done on a single processor, with its main memory, cache, and local
disk (a compute node).
• In the past, applications that called for parallel processing, such as large scientific
calculations, were done on special-purpose parallel computers with many processors and
specialized hardware.
• However, the prevalence of large-scale Web services has caused more and more
computing to be done on installations with thousands of compute nodes operating more
or less independently.
• In these installations, the compute nodes are commodity hardware, which greatly reduces
the cost compared with special-purpose parallel machines.
• These new computing facilities have given rise to a new generation of programming
systems. These systems take advantage of the power of parallelism and at the same time
avoid the reliability problems that arise when the computing hardware consists of
thousands of independent components, any of which could fail at any time.
In a DFS we first need access transparency and location transparency. Further distributed file
system requirements such as performance, scalability, concurrency control, fault tolerance and
security emerged and were met in the later phases of DFS development.
✓ Scaling transparency: increases in the size of the storage and of the network should be
transparent.
The following are the characteristics of these computing installations and the specialized file
systems that have been developed to take advantage of them.
• The bandwidth of inter-rack communication is somewhat greater than the intra rack
Ethernet, but given the number of pairs of nodes that might need to communicate
between racks, this bandwidth may be essential.
• However, there may be many more racks and many more compute nodes per rack.
• It is a fact of life that components fail, and the more components, such as compute nodes
and interconnection networks, a system has, the more frequently something in the system
will not be working at any given time. Some important calculations take minutes or even
hours on thousands of compute nodes.
• If we had to abort and restart the computation every time one component failed, then the
computation might never complete successfully.
The solution to this problem takes two forms:
1. Files must be stored redundantly. If we did not duplicate the file at several
compute nodes, then if one node failed, all its files would be unavailable until the node is
replaced. If we did not back up the files at all, and the disk crashes, the files would be lost
forever.
2. Computations must be divided into tasks, such that if any one task fails
to execute to completion, it can be restarted without affecting other tasks. This strategy is
followed by the map-reduce programming system.
Large-Scale File-System Organization
• To exploit cluster computing, files must look and behave somewhat differently from the
conventional file systems found on single computers.
• This new file system, often called a distributed file system or DFS (although this term
had other meanings in the past), is typically used as follows.
• Files can be enormous, possibly a terabyte in size. If you have only small
files, there is no point using a DFS for them.
• Files are rarely updated. Rather, they are read as data for some calculation, and
possibly additional data is appended to files from time to time.
Files are divided into chunks, which are typically 64 megabytes in size. Chunks are
replicated, perhaps three times, at three different compute nodes.
• Moreover, the nodes holding copies of one chunk should be located on different racks, so
we don’t lose all copies due to a rack failure.
• Normally, both the chunk size and the degree of replication can be decided by the user.
• To find the chunks of a file, there is another small file called the master node or name
node for that file.
• The master node is itself replicated, and a directory for the file system as a whole knows
where to find its copies.
• The directory itself can be replicated, and all participants using the DFS know where the
directory copies are.
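The chunking and replica-placement rules just described can be illustrated with a small Python sketch. This is a toy model rather than HDFS code: the chunk size, replication factor, rack names and node names below are assumptions chosen only for illustration.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the typical chunk size mentioned above
REPLICATION = 3                 # typical degree of replication

# Hypothetical cluster layout: rack name -> list of compute nodes.
RACKS = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
    "rack3": ["node5", "node6"],
}

def number_of_chunks(file_size_bytes):
    # A file is divided into fixed-size chunks; the last chunk may be partial.
    return (file_size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunk_id):
    # Put the replicas of one chunk on nodes in different racks, so that a
    # single rack failure cannot destroy every copy of the chunk.
    rack_names = list(RACKS)
    placement = []
    for i in range(REPLICATION):
        rack = rack_names[(chunk_id + i) % len(rack_names)]
        node = RACKS[rack][chunk_id % len(RACKS[rack])]
        placement.append(node)
    return placement

# A 200 MB file occupies 4 chunks; each chunk gets 3 replicas on 3 racks.
for chunk in range(number_of_chunks(200 * 1024 * 1024)):
    print("chunk", chunk, "->", place_replicas(chunk))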
6. Explain the Map Reduce programming model.
Map-reduce is a style of computing that has been implemented several times. An implementation of
map-reduce can be used to manage many large-scale computations in a way that is tolerant of
hardware faults.
All you need to write are two functions, called Map and Reduce, while the system manages the
parallel execution, coordination of tasks that execute Map or Reduce. In brief, a map-reduce
computation executes as follows:
1. Some number of Map tasks each is given one or more chunks from a distributed file
system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value
pairs are produced from the input data is determined by the code written by the user for the Map
function.
2. The key-value pairs from each Map task are collected by a master controller and sorted
by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same
key wind up at the same Reduce task.
3. The Reduce tasks work on one key at a time, and combine all the values associated with
that key in some way. The manner of combination of values is determined by the code written by
the user for the Reduce function.
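To make the three steps above concrete, the following self-contained Python sketch (an illustrative toy, not Hadoop code) runs a word-count computation: the Map function turns each document into (word, 1) pairs, the master groups the pairs by key, and the Reduce function sums the values for each key.

from collections import defaultdict

def map_fn(document):
    # Map task: turn one input element (a document) into (key, value) pairs.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce task: combine all the values associated with one key.
    return (key, sum(values))

def map_reduce(documents):
    # Step 1: apply the Map function to every input element.
    pairs = [pair for doc in documents for pair in map_fn(doc)]
    # Step 2: the master controller groups the key-value pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Step 3: apply the Reduce function to each key and its list of values.
    return [reduce_fn(key, values) for key, values in groups.items()]

print(map_reduce(["big data is big", "data is data"]))
# -> [('big', 2), ('data', 3), ('is', 2)]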
o We view input files for a Map task as consisting of elements, which can be any
type: a tuple or a document, for example.
• A chunk is a collection of elements, and no element is stored across two chunks.
• Technically, all inputs to Map tasks and outputs from Reduce tasks are of the key-value-
pair form, but normally the keys of input elements are not relevant and we shall tend to
ignore them.
• Insisting on this form for inputs and outputs is motivated by the desire to allow
composition of several map-reduce processes.
A Map function is written to convert input elements to key-value pairs.
• The types of keys and values are each arbitrary.
• Further, keys are not “keys” in the usual sense; they do not have to be unique. Rather a
Map task can produce several key-value pairs with the same key, even from the same
element.
• Grouping and aggregation is done the same way, regardless of what Map and Reduce
tasks do. The master controller process knows how many Reduce tasks there will be, say
r such tasks.
• The user typically tells the map-reduce system what r should be.
• Then the master controller normally picks a hash function that applies to keys and
produces a bucket number from 0 to r − 1. Each key that is output by a Map task is
hashed and its key-value pair is put in one of r local files.
• Each file is destined for one of the Reduce tasks. After all the Map tasks have completed
successfully, the master controller merges the file from each Map task that is destined for
a particular Reduce task and feeds the merged file to that process as a sequence of key-
list-of-value pairs.
• That is, for each key k, the input to the Reduce task that handles key k is a pair of the
form (k, [v1, v2, . . . , vn]), where (k, v1), (k, v2), . . . , (k, vn) are all the key-value pairs
with key k coming from all the Map tasks.
o The Reduce function is written to take pairs consisting of a key and its list of
associated values and combine those values in some way.
• The output of a Reduce task is a sequence of key-value pairs consisting of each input key
k that the Reduce task received, paired with the combined value constructed from the list
of values that the Reduce task received along with key k.
• The outputs from all the Reduce tasks are merged into a single file.
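The grouping step described above can be sketched in a few lines of Python (illustrative only; the number of Reduce tasks and the sample pairs are assumptions). Each key is hashed to a bucket number from 0 to r − 1, so every pair with the same key lands in the local file destined for the same Reduce task.

r = 4   # number of Reduce tasks, normally chosen by the user

def bucket(key):
    # Hash each key to a bucket number from 0 to r - 1 (consistent within a run).
    return hash(key) % r

# One local file (represented here as a list) per Reduce task.
local_files = {b: [] for b in range(r)}

# Sample key-value pairs produced by one Map task (hypothetical).
for key, value in [("big", 1), ("data", 1), ("big", 1), ("is", 1)]:
    local_files[bucket(key)].append((key, value))

# All pairs with the same key end up in the same bucket, so after the master
# merges the corresponding files from every Map task, the Reduce task that
# handles key k receives the pair (k, [v1, v2, ..., vn]).
print(local_files)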
Combiners
• It is common for the Reduce function to be associative and commutative. That is, the
values to be combined can be combined in any order, with the same result. It doesn’t
matter how we group a list of numbers v1, v2, . . . , vn; the sum will be the same.
• When the Reduce function is associative and commutative, it is possible to push some of
what Reduce does to the Map tasks.
• For example, instead of the Map tasks in the word-count example producing many pairs (w, 1), (w,
1), . . ., we could apply the Reduce function within the Map task, before the output of the
Map tasks is subject to grouping and aggregation.
• These key-value pairs would thus be replaced by one pair with key w and value equal to
the sum of all the 1’s in all those pairs.
• That is, the pairs with key w generated by a single Map task would be combined into a
pair (w,m), where m is the number of times that w appears among the documents handled
by this Map task.
• Note that it is still necessary to do grouping and aggregation and to pass the result to the
Reduce tasks, since there will typically be one key-value pair with key w coming from
each of the Map tasks.
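Because the word-count Reduce function (addition) is associative and commutative, the combining can be pushed into the Map task, as the following minimal Python sketch shows (this illustrates the idea only; it is not Hadoop's Combiner API).

from collections import Counter

def map_with_combiner(document):
    # Instead of emitting (w, 1) once per occurrence, apply the Reduce logic
    # (summation) inside the Map task and emit one pair (w, m) per word,
    # where m is the number of times w appears in this chunk.
    return list(Counter(document.split()).items())

print(map_with_combiner("big data is big data"))
# -> [('big', 2), ('data', 2), ('is', 1)]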
o Let us now consider in more detail how a program using map-reduce is executed.
• Taking advantage of a library provided by a map-reduce system such as Hadoop, the user
program forks a Master controller process and some number of Worker processes at
different compute nodes.
• Normally, a Worker handles either Map tasks or Reduce tasks, but not both.
• The Master has many responsibilities. One is to create some number of Map tasks and
some number of Reduce tasks, these numbers being selected by the user program.
• These tasks will be assigned to Worker processes by the Master. It is reasonable to create
one Map task for every chunk of the input file(s), but we may wish to create fewer
Reduce tasks.
• The reason for limiting the number of Reduce tasks is that it is necessary for each Map
task to create an intermediate file for each Reduce task, and if there are too many Reduce
tasks the number of intermediate files explodes.
• A Worker process reports to the Master when it finishes a task, and a new task is
scheduled by the Master for that Worker process.
• Each Map task is assigned one or more chunks of the input file(s) and executes on it the
code written by the user.
• The Map task creates a file for each Reduce task on the local disk of the Worker that
executes the Map task.
• The Master is informed of the location and sizes of each of these files, and the Reduce
task for which each is destined.
• When a Reduce task is assigned by the master to a Worker process, that task is given all
the files that form its input.
• The Reduce task executes code written by the user and writes its output to a file that is
part of the surrounding distributed file system.
❖ The worst thing that can happen is that the compute node at which the Master is
executing fails. In this case, the entire map-reduce job must be restarted.
❖ But only this one node can bring the entire process down; other failures will be managed
by the Master, and the map-reduce job will complete eventually.
❖ Suppose the compute node at which a Map worker resides fails. This failure will be
detected by the Master, because it periodically pings the Worker processes.
❖ All the Map tasks that were assigned to this Worker will have to be redone, even if they
had completed.
❖ The reason for redoing completed Map tasks is that their output destined for the Reduce
tasks resides at that compute node, and is now unavailable to the Reduce tasks.
❖ The Master sets the status of each of these Map tasks to idle and will schedule them on a
Worker when one becomes available.
❖ The Master must also inform each Reduce task that the location of its input from that
Map task has changed. Dealing with a failure at the node of a Reduce worker is simpler.
❖ The Master simply sets the status of its currently executing Reduce tasks to idle. These
will be rescheduled on another reduce worker later.
The Map Reduce algorithm contains two important tasks, namely Map and Reduce.
❖ The output of the Mapper class is used as input by the Reducer class, which in turn searches
for matching pairs and reduces them.
❖ Map Reduce implements various mathematical algorithms to divide a task into small
parts and assign them to multiple systems.
❖ In technical terms, Map Reduce algorithm helps in sending the Map & Reduce tasks to
appropriate servers in a cluster.
❖ Map-reduce is not a solution to every problem, not even every problem that profitably
can use many compute nodes operating in parallel.
❖ Thus, we would not expect to use either a DFS or an implementation of map-reduce for
managing on-line retail sales, even though a large on-line retailer such as Amazon.com
uses thousands of compute nodes when processing requests over the Web.
❖ The reason is that the principal operations on Amazon data involve responding to
searches for products, recording sales, and so on, processes that involve relatively little
calculation and that change the database.
❖ On the other hand, Amazon might use map-reduce to perform certain analytic queries on
large amounts of data, such as finding for each user those users whose buying patterns
were most similar.
❖ The original purpose for which the Google implementation of map-reduce was created
was to execute very large matrix-vector multiplications as are needed in the calculation of
Page Rank.
❖ We shall see that matrix-vector and matrix-matrix calculations fit nicely into the map-
reduce style of computing.
❖ Another important class of operations that can use map-reduce effectively are the
relational-algebra operations. We shall examine the map-reduce execution of these
operations as well.
7. Illustrate the concepts of HADOOP YARN.
❖ Sometimes called Map Reduce 2.0, YARN is a software rewrite that decouples
Map Reduce's resource management and scheduling capabilities from the data
processing component, enabling Hadoop to support more varied processing
approaches and a broader array of applications.
❖ For example, Hadoop clusters can now run interactive querying and streaming
data applications simultaneously with Map Reduce batch jobs.
❖ YARN combines a central resource manager that reconciles the way applications
use Hadoop system resources with node manager agents that monitor the
processing operations of individual cluster nodes.
❖ Separating HDFS from Map Reduce with YARN makes the Hadoop environment
more suitable for operational applications that can't wait for batch jobs to finish.
Resource Manager
The Resource Manager is the master daemon that arbitrates the available cluster resources among
the competing applications, working with the per-node Node Managers and the per-application
Application Masters.
Application Master
The Application Master allows YARN to exhibit the following key characteristics:
Scale:
❖ The Application Master provides much of the functionality of the traditional Resource
Manager so that the entire system can scale more dramatically.
❖ In tests, we’ve already successfully simulated 10,000 node clusters composed of modern
hardware without significant issue.
❖ This is one of the key reasons that we have chosen to design the Resource Manager as
a pure scheduler i.e. it doesn’t attempt to provide fault-tolerance for resources.
❖ We shifted that to become a primary responsibility of the Application Master instance.
Furthermore, since there is an instance of an Application Master per application, the
Application Master itself isn’t a common bottleneck in the cluster.
Open:
Moving all application framework specific code into the Application Master generalizes the
system so that we can now support multiple frameworks such as Map Reduce, MPI and Graph
Processing.
Resource Model
YARN supports a very general resource model for applications. An application can request
resources with highly specific requirements such as:
• Resource-name (hostname, rack name – we are in the process of generalizing this further
to support more complex network topologies with YARN-18).
• Memory (in MB)
• In future, expect us to add more resource-types such as disk/network I/O, GPUs etc.
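As a purely illustrative sketch (the field names below are assumptions, not YARN's actual API), a resource request under this model can be thought of as a record carrying a resource name, an amount of memory and a container count.

from dataclasses import dataclass

@dataclass
class ResourceRequest:
    # Hypothetical model of a YARN-style container request.
    resource_name: str   # hostname or rack name; "*" for any location
    memory_mb: int       # memory in MB
    containers: int = 1  # how many such containers are wanted

# An Application Master might ask the Resource Manager for, e.g.,
# two 2048 MB containers anywhere and one 1024 MB container on rack1.
requests = [
    ResourceRequest(resource_name="*", memory_mb=2048, containers=2),
    ResourceRequest(resource_name="/rack1", memory_mb=1024),
]
for request in requests:
    print(request)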
• The Job Tracker is responsible for resource management (managing the worker nodes i.e.
Task Trackers), tracking resource consumption/availability and also job life-cycle
management (scheduling individual tasks of the job, tracking progress, providing fault-
tolerance for tasks etc).
• The Task Tracker has simple responsibilities – launch/teardown tasks on orders from the
Job Tracker and provide task-status information to the Job Tracker periodically.
How Yarn Works
YARN’s original purpose was to split up the two major responsibilities of the Job
Tracker/Task Tracker into separate entities:
❖ The Resource Manager and the Node Manager formed the new generic system for
managing applications in a distributed manner.
❖ The Resource Manager is the ultimate authority that arbitrates resources among all
applications in the system.
❖ The Application Master is a framework-specific entity that negotiates resources from the
Resource Manager and works with the Node Manager(s) to execute and monitor the
component tasks.
❖ The Resource Manager has a scheduler, which is responsible for allocating resources to the
various applications running in the cluster, according to constraints such as queue capacities
and user limits.
❖ Each Application Master has responsibility for negotiating appropriate resource containers
from the scheduler, tracking their status, and monitoring their progress.
❖ From the system perspective, the Application Master runs as a normal container.
❖ The Node Manager is the per-machine slave, which is responsible for launching the
applications’ containers, monitoring their resource usage (cpu, memory, disk, network) and
reporting the same to the Resource Manager.