UNIT I INTRODUCTION TO BIG DATA
YEAR/SEM: IV / VIII
Regulation: 2017
Evolution of Big data - Best Practices for Big data Analytics - Big data characteristics - Validating - The Promotion of the Value of Big Data - Big Data Use Cases - Characteristics of Big Data Applications - Perception and Quantification of Value - Understanding Big Data Storage - A General Overview of High Performance Architecture - HDFS - Map Reduce and YARN - Map Reduce Programming Model
Part - A
Many IT tools are available for big data projects. Organizations whose data workloads are
constant and predictable are better served by a traditional database, whereas organizations
challenged by increasing data demands will need to take advantage of Hadoop's scalable
infrastructure.
2. What is analysis?
It is the process of exploring data and reports in order to extract meaningful insights which can
be used to better understand and improve business performance.
Map Reduce provides a data-parallel programming model for clusters of commodity machines. It
was pioneered by Google, which processes 20 PB of data per day with it. Map Reduce was popularized
by the Apache Hadoop project and is used by Yahoo, Facebook, Amazon and others.
• Marketing
• Finance
• Government
• Healthcare
• Insurance
• Retail
6. Define Data analytics.
Data analytics (DA) is the science of examining raw data with the purpose of drawing
conclusions about that information. Data analytics is used in many industries to allow companies
and organizations to make better business decisions and in the sciences to verify or disprove
existing models or theories.
• Knowledge mining
• Knowledge extraction
• Data/ pattern analysis.
• Data Archaeology
• Data dredging
• Data cleaning
• Data Mining
• Pattern Evaluation
• Knowledge Presentation
• Data Integration
• Data Selection
• Data Transformation
Concept of combining the predictions made from multiple models of data mining and
analyzing those predictions to formulate a new and previously unknown prediction.
• GUI
• Pattern Evaluation
• Database or data warehouse server
10. Define descriptive model
It is used to determine the patterns and relationships in a sample of data. Data mining tasks
that belong to the descriptive model:
• Clustering
• Summarization
• Association rules
• Sequence discovery
A pattern represents knowledge if it is easily understood by humans; valid on test data with
some degree of certainty; and potentially useful, novel, or validates a hunch about which the
user was curious. Measures of pattern interestingness, either objective or subjective, can be used
to guide the discovery process.
In Hadoop 1.0, the batch processing framework Map Reduce was closely paired with
HDFS (Hadoop Distributed File System). With the addition of YARN to these two components,
giving birth to Hadoop 2.0, came a lot of differences in the ways in which Hadoop worked. Let’s
go through these differences.
Namespace: with YARN, Hadoop supports multiple namespaces, whereas in Hadoop 1.0 only one
namespace, i.e., HDFS, could be supported.
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models. It
is designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the library
itself is designed to detect and handle failures at the application layer, so delivering a highly-
available service on top of a cluster of computers, each of which may be prone to failures.
Big Data analytics is indeed a revolution in the field of Information Technology. The
use of data analytics by companies is increasing every year. The primary focus of the
companies is on customers. Hence the field is flourishing in Business to Consumer (B2C)
applications. We divide the analytics into different types as per the nature of the environment.
We have three divisions of Big Data analytics: Prescriptive Analytics, Predictive Analytics, and
Descriptive Analytics.
Part - B
1. What are the Characteristics of Big Data? List the different types of big data
applications. 13 Marks
(i) Volume
• The name 'Big Data' itself is related to a size which is enormous. The size of the data
plays a very crucial role in determining its value.
• Also, whether particular data can actually be considered Big Data or not depends
upon the volume of the data. Hence, 'Volume' is one characteristic which needs to be
considered while dealing with 'Big Data'.
(ii) Variety
• Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured.
• In earlier days, spreadsheets and databases were the only sources of data
considered by most applications.
• Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs,
audio, etc. is also being considered in analysis applications.
• This variety of unstructured data poses certain issues for storage, mining and
analyzing data.
(iii) Velocity
• Velocity refers to the speed at which data is generated and processed to meet demands.
Big data flows in at high speed from sources such as business processes, machines,
networks, social media and mobile devices, and this flow is massive and continuous.
(iv) Variability
• This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
Real Time Big data Applications:
Demand can be forecasted properly as per different conditions available with Big Data.
Big data can be used to identify machinery and process variations that may be indicators
of quality problems.
Based on the data available, analysis can be done to ensure proper distribution in the right
markets.
Big data helps in identifying better marketing strategies that could increase sales.
6) Price Management using big data
To maintain position in the market, price management plays a key role, and Big data helps
businesses in knowing the market trends relevant to it.
7) Merchandising
Big Data plays a major role in sales for the retail market also.
It helps in increasing sales for the business. It also helps in optimizing the assignment of sales
resources and accounts, product mix and other operations.
• Different tools can be used to monitor store operations which reduce manual
work.
• Big data helps in adjusting inventory levels on the basis of predicted buying
patterns, study of demographics, weather, key events, and other factors.
• Big Data has provided the biggest opportunity to companies like Citibank to see the
big picture, by balancing the sensitive nature of the data used to deliver value to
clients with the need to prioritize the privacy and protection of information.
• It has been fully adopted by many companies to drive business growth and
enhance the services they provide to customers.
• Income tax departments have also benefited from big data.
12) Big data in Finance sector
• Financial services have widely adopted big data analytics to inform better
investment decisions with consistent returns.
• The big data pendulum for financial services has swung from passing fad to large
deployments last year.
• A recent report, “Global Big Data Analytics Market in Telecom Industry 2014-
2018,” found that use of data analytics tools in telecom sector is expected to grow
at a compound annual growth rate of 28.28 percent over the next four years.
• Mobile telecom operators harness Big Data with a combined Actuate and Hadoop solution.
• Big data is used for analyzing data in the electronic medical record (EMR) system
with the goal of reducing costs and improving patient care.
• This Data includes the unstructured data from physician notes, pathology reports
etc. Big Data and healthcare analytics have the power to predict, prevent & cure
diseases.
• Big data is changing the media and entertainment industry, giving users and
viewers a much more personalized and enriched experience.
• Big data is used for increasing revenues, understanding real-time customer
sentiment, increasing marketing effectiveness and ratings and viewership.
17) Big Data in tourism
• Big data is transforming the global tourism industry. People know more about the
world than ever before.
• People have much more detailed itineraries these days with the help of Big data.
Big data is a driving factor behind every marketing decision made by social media
companies and it is driving personalization to the extreme.
2. Differentiate between traditional data and big data.
The major differences between traditional data and big data are discussed below.
Data architecture
• Traditional data use centralized database architecture in which large and complex
problems are solved by a single computer system.
• A centralized architecture is costly and ineffective for processing large amounts of
data. Big data is based on a distributed database architecture, where a large problem
is solved by dividing the data into several smaller blocks.
• Then the solution to a problem is computed by several different computers present
in a given computer network. The computers communicate to each other in order
to find the solution to a problem.
• The distributed database provides better computing and lower cost, and also improves
performance as compared to the centralized database system.
• This is because centralized architecture is based on the mainframes which are not
as economic as microprocessors in distributed database system.
• Also the distributed database has more computational power as compared to the
centralized database system which is used to manage traditional data.
Types of data
• Traditional database systems are based on the structured data i.e. traditional data
is stored in fixed format or fields in a file.
• Examples of structured data include relational database systems (RDBMS) and
spreadsheets, which only answer questions about what happened.
• A traditional database only provides insight into a problem at a small scale.
• However, in order to enhance the ability of an organization to gain more insight
into the data, and also to learn about the metadata, unstructured data is used.
• Big data uses the semi-structured and unstructured data and improves the variety
of the data gathered from different sources like customers, audience or
subscribers.
• After collection, big data transforms it into knowledge-based information.
Volume of data
• The traditional database system can store only a small amount of data, ranging from
gigabytes to terabytes.
• However, big data helps to store and process large amounts of data consisting of
hundreds of terabytes or petabytes of data and beyond.
• The storage of massive amounts of data reduces the overall cost of storing data and
helps in providing business intelligence.
Data schema
• Traditional database systems are based on a fixed, static schema that is defined
before the data is loaded (schema on write).
• Big data, in contrast, uses a dynamic schema: raw data is stored as it arrives and the
schema is applied only when the data is read (schema on read).
Data relationship
• In the traditional database system, relationships between the data items can be
explored easily as the amount of information stored is small.
• However, big data contains massive or voluminous data, which increases the
difficulty of figuring out the relationships between the data items.
Scaling
• Scaling refers to the demand for the resources and servers required to carry out the
computation.
• Big data is based on a scale-out architecture, under which distributed approaches to
computing are employed across more than one server.
• So, the load of the computation is shared across many servers rather than handled by
a single system.
• However, achieving scalability in the traditional database is very difficult
because the traditional database runs on a single server and requires expensive
servers to scale up.
• Under the traditional database system it is very expensive to store massive amounts
of data, so all the data cannot be stored.
• This decreases the amount of data that can be analyzed, which in turn decreases the
accuracy and confidence of the results.
• In big data systems, by contrast, the cost of storing voluminous data is lower.
• Therefore all the data is stored in big data systems and the points of correlation are
identified, which provides highly accurate results.
3. Explain the prominent big data analytics tools and techniques.
Following are some of the prominent big data analytics tools and techniques that are used by
analytics developers.
Cassandra
• This is the most applauded and widely used big data tool because it offers an
effective management of large and intricate amounts of data.
• This is a database which offers high availability and scalability without affecting
the performance of commodity hardware and cloud infrastructure.
• Cassandra has many advantages and some of those are fault tolerance,
decentralization, durability, performance, professional support, elasticity, and
scalability.
• Since this tool has so many qualities, it is loved by analytics developers.
Companies using the Cassandra big data analytics tool include eBay and Netflix.
Hadoop
• This is a striking product from Apache which has been used by many eminent
companies.
• Hadoop is basically an open-source software framework written in Java so that it
can work with huge data sets.
• It is designed in such a way so that it can scale up from a single server to
hundreds of machines.
• The most prominent feature of this advanced software library is superior
processing of voluminous data sets.
• Many companies choose the big data tool Hadoop because of its great processing
capabilities. The developers provide regular updates and improvements to the
product.
Knime:
• This is a big data analytics open source data tool. Knime is a leading analytics
platform which provides an open solution for data-driven innovation.
• With the help of this tool, you can discover the hidden potential of your data,
mine for fresh insights, and can predict new futures by analyzing the data.
• With nearly 1000 modules, hundreds of ready-to-run examples, a complete range
of integrated tools, and a wealth of advanced algorithms available, the Knime
analytics platform is certainly the best toolbox for any data scientist who wants to
accomplish the job in a hassle-free way.
• This tool can support any type of data like XML, JSON, Images, documents, and
more. This tool also possesses advanced predictive and machine learning
algorithms.
Open Refine:
• Are you stuck with large and voluminous data sets? Then this tool is ideal for
you; it helps you explore huge and messy data sets easily.
• Basically, Open Refine helps to organize data that would otherwise be nothing
but a mess and muddle.
• This tool helps you in cleaning and transforming data from one format into
another.
• This data tool can also be used to link and extend your datasets with web services
and other peripheral data.
• Earlier, Open Refine was known as Google Refine, but since 2012 Google has no
longer supported the project and it has been rebranded as Open Refine.
R language:
• R is a free, open-source language and environment for statistical computing and
graphics. It is widely used by data analysts and statisticians for data manipulation,
statistical modeling and visualization of large data sets.
Plotly:
• As a successful big data analytics tool, Plotly is used to create great dynamic
visualizations even when the organization has inadequate time or skills for
meeting big data needs.
• With the help of this tool, you can create stunning and informative graphics very
effortlessly.
• Basically, Plotly is used for composing, editing, and sharing interactive data
visualization via web.
Bokeh:
• This tool has many resemblances with Plotly. This tool is very effective and
useful if you want to create easy and informative visualizations.
• Bokeh is a Python interactive visualization library which helps you in creating
astounding and meaningful visual presentation of data in the web browsers.
• Thus, this tool is widely used by experienced big data analytics practitioners to create
interactive data applications, dashboards, and plots quickly and easily.
• Many data analytics experts claimed that Bokeh is the most progressive and
effective visual data representation tool.
Neo4j:
• Neo4j is one of the leading big data analytics tools as it takes the big data business
to the next level.
• Neo4j is a graph database management system developed by Neo4j Inc. This tool
helps you work with data and, above all, with the connections between data items.
• The connections between the data drive modern intelligent applications, and
Neo4j is the tool that transforms these connections to gain competitive advantage.
• As per DB-Engines ranking, Neo4j is the most popular graph database.
Rapid miner:
• This is certainly one of the favourite tools for all the data specialists. Like Knime,
this is also an open source data science platform which operates through visual
programming.
• This tool has the capability of manipulating, analysing, modeling and integrating
the data into business processes.
• RapidMiner helps data science teams to become more productive by giving an
open source platform for data preparation, model deployment, and machine
learning.
• Its unified data science platform accelerates the building of complete analytical
workflows.
• From data preparation to machine learning to model deployment, everything can
be done under a single environment.
• This actually enhances the efficiency and lessens the time for various data science
projects.
Wolfram Alpha:
• If you want to do something new from your data, then this could be an ideal tool
for you. This will give you every minute detail of your data.
• This famous tool was developed by Wolfram alpha LLC which is a subsidiary of
Wolfram Research.
• If you want to do advanced research on financial, historical, social, and other
professional areas, then you must use this platform.
• For example, if you type Microsoft, then you will receive miscellaneous information
including input interpretation, fundamentals, financials, new trade, price,
performance comparisons, data return analysis, and much more relevant
information.
Orange:
• Orange is an open source data visualization and data analysis tool which can be
used by both novice and sagacious persons in the field of data analytics.
• This tool provides interactive workflows with a large toolbox. With the help of
this toolbox, you can create interactive workflows to analyse and visualize data.
• Orange is crammed with many different visualizations: from scatter plots, bar
charts and trees to dendrograms, networks and heat maps, you can find everything
in this tool.
Node XL:
• This is a data visualization and analysis software tool for relationships and
networks. This tool offers exact calculations to the users.
• You will be glad to know that it is a free and open-source network analysis and
visualization software tool which has a wide range of applications.
• This tool is considered as one of the best and latest statistical tools for data
analysis which gives advanced network metrics, automation, access to social
media network data importers, and many more things.
Storm:
• Storm has inscribed its name as one of the popular data analytics tools because of
its superior streaming data processing capabilities in real time.
• You can even integrate this tool with many other tools like Apache Slider in order
to manage and secure your data.
• Storm can be used by an organization in many cases like data monetization, cyber
security analytics, detection of the threat, operational dashboards, real-time
customer management, etc.
• All these functions can enhance your business growth and will give you many
opportunities for the betterment of your business.
• Hopefully, from the above-mentioned list, you have got enough information regarding
some of the best data analytics tools which will be ruling in the upcoming years.
• If you want to establish your business firmly, then enhance your knowledge of
these data analytics tools.
4. Draw the Structure of Big Data and illustrate the challenges of conventional systems.
Figure: Big Data structures, models and their linkage at different processing stages.
In the past, the term ‘Analytics' has been used in the business intelligence world to
provide tools and intelligence to gain insight into the data through fast, consistent, interactive
access to a wide variety of possible views of information.
Data mining has been used in enterprises to keep pace with the critical monitoring and
analysis of mountains of data. The main challenge in the traditional approach is how to unearth
all the hidden information within the vast amount of data.
• Traditional analytics analyzes only known data terrain, and only data that is well
understood. It cannot work on unstructured data efficiently.
• Traditional analytics is built on top of the relational data model: relationships between
the subjects of interest are created inside the system and the analysis is done
based on them. This approach is not adequate for big data analytics.
• Traditional analytics is batch oriented and we need to wait for nightly ETL (extract,
transform and load) and transformation jobs to complete before the required insight is
obtained.
• Parallelism in a traditional analytics system is achieved through costly hardware like
MPP (Massively Parallel Processing) systems
• Inadequate support of aggregated summaries of data
5. Briefly explain the Architecture of Hadoop Distributed File Systems (HDFS)
• Most computing is done on a single processor, with its main memory, cache, and local
disk (a compute node).
• In the past, applications that called for parallel processing, such as large scientific
calculations, were done on special-purpose parallel computers with many processors and
specialized hardware.
• However, the prevalence of large-scale Web services has caused more and more
computing to be done on installations with thousands of compute nodes operating more
or less independently.
• In these installations, the compute nodes are commodity hardware, which greatly reduces
the cost compared with special-purpose parallel machines.
• These new computing facilities have given rise to a new generation of programming
systems. These systems take advantage of the power of parallelism and at the same time
avoid the reliability problems that arise when the computing hardware consists of
thousands of independent components, any of which could fail at any time.
In a DFS we first need access transparency and location transparency. Further distributed file
system requirements such as performance, scalability, concurrency control, fault tolerance and
security emerged and were met in the later phases of DFS development.
✓ Scaling transparency: increases in the size of the storage and of the network should be
transparent.
The following are the characteristics of these computing installations and the specialized file
systems that have been developed to take advantage of them.
• The bandwidth of inter-rack communication is somewhat greater than the intra rack
Ethernet, but given the number of pairs of nodes that might need to communicate
between racks, this bandwidth may be essential.
• However, there may be many more racks and many more compute nodes per rack.
• It is a fact of life that components fail, and the more components, such as compute nodes
and interconnection networks, a system has, the more frequently something in the system
will not be working at any given time. Some important calculations take minutes or even
hours on thousands of compute nodes.
• If we had to abort and restart the computation every time one component failed, then the
computation might never complete successfully.
The solution to this problem takes two forms:
1. Files must be stored redundantly. If we did not duplicate the file at several
compute nodes, then if one node failed, all its files would be unavailable until the node is
replaced. If we did not back up the files at all, and the disk crashes, the files would be lost
forever.
2. Computations must be divided into tasks, such that if any one task fails
to execute to completion, it can be restarted without affecting other tasks. This strategy is
followed by the map-reduce programming system.
Large-Scale File-System Organization
• To exploit cluster computing, files must look and behave somewhat differently from the
conventional file systems found on single computers.
• This new file system, often called a distributed file system or DFS (although this term
had other meanings in the past), is typically used as follows.
• Files can be enormous, possibly a terabyte in size. If you have only small
files, there is no point using a DFS for them.
• Files are rarely updated. Rather, they are read as data for some calculation, and
possibly additional data is appended to files from time to time.
Files are divided into chunks, which are typically 64 megabytes in size. Chunks are
replicated, perhaps three times, at three different compute nodes.
• Moreover, the nodes holding copies of one chunk should be located on different racks, so
we don’t lose all copies due to a rack failure.
• Normally, both the chunk size and the degree of replication can be decided by the user.
• To find the chunks of a file, there is another small file called the master node or name
node for that file.
• The master node is itself replicated, and a directory for the file system as a whole knows
where to find its copies.
• The directory itself can be replicated, and all participants using the DFS know where the
directory copies are.
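The chunking and replica-placement rules just described can be illustrated with a small Python sketch. This is a toy model rather than HDFS code: the chunk size, replication factor, rack names and node names below are assumptions chosen only for illustration.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the typical chunk size mentioned above
REPLICATION = 3                 # typical degree of replication

# Hypothetical cluster layout: rack name -> list of compute nodes.
RACKS = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
    "rack3": ["node5", "node6"],
}

def number_of_chunks(file_size_bytes):
    # A file is divided into fixed-size chunks; the last chunk may be partial.
    return (file_size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunk_id):
    # Put the replicas of one chunk on nodes in different racks, so that a
    # single rack failure cannot destroy every copy of the chunk.
    rack_names = list(RACKS)
    placement = []
    for i in range(REPLICATION):
        rack = rack_names[(chunk_id + i) % len(rack_names)]
        node = RACKS[rack][chunk_id % len(RACKS[rack])]
        placement.append(node)
    return placement

# A 200 MB file occupies 4 chunks; each chunk gets 3 replicas on 3 racks.
for chunk in range(number_of_chunks(200 * 1024 * 1024)):
    print("chunk", chunk, "->", place_replicas(chunk))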
6. Explain the Map Reduce programming model.
Map-reduce is a style of computing that has been implemented several times. An implementation of
map-reduce can be used to manage many large-scale computations in a way that is tolerant of
hardware faults.
All you need to write are two functions, called Map and Reduce, while the system manages the
parallel execution, coordination of tasks that execute Map or Reduce. In brief, a map-reduce
computation executes as follows:
1. Some number of Map tasks each is given one or more chunks from a distributed file
system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value
pairs are produced from the input data is determined by the code written by the user for the Map
function.
2. The key-value pairs from each Map task are collected by a master controller and sorted
by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same
key wind up at the same Reduce task.
3. The Reduce tasks work on one key at a time, and combine all the values associated with
that key in some way. The manner of combination of values is determined by the code written by
the user for the Reduce function.
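To make the three steps above concrete, the following self-contained Python sketch (an illustrative toy, not Hadoop code) runs a word-count computation: the Map function turns each document into (word, 1) pairs, the master groups the pairs by key, and the Reduce function sums the values for each key.

from collections import defaultdict

def map_fn(document):
    # Map task: turn one input element (a document) into (key, value) pairs.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce task: combine all the values associated with one key.
    return (key, sum(values))

def map_reduce(documents):
    # Step 1: apply the Map function to every input element.
    pairs = [pair for doc in documents for pair in map_fn(doc)]
    # Step 2: the master controller groups the key-value pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Step 3: apply the Reduce function to each key and its list of values.
    return [reduce_fn(key, values) for key, values in groups.items()]

print(map_reduce(["big data is big", "data is data"]))
# -> [('big', 2), ('data', 3), ('is', 2)]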
o We view input files for a Map task as consisting of elements, which can be any
type: a tuple or a document, for example.
• A chunk is a collection of elements, and no element is stored across two chunks.
• Technically, all inputs to Map tasks and outputs from Reduce tasks are of the key-value-
pair form, but normally the keys of input elements are not relevant and we shall tend to
ignore them.
• Insisting on this form for inputs and outputs is motivated by the desire to allow
composition of several map-reduce processes.
A Map function is written to convert input elements to key-value pairs.
• The types of keys and values are each arbitrary.
• Further, keys are not “keys” in the usual sense; they do not have to be unique. Rather a
Map task can produce several key-value pairs with the same key, even from the same
element.
• Grouping and aggregation is done the same way, regardless of what Map and Reduce
tasks do. The master controller process knows how many Reduce tasks there will be, say
r such tasks.
• The user typically tells the map-reduce system what r should be.
• Then the master controller normally picks a hash function that applies to keys and
produces a bucket number from 0 to r − 1. Each key that is output by a Map task is
hashed and its key-value pair is put in one of r local files.
• Each file is destined for one of the Reduce tasks. After all the Map tasks have completed
successfully, the master controller merges the file from each Map task that is destined for
a particular Reduce task and feeds the merged file to that process as a sequence of key-
list-of-value pairs.
• That is, for each key k, the input to the Reduce task that handles key k is a pair of the
form (k, [v1, v2, . . . , vn]), where (k, v1), (k, v2), . . . , (k, vn) are all the key-value pairs
with key k coming from all the Map tasks.
o The Reduce function is written to take pairs consisting of a key and its list of
associated values and combine those values in some way.
• The output of a Reduce task is a sequence of key-value pairs consisting of each input key
k that the Reduce task received, paired with the combined value constructed from the list
of values that the Reduce task received along with key k.
• The outputs from all the Reduce tasks are merged into a single file.
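The grouping step described above can be sketched in a few lines of Python (illustrative only; the number of Reduce tasks and the sample pairs are assumptions). Each key is hashed to a bucket number from 0 to r − 1, so every pair with the same key lands in the local file destined for the same Reduce task.

r = 4   # number of Reduce tasks, normally chosen by the user

def bucket(key):
    # Hash each key to a bucket number from 0 to r - 1 (consistent within a run).
    return hash(key) % r

# One local file (represented here as a list) per Reduce task.
local_files = {b: [] for b in range(r)}

# Sample key-value pairs produced by one Map task (hypothetical).
for key, value in [("big", 1), ("data", 1), ("big", 1), ("is", 1)]:
    local_files[bucket(key)].append((key, value))

# All pairs with the same key end up in the same bucket, so after the master
# merges the corresponding files from every Map task, the Reduce task that
# handles key k receives the pair (k, [v1, v2, ..., vn]).
print(local_files)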
Combiners
• It is common for the Reduce function to be associative and commutative. That is, the
values to be combined can be combined in any order, with the same result. It doesn’t
matter how we group a list of numbers v1, v2, . . . , vn; the sum will be the same.
• When the Reduce function is associative and commutative, it is possible to push some of
what Reduce does to the Map tasks.
• For example, instead of the Map tasks in the word-count example producing many pairs (w, 1), (w,
1), . . ., we could apply the Reduce function within the Map task, before the output of the
Map tasks is subject to grouping and aggregation.
• These key-value pairs would thus be replaced by one pair with key w and value equal to
the sum of all the 1’s in all those pairs.
• That is, the pairs with key w generated by a single Map task would be combined into a
pair (w,m), where m is the number of times that w appears among the documents handled
by this Map task.
• Note that it is still necessary to do grouping and aggregation and to pass the result to the
Reduce tasks, since there will typically be one key-value pair with key w coming from
each of the Map tasks.
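Because the word-count Reduce function (addition) is associative and commutative, the combining can be pushed into the Map task, as the following minimal Python sketch shows (this illustrates the idea only; it is not Hadoop's Combiner API).

from collections import Counter

def map_with_combiner(document):
    # Instead of emitting (w, 1) once per occurrence, apply the Reduce logic
    # (summation) inside the Map task and emit one pair (w, m) per word,
    # where m is the number of times w appears in this chunk.
    return list(Counter(document.split()).items())

print(map_with_combiner("big data is big data"))
# -> [('big', 2), ('data', 2), ('is', 1)]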
o Let us now consider in more detail how a program using map-reduce is executed.
• Taking advantage of a library provided by a map-reduce system such as Hadoop, the user
program forks a Master controller process and some number of Worker processes at
different compute nodes.
• Normally, a Worker handles either Map tasks or Reduce tasks, but not both.
• The Master has many responsibilities. One is to create some number of Map tasks and
some number of Reduce tasks, these numbers being selected by the user program.
• These tasks will be assigned to Worker processes by the Master. It is reasonable to create
one Map task for every chunk of the input file(s), but we may wish to create fewer
Reduce tasks.
• The reason for limiting the number of Reduce tasks is that it is necessary for each Map
task to create an intermediate file for each Reduce task, and if there are too many Reduce
tasks the number of intermediate files explodes.
• A Worker process reports to the Master when it finishes a task, and a new task is
scheduled by the Master for that Worker process.
• Each Map task is assigned one or more chunks of the input file(s) and executes on it the
code written by the user.
• The Map task creates a file for each Reduce task on the local disk of the Worker that
executes the Map task.
• The Master is informed of the location and sizes of each of these files, and the Reduce
task for which each is destined.
• When a Reduce task is assigned by the master to a Worker process, that task is given all
the files that form its input.
• The Reduce task executes code written by the user and writes its output to a file that is
part of the surrounding distributed file system.
❖ The worst thing that can happen is that the compute node at which the Master is
executing fails. In this case, the entire map-reduce job must be restarted.
❖ But only this one node can bring the entire process down; other failures will be managed
by the Master, and the map-reduce job will complete eventually.
❖ Suppose the compute node at which a Map worker resides fails. This failure will be
detected by the Master, because it periodically pings the Worker processes.
❖ All the Map tasks that were assigned to this Worker will have to be redone, even if they
had completed.
❖ The reason for redoing completed Map tasks is that their output destined for the Reduce
tasks resides at that compute node, and is now unavailable to the Reduce tasks.
❖ The Master sets the status of each of these Map tasks to idle and will schedule them on a
Worker when one becomes available.
❖ The Master must also inform each Reduce task that the location of its input from that
Map task has changed. Dealing with a failure at the node of a Reduce worker is simpler.
❖ The Master simply sets the status of its currently executing Reduce tasks to idle. These
will be rescheduled on another reduce worker later.
The Map Reduce algorithm contains two important tasks, namely Map and Reduce.
❖ The output of the Mapper class is used as input by the Reducer class, which in turn searches
for matching pairs and reduces them.
❖ Map Reduce implements various mathematical algorithms to divide a task into small
parts and assign them to multiple systems.
❖ In technical terms, Map Reduce algorithm helps in sending the Map & Reduce tasks to
appropriate servers in a cluster.
❖ Map-reduce is not a solution to every problem, not even every problem that profitably
can use many compute nodes operating in parallel.
❖ Thus, we would not expect to use either a DFS or an implementation of map-reduce for
managing on-line retail sales, even though a large on-line retailer such as Amazon.com
uses thousands of compute nodes when processing requests over the Web.
❖ The reason is that the principal operations on Amazon data involve responding to
searches for products, recording sales, and so on, processes that involve relatively little
calculation and that change the database.
❖ On the other hand, Amazon might use map-reduce to perform certain analytic queries on
large amounts of data, such as finding for each user those users whose buying patterns
were most similar.
❖ The original purpose for which the Google implementation of map-reduce was created
was to execute very large matrix-vector multiplications as are needed in the calculation of
Page Rank.
❖ We shall see that matrix-vector and matrix-matrix calculations fit nicely into the map-
reduce style of computing.
❖ Another important class of operations that can use map-reduce effectively are the
relational-algebra operations. We shall examine the map-reduce execution of these
operations as well.
7. Illustrate the concepts of HADOOP YARN.
❖ Sometimes called Map Reduce 2.0, YARN is a software rewrite that decouples
Map Reduce's resource management and scheduling capabilities from the data
processing component, enabling Hadoop to support more varied processing
approaches and a broader array of applications.
❖ For example, Hadoop clusters can now run interactive querying and streaming
data applications simultaneously with Map Reduce batch jobs.
❖ YARN combines a central resource manager that reconciles the way applications
use Hadoop system resources with node manager agents that monitor the
processing operations of individual cluster nodes.
❖ Separating HDFS from Map Reduce with YARN makes the Hadoop environment
more suitable for operational applications that can't wait for batch jobs to finish.
Resource Manager
The Resource Manager is the master daemon that arbitrates the available cluster resources among
the competing applications, working with the per-node Node Managers and the per-application
Application Masters.
Application Master
The Application Master allows YARN to exhibit the following key characteristics:
Scale:
❖ The Application Master provides much of the functionality of the traditional Resource
Manager so that the entire system can scale more dramatically.
❖ In tests, we’ve already successfully simulated 10,000 node clusters composed of modern
hardware without significant issue.
❖ This is one of the key reasons that we have chosen to design the Resource Manager as
a pure scheduler i.e. it doesn’t attempt to provide fault-tolerance for resources.
❖ We shifted that to become a primary responsibility of the Application Master instance.
Furthermore, since there is an instance of an Application Master per application, the
Application Master itself isn’t a common bottleneck in the cluster.
Open:
Moving all application framework specific code into the Application Master generalizes the
system so that we can now support multiple frameworks such as Map Reduce, MPI and Graph
Processing.
Resource Model
YARN supports a very general resource model for applications. An application can request
resources with highly specific requirements such as:
• Resource-name (hostname, rack name – we are in the process of generalizing this further
to support more complex network topologies with YARN-18).
• Memory (in MB)
• In future, expect us to add more resource-types such as disk/network I/O, GPUs etc.
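As a purely illustrative sketch (the field names below are assumptions, not YARN's actual API), a resource request under this model can be thought of as a record carrying a resource name, an amount of memory and a container count.

from dataclasses import dataclass

@dataclass
class ResourceRequest:
    # Hypothetical model of a YARN-style container request.
    resource_name: str   # hostname or rack name; "*" for any location
    memory_mb: int       # memory in MB
    containers: int = 1  # how many such containers are wanted

# An Application Master might ask the Resource Manager for, e.g.,
# two 2048 MB containers anywhere and one 1024 MB container on rack1.
requests = [
    ResourceRequest(resource_name="*", memory_mb=2048, containers=2),
    ResourceRequest(resource_name="/rack1", memory_mb=1024),
]
for request in requests:
    print(request)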
• The Job Tracker is responsible for resource management (managing the worker nodes i.e.
Task Trackers), tracking resource consumption/availability and also job life-cycle
management (scheduling individual tasks of the job, tracking progress, providing fault-
tolerance for tasks etc).
• The Task Tracker has simple responsibilities – launch/teardown tasks on orders from the
Job Tracker and provide task-status information to the Job Tracker periodically.
How Yarn Works
YARN’s original purpose was to split up the two major responsibilities of the Job
Tracker/Task Tracker into separate entities:
❖ The Resource Manager and the Node Manager formed the new generic system for
managing applications in a distributed manner.
❖ The Resource Manager is the ultimate authority that arbitrates resources among all
applications in the system.
❖ The Application Master is a framework-specific entity that negotiates resources from the
Resource Manager and works with the Node Manager(s) to execute and monitor the
component tasks.
❖ The Resource Manager has a scheduler, which is responsible for allocating resources to the
various applications running in the cluster, according to constraints such as queue capacities
and user limits.
❖ Each Application Master has responsibility for negotiating appropriate resource containers
from the scheduler, tracking their status, and monitoring their progress.
❖ From the system perspective, the Application Master runs as a normal container.
❖ The Node Manager is the per-machine slave, which is responsible for launching the
applications’ containers, monitoring their resource usage (cpu, memory, disk, network) and
reporting the same to the Resource Manager.