
Amity School of Engineering & Technology

Module-1
Introduction to Big Data
Data Analytics CSE704

By: Dr. Ghanshyam Prasad Dubey


Syllabus

• Course objectives:
– To know the fundamental concepts of big data and analytics.
– To explore tools and practices for working with big data
– To learn about stream computing.
– To know about the research that requires the integration of large
amounts of data
• Module I: Introduction to Big Data: (8 Hours)
– Evolution of Big data – Best Practices for Big data Analytics –
Big data characteristics – Validating – The Promotion of the
Value of Big Data – Big Data Use Cases- Characteristics of Big
Data Applications –Perception and Quantification of Value -
Understanding Big Data Storage – A General Overview of High-
Performance Architecture – HDFS – MapReduce and YARN –
Map Reduce Programming Model.

• Module II: Clustering and Classification: (6 Hours)


– Analytical Theory and Methods: Overview of Clustering – K-
means – Use Cases – Overview of the Method – Determining
the Number of Clusters – Diagnostics – Reasons to Choose and
Cautions – Classification: Decision Trees – Overview of a
Decision Tree – The General Algorithm – Decision Tree
Algorithms – Evaluating a Decision Tree
• Module III: Association and Recommendation System: (8
Hours)
– Analytical Theory and Methods: Association Rules – Overview –
Apriori Algorithm – Evaluation of Candidate Rules – Applications
of Association Rules – Finding Association & Finding Similarity –
Introduction to Streams Concepts – Stream Data Model and
Architecture – Stream Computing, Sampling Data in a Stream –
Filtering Streams – Counting Distinct Elements in a Stream –
Estimating Moments, Case Studies – Real Time Sentiment
Analysis. Using Graph Analytics for Big Data: Graph Analytics

• Module IV: NoSQL Data Management for Big Data and
Visualization: (8 Hours)
– NoSQL Databases: Schema-less Models: Increasing
Flexibility for Data Manipulation – Key Value Stores –
Document Stores – Tabular Stores – Object Data Stores –
Graph Databases – Hive – Sharding – Hbase – Analyzing
big data with twitter – Big data for E-Commerce – Big data
for blogs – Review of Basic Data Analytic Methods using R.

Course Outcomes:
• Upon completion of the course, the
students will be able to:
– Work with big data tools and its analysis
techniques
– Analyze data by utilizing clustering and
classification algorithms
– Learn and apply different mining algorithms
and recommendation systems for large
volumes of data
– Perform analytics on data streams
– Learn NoSQL databases and management.

• Text Book:
– Anand Rajaraman and Jeffrey David Ullman, “Mining of Massive
Datasets”, Cambridge University Press, 2012.
– David Loshin, “Big Data Analytics: From Strategic Planning to
Enterprise Integration with Tools, Techniques, NoSQL, and Graph”,
Morgan Kaufmann/Elsevier Publishers, 2013.
• References:
– EMC Education Services, “Data Science and Big Data Analytics:
Discovering, Analyzing, Visualizing and Presenting Data”, Wiley
publishers, 2015.
– Bart Baesens, “Analytics in a Big Data World: The Essential Guide to
Data Science and its Applications”, Wiley Publishers, 2015.
– Dietmar Jannach and Markus Zanker, “Recommender Systems: An
Introduction”, Cambridge University Press, 2010.
– Kim H. Pries and Robert Dunnigan, “Big Data Analytics: A Practical
Guide for Managers”, CRC Press, 2015.
– Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with
MapReduce”, Synthesis Lectures on Human Language Technologies,
Vol. 3, No. 1, Pages 1-177, Morgan & Claypool Publishers, 2010.

Big data
• Big Data refers to massive amounts of
data produced by different sources like
social media platforms, web logs, sensors,
IoT devices, and many more. It can be
either structured (like tables in a DBMS),
semi-structured (like XML files), or
unstructured (like audio, video, and images).
• Traditional database management
systems cannot handle this vast amount of
data; big data technologies help companies
turn it into valuable insights.

Definitions
• According to Gartner, “Big Data is a High Volume, High
Velocity and High Variety Information Assets that
demand Cost Effective and innovative forms of
Information Processing that enable enhanced insight,
Decision Making and Process Automation”.
• According to Ernst and Young, “Big Data refers to the
Dynamic, Large and Disparate Volumes of Data being
created by People, Tools and Machines. It requires New,
Innovative and Scalable Technology to Collect, Host and
Analytically Process the vast amount of Data gathered in
order to derive Real Time Business Insights that relate to
Customers, Risk, Profit, Performance, Productivity
Management and Share Holder Value”.

Evolution of Big data


• Data Warehousing:
In the 1990s, data warehousing emerged as a
solution to store and analyze large volumes of
structured data.
• Hadoop:
Hadoop was introduced in 2006 by Doug Cutting
and Mike Cafarella. It is an open-source framework
that provides distributed storage and large-scale
data processing.
• NoSQL Databases:
In 2009, NoSQL databases were introduced, which
provide a flexible way to store and retrieve
unstructured data.
• Cloud Computing:
Cloud Computing technology lets companies
store their important data in remote data
centers, saving infrastructure and
maintenance costs.
• Machine Learning:
Machine Learning algorithms analyze huge
amounts of data to extract meaningful
insights from it. This has led to the
development of artificial intelligence (AI)
applications.

• Data Streaming:
Data Streaming technology has emerged
as a solution to process large volumes of
data in real time.
• Edge Computing:
Edge Computing is a distributed computing
paradigm that allows data processing to be
done at the edge of the network, closer to
the source of the data.

Short Story
• 1940s to 1989 – Data Warehousing and
Personal Desktop Computers
• 1989 to 1999 – Emergence of the World
Wide Web
• 2000s to 2010s – Controlling Data
Volume, Social Media and Cloud
Computing
• 2010s to now – Optimization Techniques,
Mobile Devices and IoT

Why Big Data?


• To understand Where, When and Why their customers
buy
• Protect the company’s client base with improved loyalty
programs
• Seize cross-selling and up-selling opportunities
• Provide targeted promotional information
• Optimize workforce planning and operations
• Eliminate inefficiencies in the company’s supply chain
• Predict market trends
• Predict future needs
• Make companies more innovative and competitive
• Discover new sources of revenue

IMPORTANCE OF BIG DATA


• Cost Savings
– Big Data tools like Apache Hadoop, Spark, etc. bring
cost-saving benefits to businesses when they have to
store large amounts of data. These tools help
organizations in identifying more effective ways of
doing business.
• Time-Saving
– Real-time in-memory analytics helps companies to
collect data from various sources. Tools like Hadoop
help them to analyze data immediately, thus helping
them make quick decisions based on what they learn.

• Understand the market conditions


– Big Data analysis helps businesses to get a
better understanding of market situations. For
example, analysis of customer purchasing
behavior helps companies identify their best-
selling products and adjust production
accordingly.
• Social Media Listening
– Companies can perform sentiment analysis
using Big Data tools. These enable them to
get feedback about their company, that is,
who is saying what about the company.

• Boost Customer Acquisition and Retention


– Big data analytics helps businesses to identify
customer related trends and patterns. Customer
behavior analysis leads to a profitable business.
• Solve Advertisers’ Problems and Offer Marketing
Insights
– Big data analytics helps companies refine their
product lines and run more effective, targeted
marketing campaigns.
• The driver of Innovations and Product
Development
– Big data makes companies capable of innovating
and redeveloping their products.

Big data Analytics


• Big data analytics helps businesses and
organizations make better decisions by
revealing information that would have
otherwise been hidden.
• Big data analytics is the use of advanced
analytic techniques against very large,
diverse data sets that include structured,
semi-structured and unstructured data,
from different sources, and in different
sizes from terabytes to zettabytes.

BIG DATA ANALYTICS EXAMPLES


• Big Data in the Airline Industry
– Airlines collect a large volume of data that
results from categories like customer flight
preferences, traffic control, baggage handling
and aircraft maintenance.
• Big Data in Banking
– Big data search analytics helps banks make
better financial decisions by providing insights
into massive amounts of unstructured data.

• Big Data in Government


– Big data analytics allows law enforcement to
work smarter and more efficiently.
• Big Data in Healthcare
– Big data analytics lets hospitals get important
insights out of what would have been an
unmanageable amount of data.
• Big Data in Manufacturing
– The supply chains of manufacturing are
complex and big data analytics allows
manufacturers to better understand how they
work.

• Big Data in Retail


– With big data analytics, retailers are able to
understand customer behavior and
preferences better than ever before.
• Big Data in the Sciences
– Big data visual analytics provides the insights
researchers need to run more trials, faster. It
also allows for automated solutions that improve
speed and efficiency.

Best Practices for Big data Analytics


• Big data analytics draws on data from both
internal and external sources.
• When real-time big data analytics is needed,
data flows into a data store via a stream
processing engine such as Spark (see the
sketch below).
• Raw data can also be analyzed in place in the
Hadoop Distributed File System, which often
serves as a data lake.
• It is important that the data is well organized
and managed to achieve the best performance.
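A minimal sketch of this pattern, assuming PySpark is installed and that JSON events arrive in a landing directory; the paths, schema, and application name below are illustrative, not part of the course material:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, LongType

# Illustrative schema for the incoming JSON events
schema = (StructType()
          .add("user_id", StringType())
          .add("action", StringType())
          .add("ts", LongType()))

spark = SparkSession.builder.appName("StreamToDataLake").getOrCreate()

# Read a stream of JSON files as they land in a directory (hypothetical path)
events = (spark.readStream
          .schema(schema)
          .json("/data/landing"))

# Continuously append the raw events to a data lake location in Parquet form
query = (events.writeStream
         .format("parquet")
         .option("path", "/datalake/events")            # hypothetical lake path
         .option("checkpointLocation", "/datalake/_chk")
         .start())

query.awaitTermination()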

1. Establish big data business objectives


2. Collaborate with partners to assess the
situation and plan
3. Find out the data you already have and
what you need
4. Maintain an ongoing dialogue
5. Start slowly and move quickly in later
stages
6. Analyze the demands on big data
technology
7. Align with cloud-based big data

• Data is analyzed the following ways:


– Data mining
• Uses big data mining and analytics to sift through
data sets in search of patterns and relationships.
– Big data predictive analytics
• Builds models to forecast customer behaviour.
– Machine learning
• Taps algorithms to analyze large data sets.
– Deep learning
• An advanced version of machine learning, in which
algorithms can determine the accuracy of a
prediction on their own.

Big data characteristics


• Big Data involves amounts of data that cannot
be handled by traditional data storage and
processing systems.
• It is used by many multinational companies to
process data and run their business. Its
characteristics are commonly summarized as
the five V’s:
– Volume
– Veracity
– Variety
– Value
– Velocity

Volume
• Big Data involves vast volumes of data generated
daily from many sources, such as business
processes, machines, social media platforms,
networks, human interactions, and many more.
• Facebook, for example, generates approximately
a billion messages, records about 4.5 billion
"Like" clicks, and receives more than 350 million
new posts each day. Big data technologies are
built to handle data at this scale.

Veracity
• Veracity refers to how reliable and trustworthy
the data is.
• Because data comes from many sources, it must
be filtered, cleaned, and translated before use.
• Veracity is about being able to handle and
manage such data efficiently and with confidence.

Variety
• Big Data can be structured,
unstructured, or semi-structured, and is
collected from many different sources. In the
past, data was collected mainly from databases
and spreadsheets; today it arrives in a wide
array of forms such as PDFs, emails, audio,
social media posts, photos, and videos.

Value
– Value is an essential characteristic of big data.
Not everything we store or process is useful;
what matters is the valuable, reliable data that
we store, process, and analyze to produce insight.

Velocity
– Velocity refers to the speed at which data is
created and flows in real time. It covers the
rate at which incoming data sets arrive, their
rate of change, and bursts of activity. A primary
demand on Big Data systems is to deliver the
required data rapidly.

The Promotion of the Value of Big Data


• The primary reason why Big Data has
developed rapidly over recent years is
that it provides long-term enterprise
value.
• Value is captured both in terms of immediate
social or monetary gain and in the form of a
strategic competitive advantage.
• There are various ways in which value can be
captured through Big Data, and enterprises
can leverage them to facilitate growth or
become more efficient.

• 1) Creating transparency
– Big Data is analyzed across different boundaries and
can identify a variety of inefficiencies. In
manufacturing organizations, for example, Big Data
can help identify improvement opportunities across
R&D, engineering and production departments in
order to bring new products faster to market.
• 2) Data driven discovery
– Big Data can provide tremendous new insights that
might not have been identified previously by finding
patterns or trends in data sets. In the insurance
industry for example, Big Data can help to determine
profitable products and provide improved ways to
calculate insurance premiums.

• 3) Segmentation and customization


– The analysis of Big Data provides an improved
opportunity to customize product-market offerings to
specified segments of customers in order to increase
revenues. Data about user or customer behavior
makes it possible to build different customer profiles
that can be targeted accordingly.
• 4) The power of automation
– Automation can optimize enterprise processes and
improve accuracy or response times. Retailers, for
example, can leverage Big Data algorithms to make
purchasing decisions or determine how much stock
will provide an optimal rate of return.

• 5) Innovation and new products


– By analyzing purchasing data or search
volumes, organizations can identify demand
for products that the organization might be
unaware of.

Big Data Use Cases


• With this rapidly growing big data market,
organizations are leveraging big data to gain
insights that help them make better decisions,
improve operations and ultimately drive optimal
growth.
• Real-Time Big Data Use Cases Across
Industries
– Retailers.
– Healthcare
– Financial institutions
– Manufacturing
– Government agencies

Big Data Use Cases In Retail


• Retailers analyze big data to understand
customer preferences and buying
patterns, enabling targeted marketing
campaigns and personalized
recommendations.
– Personalized Recommendations
– Inventory Optimization
– Price Optimization
– Supply Chain Management
– Fraud Detection
– Market Trend Analysis

Big Data Use Cases In Healthcare


• Healthcare organizations leverage big
data to improve patient outcomes by
identifying trends, predicting disease
outbreaks, and optimizing treatment plans
based on large-scale data analysis.
– Predictive Analytics (Disease based on history)
– Personalized Medicine (based on genetic profile)
– Telemedicine And Remote Patient Monitoring
– Health Data Analytics
– Drug Discovery And Development
– Operational Efficiency (streamline operations)

Big Data Use Cases In Banking And Financial Services

• Financial institutions utilize big data to
detect fraudulent activities, manage risk,
and make data-driven investment
decisions.
– Fraud Detection And Prevention
– Risk Management
– Customer Analytics
– Compliance And Regulatory Reporting
– Trading And Investment Analytics
– Loan Management

Big Data Use Cases In Media And Entertainment

• Massive data is generated daily, allowing
media companies to understand their
audience better and customize their
content to maximize engagement and
revenue.
– Content Recommendation
– Advertising Optimization
– Predictive Analytics (future content)
– Performance Tracking (Content)

Big Data Use Cases In Telecom Industry


• The telecom industry generates vast amounts of
data, from call records and network performance to
customer behavior and preferences.
• Telecom service providers use this data to gain
insights into consumer behavior, optimize network
performance, identify and prevent fraud, and create
customized marketing campaigns.
– Network Optimization
– Enhanced Customer Experience
– Fraud Prevention And Detection (unauthorized network
access, hacking, and subscription fraud)
– Marketing And Sales
– Predictive Maintenance

Big Data Use Cases In Supply Chain And Manufacturing

• Big data analytics is revolutionizing supply chain
and manufacturing businesses by offering
insights into their operations, identifying
inefficiencies, and enhancing performance.
– Predictive Maintenance (potential equipment failure)
– Quality Control
– Inventory Management
– Better Product Design

PERCEPTION AND QUANTIFICATION OF VALUE

• Increasing revenues: As an example, an
expectation of using a recommendation engine
would be to increase same-customer sales by
adding more items into the market basket.
• Lowering costs: As an example, using a big
data platform built on commodity hardware for
ETL would reduce or eliminate the need for
more specialized servers used for data staging,
thereby reducing the storage footprint and
reducing operating costs.

• Increasing productivity: Increasing the speed
for the pattern analysis and matching done for
fraud analysis helps to identify more instances of
suspicious behavior faster, allowing for actions
to be taken more quickly and transform the
organization from being focused on recovery of
funds to proactive prevention of fraud.
• Reducing risk: Using a big data platform or
collecting many thousands of streams of
automated sensor data can provide full visibility
into the current state of a power grid, in which
unusual events could be rapidly investigated to
determine if a risk of an imminent outage can be
reduced.

CHARACTERISTICS OF BIG DATA APPLICATIONS

• Data throttling: The business challenge has an
existing solution, but on traditional hardware, the
performance of a solution is throttled as a result
of data accessibility, data latency, data
availability, or limits on bandwidth in relation to
the size of inputs.
• Computation-restricted throttling: There are
existing algorithms, but they are heuristic and
have not been implemented because the
expected computational performance has not
been met with conventional systems.

• Large data volumes: The analytical application
combines a multitude of existing large datasets
and data streams with high rates of data creation
and delivery.
• Significant data variety: The data in the
different sources vary in structure and content,
and some (or much) of the data is unstructured.
• Benefits from data parallelization: Because of
the reduced data dependencies, the
application's runtime can be improved through
task or thread-level parallelization applied to
independent data segments
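As a minimal illustration of data parallelization (not from the course text), independent chunks of data can be processed concurrently with Python's multiprocessing module; the chunk size and the per-chunk function here are assumptions made for the sketch:

from multiprocessing import Pool

def process_chunk(chunk):
    # Hypothetical per-chunk work: count records that satisfy a condition
    return sum(1 for record in chunk if record > 0)

if __name__ == "__main__":
    data = list(range(-500_000, 500_000))
    # Split the data into independent segments with no data dependencies
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    # Process the segments in parallel worker processes
    with Pool() as pool:
        partial_counts = pool.map(process_chunk, chunks)
    # Combining the partial results gives the same answer as a sequential pass
    print(sum(partial_counts))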

Understanding Big Data Storage


• Big data storage is a compute-and-storage architecture
that collects and manages large data sets and
enables real-time data analytics.
• Big data storage is a complicated problem. There are
many things to consider when building the infrastructure
but there are three key considerations:
– Data velocity: Your data must be able to move quickly between
processing centers and databases for it to be helpful in real-time
applications.
– Scalability: The system should be able to expand as your
business does and accommodate new projects as needed
without disrupting existing workflows or causing any downtime.
– Cost efficiency: Because big data projects can be so
expensive, choosing a system that reduces costs without
sacrificing the quality of service or functionality is essential.

• Data Storage Methods in big data storage

– Warehouse Storage: Warehouse storage is
one of the more common ways to store large
amounts of data, but it has drawbacks such as
limited scalability and high upfront cost.
– Cloud Storage: Cloud storage is an
increasingly popular option since it is easier
than ever to use. With a service such as
Amazon S3 on AWS, you can store virtually
unlimited data without provisioning capacity
in advance and pay only for the storage you
actually use (a simple upload is sketched below).
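A minimal sketch of storing a file in cloud object storage with the AWS SDK for Python (boto3); the bucket and object key names are hypothetical, and credentials are assumed to be configured in the environment:

import boto3

# Create an S3 client using credentials from the environment or an IAM role
s3 = boto3.client("s3")

# Upload a local file into a bucket (bucket and key are illustrative)
s3.upload_file("daily_sales.csv", "example-analytics-bucket",
               "raw/2024/daily_sales.csv")

# List the objects under the same prefix to confirm the upload
response = s3.list_objects_v2(Bucket="example-analytics-bucket",
                              Prefix="raw/2024/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])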

Data Storage Technologies


• Hadoop: A distributed processing framework based on
open-source software, Hadoop enables large data sets to be
processed across clusters of computers. It was originally
designed to process and store large data sets across
clusters of commodity hardware.
• HBase: This database is designed to efficiently manage
large tables with billions of rows and millions of columns.
The performance can be tuned by adjusting memory usage,
the number of servers, block size, and other settings.
• Snowflake: Snowflake for Data Lake Analytics is an
enterprise-grade cloud platform for advanced analytics
applications. It offers real-time access to historical
and streaming data from any source and format at any
scale, without requiring changes to existing
applications or workflows.

High-Performance Computing
• HPC environments are designed for high-speed floating-
point processing, and much of the computation is done
in memory, which yields the highest possible
computational performance.
• Cray Computers and IBM Blue Gene are examples of
HPC environments.
• HPC environments are predominantly used by research
organizations and by business units that demand very
high scalability and computational performance, where
the value being created is so huge and strategic that
cost is not the most important consideration.
• While HPC environments have been around for quite
some time, they are used for specialty applications and
primarily provide a programming environment for custom
application development.

Hadoop
• Hadoop is an open-source software framework for
storing data and running applications on clusters of
commodity hardware. It provides massive storage for
any kind of data, enormous processing power and the
ability to handle virtually limitless concurrent tasks or
jobs.
Why is Hadoop important?
• Ability to store and process huge amounts of any
kind of data, quickly. With data volumes and varieties
constantly increasing, especially from social media and
the Internet of Things (IoT), that's a key consideration.
• Computing power. Hadoop's distributed computing
model processes big data fast. The more computing
nodes you use, the more processing power you have.

• Fault tolerance. Data and application processing are
protected against hardware failure. If a node goes down,
jobs are automatically redirected to other nodes to make
sure the distributed computing does not fail. Multiple
copies of all data are stored automatically.
• Flexibility. Unlike traditional relational databases, you
don’t have to preprocess data before storing it. You can
store as much data as you want and decide how to use it
later. That includes unstructured data like text, images
and videos.
• Low cost. The open-source framework is free and uses
commodity hardware to store large quantities of data.
• Scalability. You can easily grow your system to handle
more data simply by adding nodes. Little administration
is required.

Hadoop Architecture

Hadoop Components
• The Hadoop architecture mainly consists
of four components:
• MapReduce
• HDFS(Hadoop Distributed File System)
• YARN(Yet Another Resource Negotiator)
• Common Utilities or Hadoop Common

The Hadoop Distributed File System (HDFS)

• HDFS is the storage system for a Hadoop cluster.
• When data lands in the cluster, HDFS breaks it into
pieces and distributes those pieces among the
different servers participating in the cluster.
• HDFS has a master/slave architecture.
• An HDFS cluster consists of a single NameNode,
a master server that manages the file system
namespace and regulates access to files by
clients.

NameNode and DataNodes


• HDFS exposes a file system namespace and allows
user data to be stored in files.
• Internally, a file is split into one or more blocks and
these blocks are stored in a set of DataNodes.
• The NameNode executes file system namespace
operations like opening, closing, and renaming files
and directories. It also determines the mapping of
blocks to DataNodes.
• The DataNodes are responsible for serving read and
write requests from the file system’s clients.
• The DataNodes also perform block creation,
deletion, and replication upon instruction from the
NameNode.

Data Replication
• HDFS is designed to reliably store very large
files across machines in a large cluster.
• It stores each file as a sequence of blocks. The
blocks of a file are replicated for fault tolerance.
• The block size and replication factor are
configurable per file.
• All blocks in a file except the last block are the
same size. (Since support for variable-length
blocks was added to append and hsync, users can
start a new block without filling the last block
to the configured block size.)

• An application can specify the number of
replicas of a file. The replication factor can be
specified at file creation time and can be
changed later. Files in HDFS are write-once
(except for appends and truncates) and have
strictly one writer at any time.
• The NameNode makes all decisions regarding
replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the
DataNodes in the cluster. Receipt of a Heartbeat
implies that the DataNode is functioning
properly. A Blockreport contains a list of all
blocks on a DataNode.
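A small back-of-the-envelope sketch (not from the slides) of how the block size and replication factor determine physical storage, assuming the common defaults of 128 MB blocks and a replication factor of 3:

import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, total replicated storage in MB)."""
    # A file is split into fixed-size blocks; only the last block may be smaller
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored 'replication' times across different DataNodes
    total_storage_mb = file_size_mb * replication
    return num_blocks, total_storage_mb

# Example: a 1 GB (1024 MB) file
blocks, storage = hdfs_footprint(1024)
print(blocks, "blocks,", storage, "MB of raw storage across the cluster")
# -> 8 blocks, 3072 MB of raw storage across the cluster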

MapReduce
• MapReduce is a programming model and processing
engine that runs on top of the YARN framework.
• The major feature of MapReduce is distributed,
parallel processing across a Hadoop cluster, which
is what makes Hadoop so fast.
• MapReduce has two main tasks, divided phase-wise:
• In the first phase the Map task runs, and in the
next phase the Reduce task runs.

Map Task
• RecordReader: The purpose of the RecordReader is to break the input
split into records. It is responsible for providing key-value pairs to the
Map() function. The key is typically the record's positional (offset)
information and the value is the data associated with it.
• Map: A map is a user-defined function that processes the records
obtained from the RecordReader. The Map() function may emit zero,
one, or many key-value pairs for each input record.
• Combiner: The combiner groups the data in the Map workflow. It is
similar to a local reducer: the intermediate key-value pairs generated
by the Map are combined with its help. Using a combiner is optional.
• Partitioner: The partitioner is responsible for routing the key-value
pairs generated in the Map phase. It creates one shard per reducer by
taking the hash code of each key modulo the number of reducers
(key.hashCode() % numberOfReducers). A minimal sketch of this
partitioning rule follows.
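A minimal Python sketch of the hash-partitioning rule described above (Hadoop's actual partitioner is Java's HashPartitioner; the keys and reducer count here are illustrative):

def stable_hash(key):
    # Simple deterministic hash for the sketch (analogous to Java's hashCode)
    h = 0
    for ch in str(key):
        h = (31 * h + ord(ch)) & 0x7FFFFFFF
    return h

def partition(key, num_reducers=3):
    # Analogue of key.hashCode() % numberOfReducers:
    # every occurrence of the same key is routed to the same reducer
    return stable_hash(key) % num_reducers

for key in ["apple", "banana", "apple", "cherry"]:
    print(key, "-> reducer", partition(key))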

Reduce Task
• Shuffle and Sort: The reducer's work starts with this step.
The process in which the mapper's intermediate key-value
pairs are transferred to the reducer task is known as shuffling.
During shuffling the system sorts the data by key. Shuffling
begins as soon as some of the map tasks finish, so it does not
wait for all mappers to complete, which speeds up the job.
• Reduce: The main task of the reduce step is to gather the
tuples generated by the Map phase and perform aggregation
(and any further sorting) on those key-value pairs, grouped
by key.
• OutputFormat: Once all the operations are performed, the
key-value pairs are written into the output file with the help
of a RecordWriter, one record per line, with the key and value
separated by a space.

YARN (Yet Another Resource Negotiator)


• YARN is the framework on which MapReduce runs.
• YARN performs two operations: job scheduling and
resource management.
• The purpose of the job scheduler is to divide a big
task into small jobs so that each job can be assigned
to various slaves in a Hadoop cluster and processing
can be maximized.
• The job scheduler also keeps track of which job is
important, which job has higher priority, dependencies
between the jobs, and other information such as job
timing.
• The resource manager manages all the resources that
are made available for running the Hadoop cluster.

Features of YARN
• Multi-Tenancy (Multi user with single interface)
• Scalability
• Cluster-Utilization
• Compatibility

Hadoop Common or Common Utilities


• Hadoop Common (the common utilities) is the
set of Java libraries and files needed by all the
other components present in a Hadoop cluster.
• These utilities are used by HDFS, YARN, and
MapReduce for running the cluster.
• Hadoop Common assumes that hardware failure
in a Hadoop cluster is common, so failures need
to be handled automatically in software by the
Hadoop framework.

MapReduce Programming Model


• MapReduce: is a programming model that
allows us to perform parallel processing across
Big Data using a large number of nodes
(multiple computers).
• Cluster Computing: nodes are homogeneous
and located on the same local network.
• Grid Computing: nodes are heterogeneous
(different hardware) and located geographically
far from each other.

MapReduce programming Steps:

• Input Split: In this step raw input data is
divided into chunks called input splits (size
between 16 MB and 64 MB). The input of this
step takes the form (key1, value1).
• Mapping: A node that is assigned a map
function takes the input and emits a set of (key2,
value2) pairs. One node in the cluster is special,
the master node; it assigns the work to the
worker (slave) nodes and makes sure the job is
done by these slaves.

MapReduce programming Steps


• Shuffling: In this step the output of the mapping
function is grouped by key and redistributed so
that all data with the same key is located on the
same node. The output of this step takes the
form (k2, list(v2)).
• Reducing: Nodes now process each group of
data by aggregating the values output by the
shuffle phase. The final output takes the form
of a list of (k3, v3) pairs.

Phases of MapReduce

Code
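A minimal, self-contained Python sketch of the classic word-count example, simulating the split, map, shuffle, and reduce steps described above in a single process (a real job would run the same logic distributed across a Hadoop cluster; the sample documents are illustrative):

from collections import defaultdict

documents = [
    "big data needs big storage",
    "map reduce processes big data",
]

# Input Split: each document acts as one split -> (key1, value1)
splits = list(enumerate(documents))

# Mapping: emit (key2, value2) = (word, 1) for every word
def map_fn(doc_id, text):
    for word in text.split():
        yield (word, 1)

mapped = [pair for doc_id, text in splits for pair in map_fn(doc_id, text)]

# Shuffling: group values by key -> (k2, list(v2))
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reducing: aggregate each group -> list of (k3, v3)
def reduce_fn(word, counts):
    return (word, sum(counts))

result = [reduce_fn(word, counts) for word, counts in shuffled.items()]
print(sorted(result))
# [('big', 3), ('data', 2), ('map', 1), ('needs', 1),
#  ('processes', 1), ('reduce', 1), ('storage', 1)]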

Further Study
• EMC Education Services, “Data Science and Big Data
Analytics: Discovering, Analyzing, Visualizing and Presenting
Data”, Wiley publishers, 2015.
• Bart Baesens, “Analytics in a Big Data World: The Essential
Guide to Data Science and its Applications”, Wiley Publishers,
2015.
• https://www.projectpro.io/article/5-big-data-use-cases-how-
companies-use-big-data/155
• https://www.geeksforgeeks.org/mapreduce-architecture/
• https://www.techtarget.com/searchbusinessanalytics/definition
/big-data-analytics
