
FUNDAMENTALS OF BIG DATA ANALYTICS

UNIT-1
Types of Digital Data: Classification of Digital Data.
Introduction to Big Data: Characteristics of Data, Evolution of Big Data, Definition of Big
Data, Challenges with Big Data, What is Big Data?
Big Data Analytics: Where do we Begin?, What is Big Data Analytics?, What Big Data
Analytics isn’t?, Classification of Analytics, Terminologies Used in Big Data Environments.
The Big Data Technology Landscape: NoSQL

Q) What are the characteristics of data?

 Data is a collection of details in the form of figures, text,
symbols, descriptions, etc.
 Data contains raw figures and facts. Information, unlike data, provides
insights derived by analyzing the collected data.

Data has 3 characteristics:


1. Composition: The composition of data deals with the structure of the data,
i.e., the sources of the data, the granularity, and the types and nature of the
data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of the data, i.e., “Can
one use this data as is for analysis?” or “Does it require cleansing for
further enhancement and enrichment?”
3. Context: The context of data deals with “Where has this data been
generated?”, “Why was this data generated?”, “How sensitive is this
data?”, “What are the events associated with this data?”

Q) What is digital data? Explain different types of digital data.

The data that is stored using specific machine language systems which can
be interpreted by various technologies is called digital data.
Eg. Audio, video or text information

Digital Data is classified into three types:


1. Structured Data:
This is data which is in an organized form, for example, in rows
and columns.
The number of rows is called the cardinality and the number of columns is
called the degree of a relation.
Sources: Databases, spreadsheets, OLTP systems.

Working with Structured data:


- Storage: Data types – both defined and user defined – help with the
storage of structured data.
- Update/delete: Updating, deleting, etc. are easy due to the structured form.
- Security: Security can be provided easily in an RDBMS.
- Indexing/Searching: Data can be indexed based not only on a
text string but on other attributes as well. This enables streamlined search.
- Scalability (horizontal/vertical): Scalability is generally not an issue
with an increase in data, as resources can be increased easily.
- Transaction processing: support for ACID properties (Atomicity,
Consistency, Isolation, Durability).
Fig. Sample representation of types of digital data

2. Semi-Structured Data:

This is data which does not conform to a data model but has some structure.
Metadata for this data is available but is not sufficient.
Sources: XML, JSON, E-mail

Characteristics:
- Inconsistent structure
- Self-describing (label/value pairs)
- Schema information is blended with the data values
- Data objects may have different attributes that are not known beforehand

Challenges:
 Storage cost: Storing data with their schemas increases cost
 RDBMS: Semi-structured data cannot be stored in existing RDBMS as
data cannot be mapped into tables directly
 Irregular and partial structure: Some data elements may have extra
information while others none at all
 Implicit structure: In many cases the structure is implicit.
 Interpreting relationships and correlations is very difficult
 Flat files: Semi-structured data is usually stored in flat files, which are
difficult to index and search
 Heterogeneous sources: Data comes from varied sources, which makes it
difficult to tag and search.

3. Unstructured Data:

 This is the data which does not conform to a data model or is not in a
form which can be used easily by a computer program.
 About 80–90% of an organization's data is in this format.
 Sources: memos, chat rooms, PowerPoint presentations, images,
videos, letters, research reports, white papers, the body of an email, etc.

Characteristics:
 Does not conform to any data model
 Can’t be stored in the form of rows and columns
 Not in any particular format or sequence
 Not easily usable by a program
 Doesn’t follow any rule or semantics
Challenges:
 Storage space: The sheer volume of unstructured data and its
unprecedented growth make it difficult to store. Audio, video,
images, etc. require huge amounts of storage space
 Scalability: Scalability becomes an issue with increase in
unstructured data
 Retrieve information: Retrieving and recovering unstructured data are
cumbersome
 Security: Ensuring security is difficult due to varied sources of data
(e.g. e-mail, web pages)
 Update/delete: Updating, deleting, etc. are not easy due to the
unstructured form
 Indexing and Searching: Indexing becomes difficult with increase in
data.
 Searching is difficult for non-text data
 Interpretation: Unstructured data is not easily interpreted by
conventional search algorithms
 Tags: As the data grows, it is not possible to put tags manually
 Indexing: Designing algorithms to understand the meaning
of the document and then tag or index them accordingly is difficult.

Dealing with Unstructured data:


 Data Mining: Knowledge discovery in databases. Popular mining
algorithms are association rule mining, regression analysis, and
collaborative filtering.
 Natural Language Processing: It is related to HCI. It is about
enabling computers to understand human or natural language input.
 Text Analytics: Text mining is the process of gleaning high quality
and meaningful information from text. It includes tasks such as text
categorization, text clustering, sentiment analysis and concept/entity
extraction.
 Noisy text analytics: The process of extracting structured or semi-
structured information from noisy unstructured data such as chats,
blogs, wikis and emails, which contain spelling mistakes,
abbreviations, fillers (uh, hm) and non-standard words.
 Manual Tagging with meta data: This is about tagging manually with
adequate meta data to provide the requisite semantics to understand
unstructured data.
 Parts-of-Speech Tagging: POST is the process of reading text and
tagging each word in a sentence as belonging to a particular part of
speech such as noun, verb, or adjective.
 Unstructured Information Management Architecture (UIMA): An open-source
platform from IBM used for real-time content analytics.

Q) Define Big Data. What are the characteristics of Big Data?

Big Data is high-volume, high-velocity, and high-variety information assets that
demand cost-effective, innovative forms of information processing for
enhanced insight and decision making.
Characteristics(V’s):
1. Volume: It refers to the amount of data. Data sizes have grown
from bits to yottabytes:
Bits-> Bytes-> KBs-> MBs-> GBs-> TBs-> PBs-> Exabytes-> Zettabytes->
Yottabytes

There are different sources of data like doc, pdf, YouTube, a chat
conversation on internet messenger, a customer feedback form on an online
retail website, CCTV coverage and weather forecast.

The sources of Big data:


1. Typical internal data sources: data present within an organization’s
firewall.
Data storage: File systems, SQL RDBMSs (Oracle, MS SQL Server, DB2,
MySQL, PostgreSQL, etc.), NoSQL databases (MongoDB, Cassandra, etc.), and so on.
Archives: Archives of scanned documents, paper archives, customer
correspondence records, patient’s health records, student’s admission
records, student’s assessment records, and so on.
2. External data sources: data residing outside an organization’s
Firewall.
Public web: Wikipedia, regulatory, compliance, weather, census etc.,
3. Both (internal + external sources)
Sensor data, machine log data, social media, business apps, media and
docs.

2. Variety: Variety deals with the wide range of data types and
sources of data. Structured, semi-structured and Unstructured.
Structured data: From traditional transaction processing systems and
RDBMS, etc.
Semi-structured data: For example Hypertext Markup Language (HTML),
eXtensible Markup Language (XML).
Unstructured data: For example unstructured text documents, audios,
videos, emails, photos, PDFs , social media, etc.
3. Velocity: It refers to the speed of data processing. We have
moved from the days of batch processing to real-time processing.

Fig. 3 V’s of Big Data


Other V’s of Big Data are:
4. Veracity: Veracity refers to biases, noise and abnormality in data. The
key question is “Is all the data that is being stored, mined and analysed
meaningful and pertinent to the problem under consideration”.
5. Value: This refers to the value that big data can provide, and it relates
directly to what organizations can do with that collected data. It is often
quantified as the potential social or economic value that the data might
create.
6. Volatility: It deals with “How long the data is valid? “
7. Validity: Validity refers to accuracy & correctness of data. Any data
picked up for analysis needs to be accurate.
8. Variability: Data flows can be highly inconsistent with periodic peaks.

Q) How is traditional BI environment different from Big data


environment?
Traditional BI Environment | Big Data Environment
Data is stored in a central server. | Data is stored in a distributed file system.
The server scales vertically. | The distributed file system scales horizontally.
Analyzes offline or historical data. | Analyzes real-time or streaming data.
Supports structured data only. | Supports a variety of data, i.e., structured, semi-structured and unstructured data.

Q) Explain evolution of Big Data. What are the challenges of Big Data?

Evolution:
1. 1970s and before (data generation and storage): mainframes; primitive and structured data; basic data storage.
2. 1980s and 1990s (data utilization): relational databases; data-intensive applications.
3. 2000s and beyond (data driven): structured, unstructured and multimedia data; complex and unstructured data.

The challenges with big data:


1. Data today is growing at an exponential rate. Most of the data that we have today has
been generated in the last two years. The key question is: will all this data be useful
for analysis, and how will we separate knowledge from noise?
2. How to host big data solutions outside the organization's walls (for example, in the cloud).
3. The period of retention of big data.
4. Dearth of skilled professionals who possess a high level of proficiency in data science
that is vital in implementing Big data solutions.
5. Challenges with respect to capture, curation, storage, search, sharing, transfer,
analysis, privacy violations and visualization.
6. Shortage of data visualization experts.
7. Scale : The storage of data is becoming a challenge for everyone.
8. Security: The production of more and more data increases security and privacy
concerns.
9. Schema: there is no place for rigid schema, need of dynamic schema.
10. Continuous availability: How to provide 24X7 support
11. Consistency: Should one opt for consistency or eventual consistency.
12. Partition tolerance: how to build partition-tolerant systems that can take care of both
hardware and software failures.
13. Data quality: Inconsistent data, duplicates, logic conflicts, and missing data all result
in data quality challenges.

Q) Define Big Data Analytics. What are the various types of analytics?

Big Data Analytics is the process of examining big data to uncover


patterns, unearth trends, and find unknown correlations and other useful
information to make faster and better decisions.
A few top analytics tools are: MS Excel, SAS, IBM SPSS Modeler, R,
Statistica, World Programming System (WPS), and Weka.
The open-source analytics tools among these are R and Weka.

Big Data Analytics is:


1. Technology enabled analytics: The analytical tools help to
process and analyze big data.
2. About gaining meaningful, deeper, and richer insights into the
business to drive it in the right direction, understanding the
customer demographics, better leveraging the services of
vendors and suppliers, etc.
3. About a competitive edge over competitors, enabled by
findings that allow quicker and better decision making.
4. A tight handshake between 3 communities: IT, Business users
and Data Scientists.
5. Working with datasets whose volume and variety exceed the
current storage and processing capabilities and infrastructure
of the enterprise.
6. About moving code to data. This makes perfect sense as the
program for distributed processing is tiny compared to the data.

Classification of Analytics: There are basically two schools of thought:


1. Those that classify analytics into basic, operational, advanced and
monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0 and
analytics 3.0.
First school of thought:
1. Basic analytics: This is primarily slicing and dicing of data to help with
basic business insights. It is about reporting on historical data,
basic visualization, etc.
2. Operationalized Analytics: It is operationalized analytics if it gets
woven into the enterprise’s business process.
3. Advanced Analytics: This largely is about forecasting for the future by
way of predictive and prescriptive modeling.
4. Monetized analytics: This is analytics in use to derive direct business
revenue.

Second school of thought:


Analytics 1.0
- Era: 1950s to 2009.
- Descriptive statistics (report on events, occurrences, etc. of the past).
- Key questions asked: What happened? Why did it happen?
- Data from legacy systems, ERP, CRM and third-party applications.
- Small and structured data sources; data stored in enterprise data
warehouses or data marts.
- Data was internally sourced.
- Tools: relational databases.

Analytics 2.0
- Era: 2005 to 2012.
- Descriptive statistics + predictive statistics (use data from the past to
make predictions for the future).
- Key questions asked: What will happen? Why will it happen?
- Big data.
- Big data is being taken up seriously. Data is mainly unstructured,
arriving at a higher pace. This fast flow of big-volume data had to be
stored and processed rapidly, often on massively parallel servers
running Hadoop.
- Data was often externally sourced.
- Tools: database appliances, Hadoop clusters, SQL-to-Hadoop
environments, etc.

Analytics 3.0
- Era: 2012 to present.
- Descriptive statistics + predictive statistics + prescriptive statistics
(use data from the past to make prophecies for the future and, at the
same time, make recommendations to leverage the situation to one's
advantage).
- Key questions asked: What will happen? When will it happen? Why will
it happen? What action should be taken to take advantage of what will
happen?
- A blend of big data and data from legacy systems, ERP, CRM and
third-party applications.
- A blend of big data and traditional analytics to yield insights and
offerings with speed and impact.
- Data is being both internally and externally sourced.
- Tools: in-memory analytics, in-database processing, agile analytical
methods, machine learning techniques, etc.

Q) What are the advantages of Big Data Analytics?

 Business Transformation In general, executives believe that big data
analytics offers tremendous potential to revolutionize their organizations.
 Competitive Advantage According to a survey, 57 percent of enterprises
said their use of analytics was helping them achieve competitive
advantage, up from 51 percent who said the same thing in 2015.
 Innovation Big data analytics can help companies develop products
and services that appeal to their customers, as well as helping them
identify new opportunities for revenue generation.
 Lower Costs In the New Vantage Partners Big Data Executive Survey
2017, 49.2 percent of companies surveyed said that they had
successfully decreased expenses as a result of a big data project.
 Improved Customer Service Organizations often use big data
analytics to examine social media, customer service, sales and
marketing data. This can help them better gauge customer sentiment
and respond to customers in real time.
 Increased Security Another key area for big data analytics is IT
security. Security software creates an enormous amount of log data.

Q) List what Big Data Analytics is not?

Big Data Analytics coexists with both RDBMS and data warehouses, leveraging
the power of each to yield business value.
Big Data Analytics isn’t:
 Only about volume
 Just about technology
 Meant to replace RDBMS
 Meant to replace the data warehouse
 Only used by huge online companies like Google or Amazon
 A “one-size-fits-all” traditional RDBMS built on shared disk and
memory.
Q) Explain different Big Data Analytics Approaches.

Reactive – Business Intelligence: It is about analysis of the past or
historical data and then displaying the findings of the analysis, or reports, in
the form of enterprise dashboards, alerts, notifications, etc.
Reactive – Big Data Analytics: Here the analysis is done on huge datasets,
but the approach is still reactive as it is still based on static data.
Proactive – Analytics: This is to support futuristic decision making by the
use of data mining, predictive modeling, text mining and statistical analysis.
This analysis is not on big data as it still uses traditional database
management practices.
Proactive – Big Data Analytics: This is sieving through terabytes of
information to filter out the relevant data to analyze. This also includes high-
performance analytics to gain rapid insights from big data and the ability to
solve complex problems using more data.

Q) Explain the following terminology of Big Data


a. In-Memory Analytics
b. In-Database processing
c. Symmetric Multi-Processor system
d. Massively parallel processing
e. Shared nothing architecture
f. CAP Theorem

In-memory Analytics: Data access from non-volatile storage such as hard


disk is a slow process. This problem has been addressed using In-memory
Analytics. Here all the relevant data is stored in Random Access memory
(RAM) or primary storage thus eliminating the need to access the data from
hard disk. The advantages are faster access, rapid deployment, better insights,
and minimal IT involvement.

In-Database Processing: In-Database processing is also called In-database


analytics. It works by fusing data warehouses with analytical systems.
Typically, the data from various enterprise OLTP systems, after cleaning up
through the process of ETL, is stored in the enterprise data warehouse or
data marts, and the huge data sets are then exported to analytical programs
for complex and extensive computations. With in-database processing, the
analytical computation is instead carried out inside the database itself,
avoiding the time-consuming export of large data sets.

Symmetric Multi-Processor System:


In this there is single common main memory that is shared by two or more
identical processors. The processors have full access to all I/O devices and
are controlled by single operating system instance.
SMPs are tightly coupled multiprocessor systems. Each processor has its
own high-speed memory, called cache memory, and the processors are
connected using a system bus.
Fig. Symmetric Multiprocessor System(SMP)

Massively Parallel Processing:


Massively Parallel Processing (MPP) refers to the coordinated processing of
programs by a number of processors working in parallel. The processors each
have their own OS and dedicated memory. They work on different parts of
the same program. The MPP processors communicate using some sort of
messaging interface.
MPP is different from symmetric multiprocessing in that SMP works with
processors sharing the same OS and same memory. SMP also referred as
tightly coupled Multiprocessing.

Fig. Distributed Computing and Parallel Computing Environments

Shared nothing Architecture: The three most common types of


architecture for multiprocessor systems:
1. Shared memory
2. Shared disk
3. Shared nothing.
In shared memory architecture, a common central memory is shared by
multiple processors.
In shared disk architecture, multiple processors share a common
collection of disks while having their own private memory.
In shared nothing architecture, neither memory nor disk is shared
among multiple processors.

Advantages of shared nothing architecture:


Fault Isolation: A “shared nothing architecture” provides the benefit of
isolating fault. A fault in a single node is contained and confined to that
node exclusively and exposed only through messages or lack of it.
Scalability: If the disk is a shared resource, the controller and the disk
bandwidth are also shared. Synchronization will have to be implemented to
maintain a consistent shared state, which means that different nodes will
have to take turns to access the critical data. This imposes a limit on how
many nodes can be added to the distributed shared-disk system, thus
compromising scalability.

CAP Theorem: The CAP theorem is also called Brewer’s theorem. It
states that in a distributed computing environment, it is impossible to
provide all three of the following guarantees simultaneously. At best you can
have two of the three, and one must be sacrificed.
1. Consistency
2. Availability
3. Partition tolerance

1. Consistency implies that every read fetches the last write. Consistency
means that all nodes see the same data at the same time. If there are
multiple replicas and there is an update being processed, all users see
the update go live at the same time even if they are reading from
different replicas.
2. Availability implies that reads and writes always succeed. Availability
is a guarantee that every request receives a response about whether it
was successful or failed.
3. Partition tolerance implies that the system will continue to function
when network partition occurs. It means that the system continues to
operate despite arbitrary message loss or failure of part of the system.

Fig. Databases and CAP

Q) What is BASE?

Basically Available, Soft State, Eventual Consistency (BASE) is a data
system design philosophy which, in a distributed environment, gives
importance to availability over consistency of operations.
BASE may be explained in contrast to another design philosophy -
Atomicity, Consistency, Isolation, and Durability (ACID). The ACID model
promotes consistency over availability, whereas BASE promotes availability
over consistency.
Q) Give Real-time applications of Big Data Analytics.

1. Banking and Securities Industry:

 This industry also heavily relies on Big Data for risk analytics,
including anti-money laundering, demand enterprise risk
management, "Know Your Customer," and fraud mitigation.
 The Securities and Exchange Commission (SEC) is using Big Data to
monitor financial market activity. They are currently using network
analytics and natural language processors to catch illegal trading
activity in the financial markets.

2. Communications, Media and Entertainment Industry:


Organizations in this industry simultaneously analyze customer data along
with behavioral data to create detailed customer profiles that can be used to:
 Create content for different target audiences
 Recommend content on demand
 Measure content performance
Eg.
 Spotify, an on-demand music service, uses Hadoop Big Data
analytics, to collect data from its millions of users worldwide and then
uses the analyzed data to give informed music recommendations to
individual users.
 Amazon Prime, which is driven to provide a great customer
experience by offering video, music, and Kindle books in a one-stop-
shop, also heavily utilizes Big Data.

3. Healthcare Sector:
 Some hospitals, like Beth Israel, are using data collected from a cell
phone app, from millions of patients, to allow doctors to use evidence-
based medicine as opposed to administering several medical/lab tests
to all patients who go to the hospital.
 Free public health data and Google Maps have been used by the
University of Florida to create visual data that allows for faster
identification and efficient analysis of healthcare information, used in
tracking the spread of chronic disease.

4. Education:
 The University of Tasmania, an Australian university, has
deployed a Learning and Management System that tracks, among
other things, when a student logs onto the system, how much time is
spent on different pages in the system, as well as the overall progress
of a student over time.
 On a governmental level, the Office of Educational Technology in the
U.S. Department of Education is using Big Data to develop analytics
to help course-correct students who are going astray while
using online Big Data certification courses. Click patterns are also
being used to detect boredom.

5. Government:
 In public services, Big Data has an extensive range of applications,
including energy exploration, financial market analysis, fraud
detection, health-related research, and environmental protection.
 The Food and Drug Administration (FDA) is using Big Data to detect
and study patterns of food-related illnesses and diseases.

6. Insurance Industry:
Big data has been used in the industry to provide customer insights for
transparent and simpler products, by analyzing and predicting customer
behavior through data derived from social media, GPS-enabled devices, and
CCTV footage. Big Data also allows for better customer retention for
insurance companies.

7. Transportation Industry:
Some applications of Big Data by governments, private organizations, and
individuals include:
 Governments use of Big Data: traffic control, route planning,
intelligent transport systems, congestion management (by predicting
traffic conditions)
 Private-sector use of Big Data in transport: revenue management,
technological enhancements, logistics and for competitive advantage
(by consolidating shipments and optimizing freight movement)

8. Energy and Utility Industry:


Smart meter readers allow data to be collected almost every 15 minutes as
opposed to once a day with the old meter readers. This granular data is
being used to analyze the consumption of utilities better, which allows for
improved customer feedback and better control of utilities use.

Q) What is NoSQL? What is the need of NoSQL? Explain different types


of NoSQL databases.

NoSQL stands for Not Only SQL. These are non-relational, open-source,
distributed databases.

Features of NoSQL:
1. NoSQL databases are non-relational: They do not adhere to the relational
data model. In fact, they are either key-value pair, document-oriented,
column-oriented or graph-based databases.
2. Distributed: The data is distributed across several nodes in a cluster
constituted of low-cost commodity hardware.
3. No support for ACID properties: They do not offer support for the ACID
properties of transactions. On the contrary, they adhere to the CAP
theorem.
4. No fixed table schema: NoSQL databases are becoming increasingly
popular owing to the flexibility they offer with respect to schema. They do
not mandate that the data strictly adhere to any schema structure at
the time of storage.

Need of NoSQL:

1. It has scale out architecture instead of the monolithic architecture of


relational databases.
2. It can house large volumes of structured, semi-structured and
unstructured data.
3. Dynamic Schema: It allows insertion of data without a predefined
schema.
4. Auto-sharding: It automatically spreads data across an arbitrary
number of servers or nodes in a cluster.
5. Replication: It offers good support for replication which in turn
guarantees high availability, fault tolerance and disaster recovery.

Types of NoSQL databases: They are broadly divided into key-value (big hash
table) stores and schema-less stores (document, column and graph databases).

1. Key-Value: It maintains a big hash table of keys and values.
Keys are unique.
It is fast, scalable and fault tolerant.
It cannot model more complex data structures such as objects.
Eg. Dynamo, Redis, Riak etc.
Sample key-value pair database:
-------------------------------
Key     Value
Fname   Praneeth
Lname   Ch
-------------------------------
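
As a quick illustration (not from the source), the key-value idea can be sketched with a plain in-memory map in Java; real key-value stores such as Redis or Riak add persistence, replication and distribution on top of this lookup model.

import java.util.HashMap;
import java.util.Map;

public class KeyValueDemo {
    public static void main(String[] args) {
        // Every value is addressed by a unique key, as in the Fname/Lname sample above
        Map<String, String> store = new HashMap<>();
        store.put("Fname", "Praneeth");
        store.put("Lname", "Ch");
        System.out.println(store.get("Fname")); // lookup by key -> Praneeth
    }
}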

2. Document: It maintains data in collections constituted of documents.


Eg. MongoDB, Apache CouchDB, Couchbase, MarkLogic etc.
Sample Document in Document DB:
{
“Book Name”: ”Big Data and Analytics”,
“Publisher”: “Wiley India”,
“Year”: “2015”
}
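
As a hedged illustration, the sample document above could be inserted through the MongoDB Java driver roughly as follows (the connection URI, the database name "library" and the collection name "books" are assumptions, not part of the source):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class DocumentStoreDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("library");            // assumed database name
            MongoCollection<Document> books = db.getCollection("books"); // assumed collection name
            // Schema-less insert: the document carries its own field names
            Document doc = new Document("Book Name", "Big Data and Analytics")
                    .append("Publisher", "Wiley India")
                    .append("Year", "2015");
            books.insertOne(doc);
        }
    }
}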

3. Column: Each storage block has data from only one column. A query
fetches only the column families of those columns that it requires
(all columns in a column family are stored together on the disk, so
multiple rows can be retrieved in one read operation, giving data locality).
Eg. Cassandra, HBase etc.

Sample column database:


UserProfile = {
Cassandra = { emailAddress:”[email protected]” , age:”20”}
TerryCho = { emailAddress:”[email protected]” , gender:”male”}
Cath = { emailAddress:”[email protected]” ,
age:”20”,gender:”female”,address:”Seoul”}
}

4. Graph: They are also called network databases. A graph stores data in
nodes.
Data model:
o (Property Graph) nodes and edges
 Nodes may have properties (including ID)
 Edges may have labels or roles
o Key-value pairs on both
Eg. Neo4j, HyperGraphDB, InfiniteGraph etc.
Sample Graph database:
Fig. Sample Graph Database

Q) What are the advantages and disadvantage of NoSQL?

Advantages:
 Big Data Capability
 No Single Point of Failure
 Easy Replication
 It provides fast performance and horizontal scalability.
 Can handle structured, semi-structured, and unstructured data with
equal effect
 NoSQL databases don't need a dedicated high-performance server
 It can serve as the primary data source for online applications.
 Excels at distributed database and multi-data centre operations
 Eliminates the need for a specific caching layer to store data
 Offers a flexible schema design which can easily be altered without
downtime or service disruption

Disadvantages:
 Limited query capabilities
 RDBMS databases and tools are comparatively mature
 It does not offer any traditional database capabilities, like consistency
when multiple transactions are performed simultaneously.
 When the volume of data increases, it is difficult to maintain unique
keys
 Doesn't work as well with relational data
 Being open-source options, they are not yet as popular with enterprises
 No support for join and group-by operations.

Q) Differentiate SQL and NoSQL.

SQL | NoSQL
Relational database | Non-relational, distributed database
Relational model | Model-less approach
Pre-defined schema | Dynamic schema for unstructured data
Table-based databases | Document-based, graph-based, wide-column store or key-value pair databases
Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL | Uses UnQL (Unstructured Query Language)
Not preferred for large datasets | Largely preferred for large datasets
Not a best fit for hierarchical data | Best fit for hierarchical storage as it follows the key-value way of storing data, similar to JSON (JavaScript Object Notation)
Emphasis on ACID properties | Follows Brewer’s CAP theorem
Excellent support from vendors | Relies heavily on community support
Supports complex querying and data-keeping needs | Does not have good support for complex querying
Can be configured for strong consistency | A few support strong consistency (e.g., MongoDB); a few others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.

Q) Explain where to use NoSQL? Explain some real time applications of


NoSQL.

Key-value stores: shopping carts, web user data analysis (e.g., Amazon, LinkedIn).
Document-based stores: real-time analysis, logging, document archive management.
Column-oriented stores: analyzing huge web user actions, sensor feeds (e.g., Facebook, Twitter, eBay, Netflix).
Graph-based stores: network modeling, recommendations (e.g., Walmart's up-sell/cross-sell).

Real time applications of NoSQL in BigData Analytics:


 HBase, a popular NoSQL database for Hadoop, is used extensively by
Facebook for its messaging infrastructure.
 HBase is used by Twitter for generating data, storing, logging, and
monitoring data around people search.
 HBase is used by the discovery engine StumbleUpon for data
analytics and storage.
 MongoDB is another NoSQL database, used by CERN, the European
nuclear research organization, for collecting data from the huge
particle collider, the Large Hadron Collider.
 LinkedIn, Orbitz, and Concur use the Couchbase NoSQL Database for
various data processing and monitoring tasks.

Q) What is NewSQL? Differentiate SQL, NoSQL and NewSQL

NewSQL supports relational data model and uses SQL as their primary
interface.
NewSQL characteristics:
 SQL interface for application interaction
 ACID support for transactions
 An architecture that provides higher per-node performance vis-à-vis
traditional RDBMS solutions
 Scale out, shared nothing architecture
 Non-locking concurrency control mechanism so that real time reads
will not conflict with writes.

Characteristic | SQL | NoSQL | NewSQL
Adherence to ACID properties | Yes | No | Yes
OLTP/OLAP support | Yes | No | Yes
Schema rigidity | Yes | No | Maybe
Adherence to data model | Relational model | Model-less | Relational model
Data format flexibility | No | Yes | Maybe
Scalability | Scale up (vertical scaling) | Scale out (horizontal scaling) | Scale out (horizontal scaling)
Distributed computing | Yes | Yes | Yes
Community support | Huge | Growing | Slowly growing
UNIT – III
Introduction to Hadoop and MapReduce Programming
Hadoop Overview, HDFS (Hadoop Distributed File System), Processing–
Data with Hadoop, Managing Resources and Applications with Hadoop
YARN (Yet another Resource Negotiator).
Introduction to MAPREDUCE Programming: Introduction, Mapper,
Reducer, Combiner, Partitioner, Searching, Sorting, Compression.

Q) Explain the differences between Hadoop and RDBMS


Parameters | RDBMS | Hadoop
System | Relational database management system | Node-based flat structure
Data | Suitable for structured data | Suitable for structured and unstructured data; supports a variety of formats (XML, JSON)
Processing | OLTP | Analytical, big data processing on Hadoop clusters
Choice | When the data needs consistent relationships | Big data processing, which does not require any consistent relationships between data
Processor | Needs expensive hardware or high-end processors to store huge volumes of data | Runs on commodity hardware with a lower configuration
Cost | Around $10,000 to $14,000 per terabyte of storage | Around $4,000 per terabyte of storage

Q) What is Hadoop? Explain features of hadoop.


 Hadoop is an open source framework that is meant for storage and
processing of big data in a distributed manner.
 It is the best solution for handling big data challenges.

Some important features of Hadoop are –


Open Source – Hadoop is an open source framework which means it
is available free of cost. Also, the users are allowed to change the
source code as per their requirements.
Distributed Processing – Hadoop supports distributed processing of
data i.e. faster processing. The data in Hadoop HDFS is stored in a
distributed manner and MapReduce is responsible for the parallel
processing of data.
Fault Tolerance – Hadoop is highly fault-tolerant. It creates three
replicas for each block (default) at different nodes.
Reliability – Hadoop stores data on the cluster in a reliable manner
that is independent of machine. So, the data stored in Hadoop
environment is not affected by the failure of the machine.
Scalability – Hadoop is compatible with commodity hardware, and we can
easily add or remove nodes from the cluster.
High Availability – The data stored in Hadoop is available to access
even after the hardware failure. In case of hardware failure, the data
can be accessed from another node.
The core components of Hadoop are –
1. HDFS: (Hadoop Distributed File System) – HDFS is the basic
storage system of Hadoop. The large data files running on a cluster
of commodity hardware are stored in HDFS. It can store data in a
reliable manner even when hardware fails. The key aspects of
HDFS are:
a. Storage component
b. Distributes data across several nodes
c. Natively redundant.
2. Map Reduce: MapReduce is the Hadoop layer that is responsible
for data processing. Applications are written in MapReduce to process
unstructured and structured data stored in HDFS.
It is responsible for the parallel processing of high volume of data
by dividing data into independent tasks. The processing is done in
two phases Map and Reduce.
The Map is the first phase of processing that specifies complex
logic code and the
Reduce is the second phase of processing that specifies light-
weight operations.
The key aspects of Map Reduce are:
a. Computational frame work
b. Splits a task across multiple nodes
c. Processes data in parallel

Q) Explain Hadoop Architecture with a neat sketch.

Fig. Hadoop Architecture


Hadoop Architecture is a distributed Master-slave architecture.
Master HDFS: Its main responsibility is partitioning the data storage
across the slave nodes. It also keeps track of the locations of data on
the DataNodes.
Master Map Reduce: It decides and schedules computation task on
slave nodes.
NOTE: Based on marks for the question explain hdfs daemons and
mapreduce daemons.

Q) Explain the following


a) Modules of Apache Hadoop framework
There are four basic or core components:
Hadoop Common: It is a set of common utilities and libraries which handle
other Hadoop modules. It makes sure that the hardware failures are
managed by Hadoop cluster automatically.
Hadoop YARN: It allocates resources which in turn allow different users to
execute various applications without worrying about the increased
workloads.
HDFS: It is a Hadoop Distributed File System that stores data in the form of
small memory blocks and distributes them across the cluster. Each data is
replicated multiple times to ensure data availability.
Hadoop MapReduce: It executes tasks in a parallel fashion by distributing
the data as small blocks.

b) Hadoop Modes of Installations


i. Standalone, or local mode: one of the least commonly
used environments, intended only for running and debugging
MapReduce programs. This mode uses neither HDFS nor
launches any of the Hadoop daemons.
ii. Pseudo-distributed mode(Cluster of One), which runs all
daemons on single machine. It is most commonly used in
development environments.
iii. Fully distributed mode, which is most commonly used in
production environments. This mode runs all daemons on a
cluster of machines rather than single one.

c) XML File configurations in Hadoop.

core-site.xml – This configuration file contains Hadoop core


configuration settings, for example, I/O settings, very common
for MapReduce and HDFS.
mapred-site.xml – This configuration file specifies a framework
name for MapReduce by setting mapreduce.framework.name
hdfs-site.xml – This configuration file contains HDFS daemons
configuration settings. It also specifies default block permission
and replication checking on HDFS.
yarn-site.xml – This configuration file specifies configuration
settings for ResourceManager and NodeManager.
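
The same keys that live in these XML files can also be read or overridden programmatically on a Hadoop Configuration object. A minimal sketch, assuming a single-node setup; the host, port and values shown are illustrative, not prescriptions:

import org.apache.hadoop.conf.Configuration;

public class ConfigDemo {
    public static void main(String[] args) {
        // Configuration() loads core-site.xml, hdfs-site.xml, etc. from the classpath
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");  // core-site.xml: default file system URI (assumed)
        conf.set("dfs.replication", "3");                   // hdfs-site.xml: default block replication factor
        conf.set("mapreduce.framework.name", "yarn");       // mapred-site.xml: run MapReduce on YARN
        System.out.println(conf.get("fs.defaultFS"));
    }
}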
Q) Explain features of HDFS. Discuss the design of Hadoop
distributed file system and concept in detail.

HDFS: (Hadoop Distributed File System) – HDFS is the basic storage


system of Hadoop. The large data files running on a cluster of commodity
hardware are stored in HDFS. It can store data in a reliable manner even
when hardware fails. The key aspects of HDFS are:
 HDFS was developed taking inspiration from the Google File System (GFS).
 Storage component: Stores data in hadoop
 Distributes data across several nodes: divides large file into blocks
and stores in various data nodes.
 Natively redundant: replicates the blocks in various data nodes.
 High Throughput Access: Provides access to data blocks which are
nearer to the client.
 Re-replicates data blocks when nodes fail.

Fig. Features of HDFS


HDFS Daemons:
(i) NameNode
 The NameNode is the master of HDFS that directs the slave
DataNodes to perform I/O tasks.
 Blocks: HDFS breaks large file into smaller pieces called blocks.
 rackID: NameNode uses rackID to identify data nodes in the
rack. (rack is a collection of datanodes with in the cluster)
NameNode keep track of blocks of a file.
 File System Namespace: NameNode is the book keeper of
HDFS. It keeps track of how files are broken down into blocks
and which DataNode stores these blocks. It is a collection of
files in the cluster.
 FsImage: file system namespace includes mapping of blocks of
a file, file properties and is stored in a file called FsImage.
 EditLog: namenode uses an EditLog (transaction log) to record
every transaction that happens to the file system metadata.
 NameNode is single point of failure of Hadoop cluster.
HDFS key points: block-structured file system; default replication factor: 3; default block size: 64 MB/128 MB.

Fig. HDFS Architecture


(ii) DataNode
 Multiple DataNodes per cluster. Each slave machine in the
cluster has a DataNode daemon for reading and writing HDFS
blocks of the actual file on the local file system.
 During pipeline read and write DataNodes communicate with
each other.
 It also continuously sends a “heartbeat” message to the NameNode
to ensure the connectivity between the NameNode and the data
node.
 If no heartbeat is received for a period of time, the NameNode
assumes that the DataNode has failed, and its blocks are re-replicated.

Fig. Interaction between NameNode and DataNode.


(iii)Secondary name node
 Takes snapshots of the HDFS metadata at intervals specified in the
Hadoop configuration.
 The memory requirement of the Secondary NameNode is the same as
that of the NameNode.
 But the Secondary NameNode runs on a different machine.
 In case of NameNode failure, the Secondary NameNode can be
configured manually to bring up the cluster, i.e., we make the
Secondary NameNode the NameNode.

File Read operation:


The steps involved in the File Read are as follows:
1. The client opens the file that it wishes to read from by calling open()
on the DFS.
2. The DFS communicates with the NameNode to get the location of data
blocks. NameNode returns with the addresses of the DataNodes that
the data blocks are stored on.
Subsequent to this, the DFS returns an FSDataInputStream to the client to
read from the file.
3. Client then calls read() on the stream DFSInputStream, which has
addresses of DataNodes for the first few block of the file.
4. Client calls read() repeatedly to stream the data from the DataNode.
5. When the end of the block is reached, DFSInputStream closes the
connection with the DataNode. It repeats the steps to find the best
DataNode for the next block and subsequent blocks.
6. When the client completes the reading of the file, it calls close() on the
DFSInputStream to close the connection.

Fig. File Read Anatomy

File Write operation:


1. The client calls create() on DistributedFileSystem to create a file.
2. An RPC call to the namenode happens through the DFS to create a
new file.
3. As the client writes data, the data is split into packets by
DFSOutputStream, which are then written to an internal queue, called the
data queue. The DataStreamer consumes the data queue.
4. Data streamer streams the packets to the first DataNode in the
pipeline. It stores packet and forwards it to the second DataNode in
the pipeline.
5. In addition to the internal queue, DFSOutputStream also manages an
“ack queue” of packets that are waiting to be acknowledged by
DataNodes.
6. When the client finishes writing the file, it calls close() on the stream.

Fig. File Write Anatomy
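
The read and write anatomies above can be exercised from a client program through Hadoop's FileSystem API. A minimal sketch, assuming the Hadoop client libraries are on the classpath; the NameNode URI and file path are illustrative:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Write: the DFS asks the NameNode to create the file, data flows down the DataNode pipeline
        Path path = new Path("/chp/demo.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the DFS gets block locations from the NameNode, then streams from the DataNodes
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}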

Special features of HDFS:


1. Data Replication: There is absolutely no need for a client application
to track all blocks. It directs client to the nearest replica to ensure
high performance.
2. Data Pipeline: A client application writes a block to the first
DataNode in the pipeline. Then this DataNode takes over and forwards
the data to the next node in the pipeline. This process continues for
all the data blocks, and subsequently all the replicas are written to
the disk.

Fig. File Replacement Strategy


Q) Explain basic HDFS File operations with an example.

1. Creating a directory:
Syntax: hdfs dfs -mkdir <path>
Eg. hdfs dfs -mkdir /chp

2. Remove a file in a specified path:
Syntax: hdfs dfs -rm <src>
Eg. hdfs dfs -rm /chp/abc.txt

3. Copy a file from the local file system to HDFS:
Syntax: hdfs dfs -copyFromLocal <src> <dst>
Eg. hdfs dfs -copyFromLocal /home/hadoop/sample.txt /chp/abc1.txt

4. To display the list of contents in a directory:
Syntax: hdfs dfs -ls <path>
Eg. hdfs dfs -ls /chp

5. To display the contents of a file:
Syntax: hdfs dfs -cat <path>
Eg. hdfs dfs -cat /chp/abc1.txt

6. Copy a file from HDFS to the local file system:
Syntax: hdfs dfs -copyToLocal <src> <dst>
Eg. hdfs dfs -copyToLocal /chp/abc1.txt /home/hadoop/Desktop/sample.txt

7. To display the last few lines of a file:
Syntax: hdfs dfs -tail <path>
Eg. hdfs dfs -tail /chp/abc1.txt

8. Display the aggregate length of a file in bytes:
Syntax: hdfs dfs -du <path>
Eg. hdfs dfs -du /chp

9. To count the number of directories, files and bytes under a given path:
Syntax: hdfs dfs -count <path>
Eg. hdfs dfs -count /chp
o/p: 1 1 60

10. Remove a directory from HDFS:
Syntax: hdfs dfs -rmr <path>
Eg. hdfs dfs -rmr /chp
Q) Explain the importance of MapReduce in Hadoop environment for
processing data.
 MapReduce programming helps to process massive amounts of
data in parallel.
 Input data set splits into independent chunks. Map tasks
process these independent chunks completely in a parallel
manner.
 Reduce tasks provide reduced output by combining the output
of the various mappers. There are two daemons associated with
MapReduce programming: JobTracker and TaskTracker.
JobTracker:
JobTracker is a master daemon responsible for executing the overall
MapReduce job.
It provides connectivity between Hadoop and your application.

Whenever code is submitted to a cluster, the JobTracker creates the
execution plan by deciding which task to assign to which node.

It also monitors all the running tasks. When a task fails, it automatically
re-schedules the task to a different node after a predefined number of
retries.

There is one JobTracker process running per Hadoop cluster. JobTracker
processes run on their own Java Virtual Machine process.

Fig. Job Tracker and Task Tracker interaction

TaskTracker:
This daemon is responsible for executing individual tasks that is
assigned by the Job Tracker.

Task Tracker continuously sends heartbeat message to job tracker.


When a job tracker fails to receive a heartbeat message from a
TaskTracker, the JobTracker assumes that the TaskTracker has failed
and resubmits the task to another available node in the cluster.

Map Reduce Framework


Phases:
- Map: converts input into key-value pairs.
- Reduce: combines the output of the mappers and produces a reduced result set.
Daemons:
- JobTracker: master; schedules tasks.
- TaskTracker: slave; executes tasks.

MapReduce working:
MapReduce divides a data analysis task into two parts – Map and
Reduce. In the example given below, there are two mappers and one
reducer.
Each mapper works on the partial data set that is stored on that node
and the reducer combines the output from the mappers to produce
the reduced result set.
Steps:
1. First, the input dataset is split into multiple pieces of data.
2. Next, the framework creates a master and several slave processes
and executes the worker processes remotely.
3. Several map tasks work simultaneously and read pieces of data
that were assigned to each map task.
4. Map worker uses partitioner function to divide the data into
regions.
5. When the map slaves complete their work, the master instructs the
reduce slaves to begin their work.
6. When all the reduce slaves complete their work, the master
transfers the control to the user program.

Fig. MapReduce Programming Architecture


A MapReduce program in Java requires three classes:
1. Driver class: This class specifies the job configuration details.
2. Mapper class: This class overrides the map function based on the
problem statement.
3. Reducer class: This class overrides the reduce function based on the
problem statement.
NOTE: Based on marks given write MapReduce example if necessary with
program.
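
For reference, a minimal word-count sketch (the standard introductory example, assuming the Hadoop MapReduce client libraries are on the classpath) showing the three classes together:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper class: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer class: sums the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver class: job configuration details
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local reducer (combiner)
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}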

Q) Explain difference between Hadoop1X and Hadoop2X

Limitations of Hadoop 1.0: HDFS and MapReduce are core components,


while other components are built around the core.
1. Single namenode is responsible for entire namespace.
2. It has a restricted processing model, suitable only for batch-oriented
MapReduce jobs.
3. No support for interactive analysis.
4. Not suitable for Machine learning algorithms, graphs, and other
memory intensive algorithms
5. MapReduce is responsible for cluster resource management and data
Processing.
HDFS Limitation: The NameNode can quickly become overwhelmed with
load on the system increasing. In Hadoop 2.x this problem is resolved.

Hadoop 2: Hadoop 2.x is YARN based architecture. It is general processing


platform. YARN is not constrained to MapReduce only. One can run multiple
applications in Hadoop 2.x in which all applications share common resource
management.

Hadoop 2.x can be used for various types of processing such as Batch,
Interactive, Online, Streaming, Graph and others.

HDFS 2 consists of two major components


a) NameSpace: Takes care of file related operations such as creating
files, modifying files and directories
b) Block storage service: It handles data node cluster management
and replication.

HDFS 2 Features:
Horizontal scalability: HDFS Federation uses multiple independent
NameNodes for horizontal scalability. The DataNodes are common storage
for blocks and shared by all NameNodes. All DataNodes in the cluster
register with each NameNode in the cluster.
High availability: High availability of NameNode is obtained with the help
of Passive Standby NameNode.

Active-Passive NameNodes handle failover automatically. All namespace
edits are recorded to shared NFS (Network File System) storage, and there
is a single writer at any point of time.
The Passive NameNode reads edits from the shared storage and keeps its
metadata information up to date.
In case of Active NameNode failure, the Passive NameNode becomes the
Active NameNode automatically. It then starts writing to the shared storage.

Fig. Active and Passive NameNode Interaction

Fig. Comparing Hadoop1.0 and Hadoop 2.0


Hadoop 1.x | Hadoop 2.x
1. Supports the MapReduce (MR) processing model only; does not support non-MR tools. | Allows working in MR as well as other distributed computing models like Spark and HBase coprocessors.
2. MR does both processing and cluster-resource management. | YARN does cluster resource management, and processing is done using different processing models.
3. Has limited scaling of nodes; limited to 4,000 nodes per cluster. | Has better scalability; scalable up to 10,000 nodes per cluster.
4. Works on the concept of slots; slots can run either a Map task or a Reduce task only. | Works on the concept of containers; containers can run generic tasks.
5. A single NameNode manages the entire namespace. | Multiple NameNode servers manage multiple namespaces.
6. Has a single point of failure (SPOF) because of the single NameNode. | Overcomes SPOF with a standby NameNode; in case of NameNode failure, it is configured for automatic recovery.
7. The MR API is compatible with Hadoop 1.x; a program written for Hadoop 1.x executes in Hadoop 1.x without any additional files. | The MR API requires additional files for a program written for Hadoop 1.x to execute in Hadoop 2.x.
8. Limited in serving as a platform for event processing, streaming and real-time operations. | Can serve as a platform for a wide variety of data analytics; it is possible to run event processing, streaming and real-time operations.
9. Does not support Microsoft Windows. | Adds support for Microsoft Windows.

Q) Explain in detail about YARN?

The fundamental idea behind the YARN (Yet Another Resource Negotiator)
architecture is to split the JobTracker's responsibility of resource
management and job scheduling/monitoring into separate daemons.

Basic concepts of YARN are Application and Container.


Application is a job submitted to system.
Ex: MapReduce job.
Container: Basic unit of allocation. Replaces fixed map/reduce slots. Fine-
grained resource allocation across multiple resource types.
Eg. Container_0: 2GB, 1CPU
Container_1: 1GB, 6CPU

Daemons that are part of YARN architecture are:


1. Global Resource Manager: The main responsibility of Global Resource
Manager is to distribute resources among various applications.
It has two main components:
Scheduler: The pluggable scheduler of ResourceManager decides
allocation of resources to various running applications. The scheduler is just
that, a pure scheduler, meaning it does NOT monitor or track the status of
the application.
Application Manager: It does:
o Accepting job submissions.
o Negotiating resources(container) for executing the
application specific ApplicationMaster
o Restarting the ApplicationMaster in case of failure

2. NodeManager:
o This is a per-machine slave daemon. The NodeManager's
responsibility is to launch the application containers for
application execution.
o NodeManager monitors the resource usage such as memory,
CPU, disk, network, etc.
o It then reports the usage of resources to the global
ResourceManager.

3. Per-Application Application Master: Per-application Application


master is an application-specific entity. Its responsibility is to
negotiate the required resources for execution from the ResourceManager.
It works along with the NodeManager for executing and monitoring
component tasks.

Fig. YARN Architecture

The steps involved in YARN architecture are:


1. The client program submits an application.
2. The Resource Manager launches the Application Master by assigning
some container.
3. The Application Master registers with the Resource manager.
4. On successful container allocations, the application master launches
the container by providing the container launch specification to the
NodeManager.
5. The NodeManager executes the application code.
6. During the application execution, the client that submitted the job
directly communicates with the Application Master to get status,
progress updates.
7. Once the application has been processed completely, the application
master deregisters with the ResourceManager and shuts down
allowing its own container to be repurposed.

Q) Explain Hadoop Ecosystem in detail.

The following are the components of Hadoop ecosystem:


1. HDFS: Hadoop Distributed File System. It simply stores data files as
close to the original form as possible.
2. HBase: It is Hadoop’s distributed column based database. It supports
structured data storage for large tables.
3. Hive: It is a Hadoop’s data warehouse, enables analysis of large data
sets using a language very similar to SQL. So, one can access data
stored in hadoop cluster by using Hive.
4. Pig: Pig is an easy-to-understand data flow language. It helps with the
analysis of large data sets, which is quite the order of the day with Hadoop,
without writing code in the MapReduce paradigm.
5. ZooKeeper: It is an open-source application that configures and
synchronizes distributed systems.
6. Oozie: It is a workflow scheduler system to manage apache hadoop
jobs.
7. Mahout: It is a scalable Machine Learning and data mining library.
8. Chukwa: It is a data collection system for managing large distributed
systems.
9. Sqoop: it is used to transfer bulk data between Hadoop and
structured data stores such as relational databases.
10. Ambari: it is a web based tool for provisioning, Managing
and Monitoring Apache Hadoop clusters.

Q) Describe differences between SQL and MapReduce


Characteristic | SQL | MapReduce (Hadoop 1.x)
Access | Interactive and batch | Batch
Structure | Static | Dynamic
Updates | Read and write many times | Write once, read many times
Integrity | High | Low
Scalability | Nonlinear | Linear

Q) What is MapReduce. Explain indetail different phases in


MapReduce. (or) Explain MapReduce anatomy.

MapReduce is a programming model for data processing. Hadoop can run


MapReduce programs written in Java, Ruby and Python.
MapReduce programs are inherently parallel; thus, very large-scale data
analysis can be done quickly.
In MapReduce programming, Jobs(applications) are split into a set of map
tasks and reduce tasks.
Map task takes care of loading, parsing, transforming and filtering.
The responsibility of reduce task is grouping and aggregating data that is
produced by map tasks to generate final output.
Each map task is broken down into the following phases:
1. Record Reader 2. Mapper
3. Combiner 4.Partitioner.
The output produced by the map task is known as intermediate <key,
value> pairs. These intermediate <key, value> pairs are sent to the reducer.
The reduce tasks are broken down into the following phases:
1. Shuffle 2. Sort
3. Reducer 4. Output format.
Hadoop assigns map tasks to the DataNodes where the actual data to be
processed resides. This way, Hadoop ensures data locality. Data locality
means that data is not moved over the network; only the computational code
is moved to the data, which saves network bandwidth.
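
As a quick illustration (the data here is made up), suppose a word count job
receives two input lines, "cat dog" and "dog dog". The flow of <key, value>
pairs would be:

Map output:            (cat,1) (dog,1) (dog,1) (dog,1)
After shuffle & sort:  (cat,[1])  (dog,[1,1,1])
Reduce output:         (cat,1) (dog,3)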

Mapper Phases:
The Mapper maps the input <key, value> pairs into a set of intermediate
<key, value> pairs.
Each map task is broken into following phases:

1. RecordReader: It converts the byte-oriented view of the input into a
record-oriented view and presents it to the Mapper tasks, i.e. it
presents the tasks with keys and values.
i) InputFormat: It reads the given input file and splits it using the
method getSplits().
ii) It then defines a RecordReader using createRecordReader(),
which is responsible for generating <key, value> pairs.

2. Mapper: The map function works on the <key, value> pairs produced by
the RecordReader and generates intermediate (key, value) pairs.
Methods:
- protected void cleanup(Context context): called once at the end of
the task.
- protected void map(KEYIN key, VALUEIN value, Context
context): called once for each key-value pair in the input split.
- void run(Context context): the user can override this method for
complete control over execution of the Mapper.
- protected void setup(Context context): called once at the
beginning of the task to perform the activities required to initiate the map()
method.

3. Combiner: It takes the intermediate <key, value> pairs produced by a
mapper and applies a user-specified aggregate function to the output of
that single mapper. It is also known as a local reducer.
We can optionally specify a combiner using
Job.setCombinerClass(ReducerClass) to perform local aggregation on the
intermediate outputs.

Fig. MapReduce without Combiner class

Fig. MapReduce with Combiner class

4. Partitioner: It takes the intermediate <key, value> pairs produced by
the mapper and splits them into partitions using a user-defined
condition.
The default behavior is to hash the key to determine the reducer. The
user can control this by overriding the method:
int getPartition(KEY key, VALUE value, int numPartitions)
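
For reference, the default partitioner (HashPartitioner in
org.apache.hadoop.mapreduce.lib.partition) essentially hashes the key and
takes the result modulo the number of reducers. A sketch of that logic (the
class name HashLikePartitioner is made up for illustration):

import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
public int getPartition(K key, V value, int numPartitions) {
// mask off the sign bit so the result is non-negative, then take the modulo
return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}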
Reducer Phases:
1. Shuffle & Sort:
 Downloads the grouped key-value pairs onto the local machine,
where the Reducer is running.
 The individual <key, value> pairs are sorted by key into a
larger data list.
 The data list groups the equivalent keys together so that their
values can be iterated easily in the Reducer task.
2. Reducer:
 The Reducer takes the grouped key-value paired data as input
and runs a Reducer function on each one of them.
 Here, the data can be aggregated, filtered, and combined in a
number of ways, and it requires a wide range of processing.
 Once the execution is over, it gives zero or more key-value pairs
to the final step.
Methods:
- protected void cleanup(Context context): called once at the end of the
task.
- protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context
context): called once for each key, with all the values grouped for that key.
- void run(Context context): the user can override this method for
complete control over execution of the Reducer.
- protected void setup(Context context): called once at the
beginning of the task to perform the activities required to initiate the
reduce() method.

3. Output format:
 In the output phase, we have an output formatter that
translates the final key-value pairs from the Reducer function
and writes them onto a file using a record writer.
Compression: In MapReduce programming we can compress the output
file. Compression provides two benefits as follows:
 Reduces the space to store files.
 Speeds up data transfer across the network.
We can specify compression format in the Driver program as below:

conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class,
CompressionCodec.class);
Here, codec is the implementation of a compression and decompression
algorithm, GzipCodec is the compression algorithm for gzip.
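
The two properties above belong to the older mapred API. With the
org.apache.hadoop.mapreduce API used in the programs below, roughly the same
effect can be obtained through FileOutputFormat in the Driver (a sketch; job is
the Job object created in main()):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// in the Driver, after the Job has been configured
FileOutputFormat.setCompressOutput(job, true);                    // compress the job output
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // use the gzip codec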

Q) Write a MapReduce program for the WordCount problem.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount


{
public static class WCMapper extends Mapper <Object, Text, Text,
IntWritable>

{
final static IntWritable one = new IntWritable(1);
Text word = new Text();
public void map(Object key, Text value, Context context) throws
IOException, InterruptedException {

StringTokenizer itr = new StringTokenizer(value.toString());


while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class WCReducer extends Reducer<Text, IntWritable, Text,
IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context ) throws IOException, InterruptedException {

int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(WCMapper.class);
job.setReducerClass(WCReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Fig. MapReduce paradigm for WordCount
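
To run the job (illustrative only; the jar name and HDFS paths are
hypothetical), the class is packaged into a jar and submitted with the
hadoop jar command:

hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output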


Q) Write a MapReduce program to calculate the total employee salary of each
department in the university.
I/P:
001,it,10000
002,cse,20000
003,it,30000

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Salary
{
public static class SalaryMapper extends Mapper<LongWritable, Text, Text,
IntWritable>
{
public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException
{
// input record format: id,department,salary
String[] token = value.toString().split(",");
int s = Integer.parseInt(token[2]);
IntWritable sal = new IntWritable();
sal.set(s);
// key = department, value = salary
context.write(new Text(token[1]), sal);
}
}

public static class SalaryReducer extends Reducer<Text, IntWritable, Text,
IntWritable>
{
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException
{
// sum all salaries of the department
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Salary");
job.setJarByClass(Salary.class);
job.setMapperClass(SalaryMapper.class);
job.setReducerClass(SalaryReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
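
For the sample input shown above, the salaries of the it department
(10000 + 30000) and the cse department (20000) are summed per department,
so the expected output is:

cse    20000
it     40000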

Q) Write a user-defined partitioner class for the WordCount problem.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text,IntWritable>{

public int getPartition(Text key, IntWritable value, int numPartitions){

String word = key.toString();
char alphabet = word.toUpperCase().charAt(0);
int partitionNumber = 0;
switch(alphabet){
case 'A': partitionNumber = 1;break;
case 'B': partitionNumber = 2;break;
case 'C': partitionNumber = 3;break;
case 'D': partitionNumber = 4;break;
case 'E': partitionNumber = 5;break;
case 'F': partitionNumber = 6;break;
case 'G': partitionNumber = 7;break;
case 'H': partitionNumber = 8;break;
case 'I': partitionNumber = 9;break;
case 'J': partitionNumber = 10;break;
case 'K': partitionNumber = 11;break;
case 'L': partitionNumber = 12;break;
case 'M': partitionNumber = 13;break;
case 'N': partitionNumber = 14;break;
case 'O': partitionNumber = 15;break;
case 'P': partitionNumber = 16;break;
case 'Q': partitionNumber = 17;break;
case 'R': partitionNumber = 18;break;
case 'S': partitionNumber = 19;break;
case 'T': partitionNumber = 20;break;
case 'U': partitionNumber = 21;break;
case 'V': partitionNumber = 22;break;
case 'W': partitionNumber = 23;break;
case 'X': partitionNumber = 24;break;
case 'Y': partitionNumber = 25;break;
case 'Z': partitionNumber = 26;break;
default: partitionNumber = 0;break;
}
// words starting with A-Z go to partitions 1-26, everything else to partition 0
return partitionNumber;
}
}

In the driver program, set the partitioner class as shown below. The job uses
27 reduce tasks because words starting with A-Z go to partitions 1-26 and all
other words go to partition 0:

job.setNumReduceTasks(27);
job.setPartitionerClass(WordCountPartitioner.class);

Q) Write a MapReduce program for sorting the following data according to
name.
Input:
001,chp
002,vr
003,pnr
004,prp

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Sort{


public static class SortMapper extends Mapper
<LongWritable,Text,Text,Text> {
protected void map(LongWritable key, Text value, Context
context) throws IOException,InterruptedException{

String[] token = value.toString().split(",");


context.write(new Text(token[1]),new Text(token[0]+"-"+token[1]));
}
}

public static class SortReducer extends


Reducer<Text,Text,NullWritable,Text>{
public void reduce(Text key, Iterable<Text> values, Context
context) throws IOException,InterruptedException{

for(Text details:values){
context.write(NullWritable.get(),details);
}
}
}

public static void main(String args[]) throws


IOException,InterruptedException,ClassNotFoundException{

Configuration conf = new Configuration();


Job job = new Job(conf);
job.setJarByClass(Sort.class);
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true)?0:1);
}
}
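
With the sample input above and the default single reducer, the map output key
is the name, so the values come out in name order; the expected output is:

001-chp
003-pnr
004-prp
002-vr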
Q) Write a MapReduce program to arrange the data on user-id, then
within the user id sort them in increasing order of the page count.
Input:
001,3,www.turorialspoint.com
001,4,www.javapoint.com
002,5,www.javapoint.com
003,2,www.analyticsvidhya.com

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Sort{
public static class SortMapper extends
Mapper<LongWritable,Text,Text,Text>{
Text comp_key = new Text();
protected void map(LongWritable key, Text value, Context
context) throws IOException,InterruptedException{

String[] token = value.toString().split(",");


// build a composite key "user-id,page count" so that records are ordered
// by user-id first and then by page count (this works here because the page
// counts are single digits; a full solution would use a custom composite key)
comp_key.set(token[0] + "," + token[1]);
context.write(comp_key,new Text(token[0]+"-"+token[1]+"-"+token[2]));
}
}

public static class SortReducer extends


Reducer<Text,Text,NullWritable,Text>{
public void reduce(Text key, Iterable<Text> values, Context
context) throws IOException,InterruptedException{

for(Text details:values){
context.write(NullWritable.get(),details);
}
}
}

public static void main(String args[]) throws


IOException,InterruptedException,ClassNotFoundException{
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(Sort.class);
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true)?0:1);
}
}
o/p:
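With the sample input above, the corrected composite key and a single reducer,
the keys sort as 001,3 < 001,4 < 002,5 < 003,2, so the expected output is:

001-3-www.turorialspoint.com
001-4-www.javapoint.com
002-5-www.javapoint.com
003-2-www.analyticsvidhya.com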

Q) Write a MapReduce program to search for an employee name in the
following data:
Input:
001,chp
002,vr
003,pnr
004,prp
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class Search{


public static class SearchMapper extends Mapper<LongWritable,
Text, Text, Text>{
static String keyword;
static int pos=0;
protected void setup(Context context) throws IOException,
InterruptedException {
Configuration config = context.getConfiguration();
keyword = config.get("keyword");
}
protected void map(LongWritable key, Text value, Context context) throws
IOException,InterruptedException{
InputSplit in = context.getInputSplit();
FileSplit f = (FileSplit)in;
String fileName = f.getPath().getName();
Integer wordPos;
pos++;
if(value.toString().contains(keyword)){
wordPos = value.find(keyword);
context.write(value, new Text(fileName + ","+new
IntWritable(pos).toString()+","+wordPos.toString()));
}
}
}

public static class SearchReducer extends Reducer


<Text,Text,Text,Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws
IOException,InterruptedException{
for(Text value : values){
context.write(key,value);
}
}
}
public static void main(String args[]) throws
IOException,InterruptedException,ClassNotFoundException{
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(Search.class);
job.setMapperClass(SearchMapper.class);
job.setReducerClass(SearchReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setNumReduceTasks(1);
job.getConfiguration().set("keyword","chp");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true)?0:1);
}
}
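
As an illustration, if the sample input is stored in a file named emp.txt (the
file name here is hypothetical), the keyword "chp" matches only the first
record; the line counter is 1 and value.find("chp") returns offset 4, so the
output would look roughly like:

001,chp    emp.txt,1,4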

Q) What are the real-time applications using MapReduce programming?

 Social networks
 Banking
 Media and Entertainment
 Stock Market
 Health Care
 Weather Forecasting
 Business
