
CASE STUDY ON APPLICATION OF HADOOP

TEAM NAME: - “UDAAN”


Nirbhay Pandey (B.B.A)
Alok Kumar Chaturvedi (B.C.A)
Ashraful Haque (B.B.A)
Pintu kumar (B.B.A)
Saurav raj (B.B.A)

Quantum University,
Mandawara (22 km milestone),
Roorkee-Dehradun Highway.

History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin lies in the Google File System paper published by Google.

Let's trace the history of Hadoop through the following steps:


o In 2002, Doug Cutting and Mike Cafarella started working on Apache Nutch, an open-source web crawler software project.
o While working on Apache Nutch, they had to deal with big data. Storing that data proved very costly, which became a serious obstacle for the project and one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google File System), a proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on MapReduce, a technique that simplifies data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System), which also included a MapReduce implementation.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released the same year.
o Doug Cutting named the project Hadoop after his son's toy elephant.
o In 2007, Yahoo was running two clusters of 1,000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster in 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.

INTRODUCTION TO HADOOP:-

Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications in scalable clusters of computer servers. It's at the center of an ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning. Hadoop systems can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing, analysing and managing data than relational databases and data warehouses provide.
Hadoop's ability to process and store different types of data makes it a
particularly good fit for big data environments. They typically involve not
only large amounts of data, but also a mix of structured transaction
data and semistructured and unstructured information, such as internet
clickstream records, web server and mobile application logs, social media
posts, customer emails and sensor data from the internet of things (IoT).

Formally known as Apache Hadoop, the technology is developed as part of an open-source project within the Apache Software Foundation. Multiple vendors offer commercial Hadoop distributions, although the number of Hadoop vendors has declined because of an overcrowded market and competitive pressures driven by the increased deployment of big data systems in the cloud. The shift to the cloud also enables users to store data in lower-cost cloud object storage services instead of Hadoop's namesake file system; as a result, Hadoop's role is being reduced in some big data architectures.

Hadoop and big data:-


Hadoop runs on commodity servers and can scale up to support
thousands of hardware nodes. The Hadoop Distributed File System
(HDFS) is designed to provide rapid data access across the nodes in a
cluster, plus fault-tolerant capabilities so applications can continue to
run if individual nodes fail. Those features helped Hadoop become a
foundational data management platform for big data analytics uses
after it emerged in the mid-2000s.

Because Hadoop can process and store such a wide assortment of data, it
enables organizations to set up data lakes as expansive reservoirs for
incoming streams of information. In a Hadoop data lake, raw data is often
stored as is so data scientists and other analysts can access the full data
sets, if need be; the data is then filtered and prepared by analytics or IT
teams, as needed, to support different applications.

Data lakes generally serve different purposes than traditional data warehouses that hold cleansed sets of transaction data. But, in some cases, companies view their Hadoop data lakes as modern-day data warehouses. Either way, the growing role of big data analytics in business decision-making has made effective data governance and data security processes a priority in data lake deployments and Hadoop systems in general.

Components of Hadoop and how it works:-


The core components in the first iteration of Hadoop were MapReduce,
HDFS and Hadoop Common, a set of shared utilities and libraries. As its
name indicates, MapReduce uses map and reduce functions to split
processing jobs into multiple tasks that run at the cluster nodes where data is
stored and then to combine what the tasks produce into a coherent set of
results. MapReduce initially functioned as both Hadoop's processing engine
and cluster resource manager, which tied HDFS directly to it and limited
users to running MapReduce batch applications.
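To make the map-and-reduce pattern described above concrete, below is a minimal word-count sketch against the standard Hadoop MapReduce Java API. It is only an illustration under assumptions: the class name and the two command-line arguments (an HDFS input directory and a not-yet-existing output directory) are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: runs on the nodes holding the input splits and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: combines all the counts emitted for one word into a single result.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mapper runs next to the data it reads and the reducer merges the partial counts, which is exactly the split-the-work-then-combine-the-results behaviour described above.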

That changed in Hadoop 2.0, which became generally available in October 2013 when version 2.2.0 was released. It introduced Apache Hadoop YARN, a new cluster resource management and job scheduling technology that took over those functions from MapReduce. YARN -- short for Yet Another Resource Negotiator, but typically referred to by the acronym alone -- ended the strict reliance on MapReduce and opened up Hadoop to other processing engines and various applications besides batch jobs. For example, Hadoop can now run applications on the Apache Spark, Apache Flink, Apache Kafka and Apache Storm engines.

In Hadoop clusters, YARN sits between HDFS and the processing engines deployed by users. The resource manager uses a combination of containers, application coordinators and node-level monitoring agents to dynamically allocate cluster resources to applications and oversee the execution of processing jobs in a decentralized process. YARN supports multiple job scheduling approaches, including a first-in-first-out queue and several methods that schedule jobs based on assigned cluster resources.
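As a small, hedged illustration of YARN acting as this resource layer, the sketch below uses the YarnClient API to list the nodes and applications the ResourceManager is tracking. It assumes a Hadoop 3.x client with a yarn-site.xml on the classpath that points at a reachable ResourceManager.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterOverview {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager.
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();

        // Node-level view: the NodeManagers YARN is monitoring and their capacity.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.printf("node %s: %d MB, %d vcores%n",
                    node.getNodeId(),
                    node.getCapability().getMemorySize(),
                    node.getCapability().getVirtualCores());
        }

        // Application-level view: every job YARN is tracking, whatever engine submitted it.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.printf("app %s (%s): %s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }

        yarn.stop();
    }
}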

The Hadoop 2.0 series of releases also added high availability and
federation features for HDFS, support for running Hadoop clusters on
Microsoft Windows servers and other capabilities designed to expand the
distributed processing framework's versatility for big data management and
analytics.

Hadoop 3.0.0 was the next major version of Hadoop. Released by Apache in December 2017, it added a YARN Federation feature designed to enable YARN to support tens of thousands of nodes or more in a single cluster, up from a previous 10,000-node limit. The new version also included support for GPUs and erasure coding, an alternative to data replication that requires significantly less storage space.

Subsequent 3.1.x and 3.2.x updates enabled Hadoop users to run YARN containers inside Docker ones and introduced a YARN service framework that functions as a container orchestration platform. Two new Hadoop components were also added with those releases: a machine learning engine called Hadoop Submarine and the Hadoop Ozone object store, which is built on the Hadoop Distributed Data Store block storage layer and designed for use in on-premises systems.

Modules of Hadoop
1. HDFS (Hadoop Distributed File System): Google published its GFS paper, and HDFS was developed on the basis of it. Files are broken into blocks and stored on nodes across the distributed architecture (a minimal API sketch follows this list).
2. YARN (Yet Another Resource Negotiator): used for job scheduling and for managing the cluster.
3. MapReduce: a framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: the Java libraries used to start Hadoop, which are also used by the other Hadoop modules.
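To show how HDFS exposes this block-and-node model to programs, here is a minimal Java sketch using the org.apache.hadoop.fs.FileSystem API. The file path is a placeholder, and the sketch assumes a client whose core-site.xml and hdfs-site.xml point at an existing HDFS cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS should point at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS breaks it into blocks and replicates them across DataNodes.
        Path file = new Path("/tmp/hadoop-demo/sample.txt");   // placeholder path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello from HDFS");
        }

        // Read the file back through the same API.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        // Ask the NameNode where each block of the file is physically stored.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("block hosted on: " + String.join(", ", block.getHosts()));
        }
    }
}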
Hadoop Architecture:-

The Hadoop architecture is a package of the file system, the MapReduce engine and HDFS (Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the NameNode and JobTracker, whereas each slave node includes a DataNode and TaskTracker.

Advantages of Hadoop
o Fast: In HDFS, data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools that process the data are often on the same servers, reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended simply by adding nodes to the cluster.
o Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost effective compared with a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node goes down or some other network failure happens, Hadoop uses another copy of the data. Normally, data is replicated three times, but the replication factor is configurable (see the short sketch after this list).
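As a small illustration of that last point, the replication factor can be changed per file through the same FileSystem API. This is only a sketch; the path is a placeholder and the file is assumed to already exist in HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // default for files created by this client

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/hadoop-demo/sample.txt");   // placeholder path

        // Ask the NameNode to keep two copies of this particular file instead of the default.
        fs.setReplication(file, (short) 2);
        System.out.println("replication now: " + fs.getFileStatus(file).getReplication());
    }
}

The same change can also be made from the command line with the hdfs dfs -setrep command.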

Features Of Hadoop
 It is best-suited for Big Data analysis

Typically, Big Data has an unstructured and distributed nature. This is what makes Hadoop clusters best suited for Big Data analysis. Hadoop functions on the ‘data locality’ concept, which means that instead of the actual data, the processing logic flows to the computing nodes, thereby consuming less network bandwidth. This increases the efficiency of Hadoop applications.
 It is scalable

The best thing about Hadoop clusters is that you can scale them
to any extent by adding additional cluster nodes to the network
without incorporating any modifications to application logic.
So, as the Big Data volume, variety, and velocity increase, you
can also scale the Hadoop cluster to accommodate the growing
data needs.
 It is fault-tolerant

In the Hadoop ecosystem, there’s a provision to replicate the input data to other cluster nodes as well. Thus, if ever a cluster node fails, data processing will not come to a standstill, as another cluster node can replace the failed node and continue the process.
Hadoop Applications in the real-world
1. Security and Law Enforcement
Hadoop is now used as an active tool in law enforcement. Thanks to its speedy and reliable Big Data analysis, Hadoop is helping law enforcement agencies (such as police departments) become more proactive, efficient, and accountable. For instance, the National Security Agency of the USA uses Hadoop to prevent terrorist attacks. Since Hadoop can help detect security breaches and suspicious activities in real time, it has become an effective tool to predict criminal activity and catch criminals.
2. Enhance customer satisfaction and monitor online
reputation
Businesses are now using Hadoop to analyze sales data and compare it against many other factors to determine when a specific product sells best. By continually monitoring sales data, business owners can find out why certain products sell better on particular days, at particular hours, or in particular seasons. In the same way, Hadoop can also mine social media and online conversations to see what your customers (both existing and potential) are saying about you on online platforms. It monitors the sentiments behind the comments and feedback of the customers. This insight helps marketers and business owners analyze customer pain points and understand what customers expect from the brand. All of this vital information can be used by businesses and companies to enhance the quality of their products, boost customer satisfaction, and improve their online reputation.
3. Monitor patient vitals
Many hospitals have started leveraging Hadoop to make their staff more productive in their work processes. Healthcare systems and machines generate large volumes of unstructured data. Conventional data processing systems cannot process and analyze such large quantities of raw data. However, Hadoop can. An excellent case in point is when Children’s Healthcare of Atlanta fitted sensors beside the beds in its ICU to continually track the vitals of child patients, such as blood pressure, heartbeat, and respiratory rate. The primary aim was to store and analyze these critical signs and be alerted if there was ever any change in the patterns. This allowed the healthcare provider to promptly send a team of doctors and medical assistants to check on patients in need. This was made possible using core components of the Hadoop ecosystem – Hive, Flume, Impala, Spark, and Sqoop.
4. Healthcare Intelligence
Healthcare insurance companies usually combine all the associated
costs (including the risks involved) and equally divide it by the total
number of members in a particular group. Naturally, the outcomes
are always dynamic since they keep changing. This is where
Hadoop’s scalable and inexpensive feature can be highly useful.
Hadoop can efficiently accommodate dynamic data and scale
according to the ever-changing needs. By using Hadoop-based
healthcare intelligence apps, both healthcare providers and
healthcare insurance companies can devise smart business solutions
at an affordable cost.
Let’s assume that a healthcare insurance company wishes to find the age below which people in a region aren’t prone to a specific disease, so that it can calculate the approximate cost of an insurance policy. To gather the age data of the people in that region, however, the company would have to invest a large sum of money in processing and analyzing vast volumes of datasets to extract relevant information about the disease in question, its symptoms, its typical victims, and so on. This is where Hadoop components like Pig, Hive, and MapReduce can come in handy – they can process large datasets at relatively low costs.
5. Track clickstream data
Essentially, Hadoop’s primary function is to store, process, and
analyze massive volumes of data, including clickstream data.
Hadoop can successfully capture the following:
 Where did a visitor originate from before reaching a particular website?
 What search term did the visitor use that led to the website?
 Which webpage did the visitor open first?
 What are the other webpages that interested the visitor?
 How much time did the visitor spend on each page?
 What product/service did the visitor decide to buy?
By helping you find the answers to all such questions, Hadoop offers an analysis of user engagement and website performance. Thus, by leveraging Hadoop, companies of all shapes and sizes can conduct clickstream analysis to optimize the user path, predict what product/service the customer is likely to buy next, and decide where to allocate their web resources.
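As a rough sketch of answering one of those questions with MapReduce, the job below totals the time visitors spent on each page. The log format it parses (visitorId, page and seconds-on-page separated by tabs) is a made-up assumption; real clickstream logs would need their own parsing.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TimePerPage {

    // Mapper: parse one log line and emit (page, secondsOnPage).
    public static class PageTimeMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed format: visitorId <TAB> page <TAB> secondsOnPage
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) {
                return;   // skip malformed lines
            }
            try {
                long seconds = Long.parseLong(fields[2]);
                context.write(new Text(fields[1]), new LongWritable(seconds));
            } catch (NumberFormatException ignored) {
                // skip lines whose time field is not numeric
            }
        }
    }

    // Reducer: sum the seconds recorded for each page across all visitors.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "time per page");
        job.setJarByClass(TimePerPage.class);
        job.setMapperClass(PageTimeMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // clickstream logs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}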
6. Track geolocation data
Smartphones have become a crucial part of our lives now. With
the number of smartphone users around the world increasing as
we speak, these tiny devices are the heartbeat of the digital
world. So, why not capitalize on this opportunity and use
smartphones to your advantage? Businesses can use Hadoop to
track the geolocation data on smartphones and tablets to track
customers’ movements, behavior patterns, purchases, and
predict their next move. Not just that, Hadoop clusters can also
streamline massive amounts of geolocation data and help
organizations to identify the challenges in their business and
operation processes.
7. Track sensor data
Today, electronic gadgets and machines are using sensors to
enhance the user experience and more importantly, to harvest
customer data. The growing trend toward incorporating sensors
has become more pronounced following the increasing
adoption of IoT devices. In fact, sensor data is among the
fastest-growing data types now. Devices and machines are
infused with advanced sensors that can monitor and track a
host of features like temperature, speed, pressure, proximity,
location, image, price, motion, and much more. Since sensor
data tends to become overwhelming with time, Hadoop is the
best and most effective solution to track, store, and analyze
sensor data. By tracking and monitoring sensor data,
companies can obtain operational insights into their business
and improve their processes accordingly.
8. Strengthen security and compliance
Hadoop can efficiently analyze server-log data and respond to a security breach in real time. Server logs are computer-generated logs that capture network data operations, particularly security and regulatory compliance data. Server logs provide companies and organizations with important insights pertaining to network usage, security threats and compliance. Hadoop is a perfect fit for staging and analyzing this data. It is an excellent tool for extracting errors or detecting the occurrence of any suspicious event in a system (for example, login failures). By loading the server logs into Hadoop, network admins can identify the cause of a security breach and fix the issue promptly, as the sketch below illustrates.
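As a minimal, hedged sketch of that workflow, the map-only job below scans server logs stored in HDFS and keeps only the lines that match a suspicious-event pattern. The 'login failure' and 'failed password' patterns are placeholders; real logs would need their own matching rules.

import java.io.IOException;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SuspiciousLogFilter {

    // Map-only job: each mapper scans its share of the logs in parallel and
    // writes out only the lines that look like failed logins.
    public static class FilterMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().toLowerCase(Locale.ROOT);
            if (line.contains("login failure") || line.contains("failed password")) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "suspicious log filter");
        job.setJarByClass(SuspiciousLogFilter.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);   // no reduce phase: the mapper output is the result
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // server logs loaded into HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}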
Although these are only a handful of Hadoop applications in
the real-world scenario, many more are yet to come. As the Big
Data use cases expand and Hadoop technology matures, we
will see more of such pioneering applications of Hadoop.
THANK YOU !!
