CASE STUDY On Application of Hadoop
Quantum University,
Mandawara (22 km milestone),
Roorkee-Dehradun Highway.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its
origin was the Google File System (GFS) paper that Google published in 2003.
INTRODUCTION TO HADOOP:
Because Hadoop can process and store such a wide assortment of data, it
enables organizations to set up data lakes as expansive reservoirs for
incoming streams of information. In a Hadoop data lake, raw data is often
stored as is so data scientists and other analysts can access the full data
sets, if need be; the data is then filtered and prepared by analytics or IT
teams, as needed, to support different applications.
The Hadoop 2.0 series of releases also added high availability and
federation features for HDFS, support for running Hadoop clusters on
Microsoft Windows servers, and other capabilities designed to expand the
distributed processing framework's versatility for big data management and
analytics.
Modules of Hadoop
1. HDFS (Hadoop Distributed File System): Google published its GFS
paper, and HDFS was developed on the basis of it. It states that
files will be broken into blocks and stored on nodes in the
distributed architecture.
2. YARN (Yet Another Resource Negotiator): used for job scheduling
and for managing the cluster.
3. MapReduce: a framework that helps Java programs do parallel
computation on data using key-value pairs. The Map task takes input
data and converts it into a data set that can be computed as key-value
pairs. The output of the Map task is consumed by the Reduce task, and
the output of the reducer gives the desired result. A word-count
sketch follows this list.
4. Hadoop Common: the Java libraries that are used to start Hadoop
and are used by other Hadoop modules.
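To make the Map and Reduce roles concrete, below is the classic word-count job written against the standard Hadoop MapReduce Java API; the input and output directories are passed in as arguments and are illustrative choices, not fixed paths.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: emit a (word, 1) key-value pair for every word in the split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one); // consumed later by the reducer
      }
    }
  }

  // Reduce task: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result); // final (word, total) pair
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it would typically be run as: hadoop jar wordcount.jar WordCount /input /output, where /input and /output are HDFS paths chosen for illustration.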
Hadoop Architecture:
At a high level, a Hadoop deployment combines HDFS for distributed storage, YARN for resource management and job scheduling, and MapReduce for parallel processing, with the Hadoop Common libraries underpinning all of them.
Advantages of Hadoop
o Fast: In HDFS, the data are distributed over the cluster and mapped,
which helps in faster retrieval. Even the tools to process the data are
often on the same servers, thus reducing the processing time. Hadoop is
able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended by just adding nodes to
the cluster.
o Cost-effective: Hadoop is open source and uses commodity hardware
to store data, so it is really cost-effective compared to a traditional
relational database management system.
o Resilient to failure: HDFS has the property that it can replicate
data over the network, so if one node is down or some other network
failure happens, Hadoop takes the other copy of the data and uses it.
Normally, data are replicated thrice, but the replication factor is
configurable; a configuration sketch follows this list.
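As a minimal sketch of that configuration, the replication factor is controlled by the dfs.replication property in hdfs-site.xml (the value 3 is shown here because it is HDFS's default):

<configuration>
  <property>
    <!-- Number of copies HDFS keeps of each block; the default is 3 -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

The factor can also be changed for files that already exist with the shell command hdfs dfs -setrep.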
Features of Hadoop
It is best-suited for Big Data analysis
Because Hadoop can process and store such a wide assortment of data,
it is well suited to Big Data analysis.
It is scalable
The best thing about Hadoop clusters is that you can scale them
to any extent by adding additional cluster nodes to the network,
without incorporating any modifications to application logic.
So, as the Big Data volume, variety, and velocity increase, you
can also scale the Hadoop cluster to accommodate the growing
data needs, as the sketch below shows.
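As a minimal sketch of that scaling step, assuming a Hadoop 3 cluster where worker hostnames are listed in etc/hadoop/workers (the hostname worker-node-4 is hypothetical), a new node is added without touching any application code:

# On the master: register the new node's hostname (worker-node-4 is illustrative)
echo "worker-node-4" >> $HADOOP_HOME/etc/hadoop/workers

# On the new node: start the HDFS storage daemon and the YARN compute daemon
hdfs --daemon start datanode
yarn --daemon start nodemanager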
It is fault-tolerant
HDFS keeps several replicas of every data block (normally three), so
the failure of a single node or a network fault does not stop a job;
Hadoop transparently switches to another copy of the data.
A typical application of these strengths is analyzing website visitor
data, where Hadoop can answer questions such as:
What search term did the visitor use that led to the website?
Which webpage did the visitor open first?