The solution for Big data
HADOOP
J. Sai Krishna and G. Sravya Lahari
2nd B.Tech (CSE)
K.O.R.M College of Engineering
Kadapa
Contents
1. Data trends in storing data
2. Big data problems in the IT industry
3. Introduction to HADOOP
4. HDFS (Hadoop Distributed File System)
5. MapReduce
6. Prominent users of Hadoop
7. Conclusion
Data trends in storing data
What is data? Any real-world symbol (character, numeric, special character)
or a group of them is said to be data. It may be visual, audio, textual, etc.
Big data
What is big data? In IT, it is a collection of data sets so large and complex
that it becomes difficult to process using on-hand database management tools
or traditional data processing applications.
As of 2012, limits on the size of data sets that are feasible to process in
reasonable time were on the order of exabytes of data.
BIG DATA and problems with it
About 0.5 petabytes of updates are made to Facebook every day, including 40 million photos.
Every day, YouTube receives enough new video to be watched continuously for one year.
Large data sets impose limitations in many areas, including meteorology, genomics,
complex physics simulations, and biological and environmental research.
They also affect Internet search, finance, and business informatics.
The challenges include capture, retrieval, storage, search, sharing, analysis, and visualization.
THEN WHAT COULD BE THE SOLUTION FOR BIG DATA?
HADOOP
What is Hadoop?
It is open-source software written in Java.
The Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple
programming models.
It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
The project includes these modules:
Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
1. Hadoop Common
It provides access to the filesystems supported by Hadoop.
The Hadoop Common package contains the JAR files and scripts needed to start Hadoop.
The package also provides source code, documentation, and a contribution section
that includes projects from the Hadoop community (Avro, Cassandra, Chukwa,
HBase, Hive, Mahout, Pig, ZooKeeper).
2. Hadoop Distributed File System (HDFS):
Hadoop uses HDFS, a distributed file system based on GFS (Google File System),
as its shared filesystem.
The HDFS architecture divides files into large chunks (~64 MB; the size is
configurable) that are distributed across data servers.
It has a namenode and datanodes.
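The chunking described above can be illustrated with a minimal sketch. It is not HDFS code: `split_into_blocks` and `assign_round_robin` are hypothetical helper names, the round-robin placement is an assumption made only for illustration (real HDFS placement also handles replication and rack awareness), and the 64 MB figure is the one quoted on this slide.

```python
# Illustrative only: how a file could be cut into fixed-size chunks and
# spread over datanodes. Not the HDFS implementation.
BLOCK_SIZE = 64 * 1024 * 1024  # ~64 MB, the chunk size quoted above (configurable)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_round_robin(blocks, datanodes):
    """Map each block index to a datanode by cycling through the list.

    Round-robin is an assumed placement policy for this sketch only.
    """
    if not datanodes:
        raise ValueError("at least one datanode is required")
    return {i: datanodes[i % len(datanodes)] for i in range(len(blocks))}
```

With these assumptions, a 200 MB file yields four blocks: three of 64 MB and a final one of 8 MB.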
What does HDFS contain?
HDFS consists of multiple namenodes (namespaces), and they are federated.
The datanodes are used as common storage for blocks by all the namenodes.
Each datanode registers with all the namenodes in the cluster.
Datanodes send periodic heartbeats and block reports, and handle commands
from the namenodes.
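The registration and heartbeat behaviour listed above can be pictured with a small in-memory model. This is a sketch under stated assumptions, not the HDFS protocol: the class `NameNodeSketch`, the helper `register_with_all`, and the 30-second timeout are all hypothetical and chosen only to make the idea concrete.

```python
# Toy model of namenodes tracking datanodes. All names and the timeout
# value are assumptions for illustration, not HDFS APIs or defaults.
class NameNodeSketch:
    def __init__(self, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}  # datanode id -> time of last heartbeat
        self.block_reports = {}   # datanode id -> set of block ids it holds

    def register(self, datanode_id, now):
        """Record a new datanode; registration counts as a first heartbeat."""
        self.last_heartbeat[datanode_id] = now
        self.block_reports.setdefault(datanode_id, set())

    def heartbeat(self, datanode_id, now):
        """Update the time at which this datanode was last heard from."""
        if datanode_id not in self.last_heartbeat:
            raise KeyError(f"unregistered datanode: {datanode_id}")
        self.last_heartbeat[datanode_id] = now

    def block_report(self, datanode_id, block_ids):
        """Replace the known block list for this datanode."""
        self.block_reports[datanode_id] = set(block_ids)

    def live_datanodes(self, now):
        """Datanodes heard from within the timeout window, sorted by id."""
        return sorted(
            node for node, seen in self.last_heartbeat.items()
            if now - seen <= self.timeout_seconds
        )


def register_with_all(datanode_id, namenodes, now):
    """Each datanode registers with every namenode in the cluster."""
    for namenode in namenodes:
        namenode.register(datanode_id, now)
```

In this model a datanode that stops sending heartbeats simply drops out of `live_datanodes` once the timeout has passed; what a real namenode does next (for example re-replicating blocks) is outside the sketch.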
Structure of Hadoop system:
MASTER NODE
Keeps track of the namespace and metadata about items
Keeps track of MapReduce jobs in the system
Hadoop is currently configured with centurion064 as the master node
Hadoop is installed locally on each system
The install location is /localtmp/hadoop/hadoop0.15.3
SLAVE NODES
Manage blocks of data sent from the master node
In GFS terms, these are the chunkservers
Currently centurion060 and centurion064 are the two slave nodes being used
Slave nodes store their data in /localtmp/hadoop/hadoop-dfs (this is created
automatically by the DFS)
Once you use the DFS, relative paths are from /usr/{your usr id}
Advantages and Limitations of HDFS
Advantages:
Reduces traffic on job scheduling.
File access can be achieved through the native Java API or a language of the
user's choice (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa,
Smalltalk, and OCaml).
Limitations:
It cannot be directly mounted by an existing operating system.
It requires a UNIX or Linux system.
3. Hadoop MAPREDUCE SYSTEM
MAP AND REDUCE METHODS USAGE
Map function
Reduce function
Run this program as a
MapReduce job
WORD COUNT OVER A GIVEN SET OF STRINGS
Input: "We love India", "We play tennis"
Map output: (We, 1), (love, 1), (India, 1), (We, 1), (play, 1), (tennis, 1)
Reduce output: We 2, love 1, India 1, play 1, tennis 1
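The word count above can be reproduced in a few lines of plain Python. This is a single-process sketch of the map, group, and reduce steps, not a Hadoop job; the function names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are chosen here for illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum the 1s collected for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())
```

Calling `word_count(["We love India", "We play tennis"])` gives a count of 2 for "We" and 1 for each of the other words, matching the slide.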
MAPREDUCE WITH NO REDUCE TASKS
MAPREDUCE WITH TWO REDUCE TASKS
AUTOMATIC PARALLEL EXECUTION IN MAPREDUCE
MapReduce - lifecycle
Input -> Splits -> Map phase (map function) -> Reduce phase (reduce function)
Shuffle and sort in MapReduce with multiple reduce tasks
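Shuffle and sort with several reduce tasks can be sketched as follows. The idea is that every key is routed to exactly one reducer and each reducer sees its keys in sorted order. This is an illustration only: `partition`, `shuffle_and_sort`, and `run_reducers` are hypothetical names, and a CRC32 checksum stands in for the framework's own key hashing so that the result is the same on every run.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer for a key. CRC32 is an assumed stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs, num_reducers):
    """Route each (key, value) pair to one reducer, then sort each reducer's keys."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    # Each partition becomes a key-sorted list of (key, [values]) entries.
    return [sorted(bucket.items()) for bucket in buckets]


def run_reducers(partitions):
    """Apply a summing reduce function to every key in every partition."""
    return [[(key, sum(values)) for key, values in part] for part in partitions]
```

Because all values for one key land in the same partition, the reducers can run in parallel without sharing state, and combining their outputs gives the same totals as a single reducer would.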
Prominent users of HADOOP
Amazon: 100 nodes
Facebook: two clusters of 8000 and 3000 nodes
Adobe: 80-node system
eBay: 532-node cluster
Yahoo!: cluster of about 4500 nodes
IIIT Hyderabad: 30-node cluster
Achievements
July 2008 - Hadoop wins the Terabyte Sort Benchmark
March 2011 - Apache Hadoop takes top prize at the Media Guardian Innovation Awards
Conclusion:
Hadoop eases the challenges of capture, storage, search, sharing, analysis,
and visualization.
A huge amount of data can be stored and large computations can be carried out
within a single cluster, safely and securely, at low cost.
Big data and big data solutions are among the burning issues in today's IT
industry, so working on them will surely make you more valuable to it.
Thank you
Any queries?