HADOOP ECOSYSTEM
Hadoop is a framework for processing Big Data, but unlike
most frameworks it is not a single, simple tool: it has
its own family of components for different processing tasks,
tied together under one umbrella known as the Hadoop Ecosystem.
Data is mainly categorized into 3 types under a Big Data platform.
Structured Data - Data which has a proper structure and which can be easily
stored in tabular form in any relational database like MySQL, Oracle etc. is
known as structured data. Example - employee data.
Semi-Structured Data - Data which has some structure but cannot be saved in
tabular form in relational databases is known as semi-structured data.
Example - XML data, email messages etc.
Unstructured Data - Data which does not have any structure and cannot be saved
in the tabular form of relational databases is known as unstructured data.
Example - video files, audio files, text files etc.
SQOOP : SQL + HADOOP = SQOOP
When we import structured data from a table (RDBMS) into
HDFS, a file is created in HDFS which we can process either by a
MapReduce program directly or by Hive or Pig.
FLUME
Flume is a distributed, reliable, and available system for efficiently
collecting, aggregating, and moving large amounts of log data from
many different sources to a centralized data store.
Flume can be used to transport massive quantities of event data
including, but not limited to, network traffic data,
social-media-generated data, email messages and pretty much any
data source possible.
HDFS
(HADOOP DISTRIBUTED FILE
SYSTEM)
HDFS is a main component of Hadoop and a
technique for storing data in a distributed
manner in order to compute quickly.
HDFS saves data in blocks of 128 MB (by default),
which is a logical splitting of data across
DataNodes (the physical storage of data) in a Hadoop
cluster (a formation of several DataNodes: a
collection of commodity hardware connected through a
single network).
All information about how data is split across DataNodes,
known as metadata, is kept in the NameNode, which is also
a part of HDFS.
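The logical block splitting described above can be sketched in plain Python (an illustration only; real block placement and replication are handled by the NameNode, and the 128 MB default is configurable):

```python
# Sketch: how a file is logically split into HDFS-style blocks.
# BLOCK_SIZE_MB matches the 128 MB default mentioned above.
BLOCK_SIZE_MB = 128

def split_into_blocks(file_size_mb):
    """Return the sizes of the blocks a file of the given size occupies."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE_MB, remaining))
        remaining -= BLOCK_SIZE_MB
    return blocks

# A 300 MB file occupies two full 128 MB blocks plus one 44 MB block.
print(split_into_blocks(300))  # [128, 128, 44]
```

Note that the last block is only as large as the remaining data; HDFS does not pad files out to a full block.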
YET ANOTHER RESOURCE NEGOTIATOR
(YARN) IN HADOOP 2.0
• Allocates resources for all scheduled tasks
• Two services
• Resource Manager (MASTER DAEMON)
• Manages resources and schedules applications running on top of
YARN.
• Node Manager (SLAVE DAEMON)
• Manages containers and monitors resource utilization in each
container.
MAPREDUCE FRAMEWORK
• It is another main component of Hadoop and a method of programming
against distributed data stored in HDFS.
• We can write a MapReduce program using any programming language, like
C, C++, Java, R or Python.
• A job is divided into n Map and n Reduce tasks. Map does the calculation and
Reduce aggregates it.
• Input and output are key/value pairs.
• A MapReduce program can be applied to any type of data, whether
structured or unstructured, stored in HDFS. Example - word count using
MapReduce.
• MAP = filtering, grouping and sorting
• REDUCE = aggregates and summarizes the result produced by the
map function
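The word-count example can be simulated in plain Python to show the map → shuffle/sort → reduce flow (a sketch only; a real MapReduce job runs distributed across the cluster, typically written in Java or via Hadoop Streaming):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # MAP: emit a (word, 1) key/value pair for each word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # REDUCE: aggregate all counts for one key.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle/sort phase: collect all map outputs and sort them by key.
pairs = sorted(kv for line in lines for kv in mapper(line))

# Reduce phase: one reducer call per distinct key.
result = dict(
    reducer(word, (count for _, count in group))
    for word, group in groupby(pairs, key=itemgetter(0))
)
print(result["the"])  # 3
```

In a real job, the framework performs the shuffle/sort between the Map and Reduce phases automatically; only the mapper and reducer are written by the programmer.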
HBASE
Hadoop Database, or HBase, is a non-relational (NoSQL)
database that runs on top of HDFS.
HBase was created for large tables which have billions of
rows and millions of columns, with fault-tolerance
capability and horizontal scalability, and is based on Google's
Bigtable.
Hadoop can perform only batch
processing, and data will be accessed only in a sequential
manner; for random access to huge data, HBase is used.
HIVE
Hive was created by Facebook and later donated to the
Apache Software Foundation.
Hive mainly deals with structured data which is stored in
HDFS, with a query language similar to SQL known
as HQL (Hive Query Language).
Hive also runs MapReduce programs in the backend to
process data in HDFS.
• 2 basic components
• Hive Command Line
• Java Database Connectivity (JDBC) and Open Database
Connectivity (ODBC)
• Supports User Defined Function (UDF) to accomplish specific
needs.
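HQL reads almost identically to standard SQL. As a rough illustration, the query below uses Python's built-in sqlite3 module, NOT Hive itself (the table name and data are made up); in Hive, the same SELECT would be compiled into MapReduce jobs over files in HDFS rather than running locally:

```python
import sqlite3

# Local stand-in table; in Hive this would be a table over HDFS files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("alice", "IT", 100), ("bob", "IT", 80), ("carol", "HR", 90)],
)

# An HQL query for average salary per department looks just like this SQL.
rows = sorted(
    conn.execute(
        "SELECT dept, AVG(salary) FROM employees GROUP BY dept"
    ).fetchall()
)
print(rows)  # [('HR', 90.0), ('IT', 90.0)]
```

This similarity to SQL is exactly why Hive lowered the barrier to entry for analysts who did not want to write MapReduce code directly.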
PIG
Similar to Hive, Pig also deals with structured data, using
the Pig Latin language.
Pig was originally developed at Yahoo! to answer a similar need
to Hive.
It is an alternative provided to programmers who love scripting
and don't want to use Java/Python or SQL to process data.
A Pig Latin program is made up of a series of operations, or
transformations, that are applied to the input data; it
runs MapReduce programs in the backend to produce the output.
• By Yahoo!
• 1 line of Pig Latin ≈ 100 lines of a MapReduce job
• The compiler internally converts Pig Latin to MapReduce
• It gives you a platform for building data flows for Extract, Transform and Load (ETL)
• Pig
• Loads the data
• Group
• Filter
• Join
• Sort
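The load/filter/group steps above can be sketched as a dataflow in plain Python (the records and field names are made up for illustration); in Pig Latin each step would be a single LOAD, FILTER, GROUP, or FOREACH statement compiled to MapReduce:

```python
from collections import defaultdict

# LOAD: (name, dept, score) records, stand-ins for data loaded from HDFS.
records = [
    ("alice", "IT", 90), ("bob", "IT", 70),
    ("carol", "HR", 80), ("dave", "HR", 60),
]

# FILTER: keep only records with a score above 65.
filtered = [r for r in records if r[2] > 65]

# GROUP BY department.
groups = defaultdict(list)
for name, dept, score in filtered:
    groups[dept].append(score)

# FOREACH ... GENERATE: aggregate each group.
averages = {dept: sum(scores) / len(scores) for dept, scores in groups.items()}
print(averages)  # {'IT': 80.0, 'HR': 80.0}
```

Each of these four steps corresponds to one line of Pig Latin, which is the sense in which one line of Pig Latin replaces many lines of hand-written MapReduce code.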
HIVE AND PIG ARCHITECTURE
[Diagram: Hive architecture vs. Pig dataflow.
Hive: CLI/UI/API (JDBC/ODBC) and Thrift clients (PHP/Perl/Python/C++/Java)
connect through the Thrift server; the HiveQL driver compiles queries using
the Metastore, and the execution engine submits MapReduce jobs to Hadoop.
Pig: Load -> Map (filter, local aggregation) -> Group/Distribute ->
Reduce (foreach, global aggregation) -> Store.]
MAHOUT
Mahout is an open-source machine learning library from
Apache, written in Java.
The algorithms it implements fall under the broad umbrella of
machine learning, or collective intelligence.
Mahout aims to be the machine learning tool of choice when
the collection of data to be processed is very large, perhaps far
too large for a single machine.
Tasks: predictive analysis (recommenders), clustering,
classification
OOZIE
It is a workflow scheduler system to manage Hadoop jobs.
Oozie is implemented as a Java web application that runs in
a Java servlet container.
Hadoop basically deals with Big Data, and a programmer often
wants to run many jobs in a sequential manner:
the output of job A is the input to job B, the output of
job B is the input to job C, and the final output is the
output of job C. To automate this sequence we need a
workflow, and to execute it we need an engine; Oozie serves
both purposes.
• OOZIE WORKFLOW
• Sequential set of actions to be executed.
• OOZIE COORDINATOR
• Oozie jobs which are triggered when data is made available to them, or
triggered based on time.
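The job-A → job-B chaining described above is exactly what an Oozie workflow definition expresses. A minimal sketch of such a workflow.xml (action names and EL variables like ${jobTracker} are placeholders, not from the original document; per-job configuration such as input/output paths is omitted):

```xml
<workflow-app name="chained-jobs" xmlns="uri:oozie:workflow:0.5">
    <start to="job-A"/>
    <action name="job-A">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- configuration omitted: job A writes its output
                 where job B expects its input -->
        </map-reduce>
        <ok to="job-B"/>      <!-- on success, run the next job -->
        <error to="fail"/>
    </action>
    <action name="job-B">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each action's ok/error transitions are what encode the sequence: the workflow only advances to job-B after job-A succeeds.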
ZOOKEEPER
• ZooKeeper is a centralized service for maintaining
configuration information, naming, providing distributed
synchronization, and providing group services.
• Writing distributed applications is difficult because partial
failures may occur between nodes; to overcome this, Apache
ZooKeeper was developed as an open-source
server which enables highly reliable distributed coordination.
• In case of any partial failure, clients can connect to any node
and be assured that they will receive the correct, up-to-date
information.
APACHE SPARK
• A framework for real-time data analytics
• Up to 100x faster than Hadoop MapReduce (for in-memory workloads)
• Supports standalone and distributed processing
HDFS FILE SYSTEM
• Creating a file system object
• PATH = HDFS object
• Hadoop-specific file system types
FEATURES OF HDFS
• Data Replication
• Data Resilience
• Data Integrity
• Maintaining Transaction logs
• Validating Checksums = A numerical value is assigned to a
transmitted message and used to verify the content of a file.
• Creating Data Blocks = DataNodes are also called block servers.
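The checksum idea can be shown in a few lines of Python using CRC32 (a simplified sketch; HDFS computes CRC checksums per fixed-size chunk of each block, not one per file):

```python
import zlib

def checksum(data: bytes) -> int:
    # Compute a CRC32 checksum over the content: a numerical value
    # derived from the bytes, used later to verify integrity.
    return zlib.crc32(data)

original = b"block contents stored on a DataNode"
stored_checksum = checksum(original)

# On a later read, recompute the checksum and compare with the stored one.
assert checksum(original) == stored_checksum                 # content intact
assert checksum(b"corrupted contents!") != stored_checksum   # mismatch detected
print("checksum verified")
```

Any change to the bytes almost certainly changes the CRC32 value, which is how corrupted blocks are detected and then re-replicated from a healthy copy.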
FUNCTIONS PERFORMED BY BLOCK SERVER
• Storage of data on a local file system.
• Storage of metadata of a block on the local file system, on the basis of a
similar template on the NameNode.
• Conduct of periodic validations of file checksums.
• Intimation about availability of blocks to the NameNode by sending reports
regularly.
• On-demand supply of metadata and data to clients
• Movement of data to connected nodes on the basis of pipelining model
• Rebalancing of data across DataNodes can be triggered with “sudo -u hdfs hdfs balancer”