Big Data
Kailash S
C-DAC
Chennai
Properties
Scalability
Data I/O Performance
Fault tolerance
Real-time processing
Data size supported
Iterative task support
Scalability
Horizontal scaling (scale out):
  Peer to peer: MPI
  Apache Hadoop: HDFS, YARN, MapReduce
  Spark
Vertical scaling (scale up)
Types of Analytics
Prescriptive Analytics
Predictive Analytics
Descriptive Analytics
Database
Sqoop (SQL + Hadoop): a tool for transferring bulk data between relational databases and Hadoop.
Terminology
Google calls it:    Hadoop equivalent:
MapReduce           Hadoop MapReduce
GFS                 HDFS
Bigtable            HBase
Chubby              ZooKeeper
HADOOP
What is Hadoop?
An open-source Apache framework for the distributed storage and processing of large data sets across clusters of commodity hardware.
Hadoop Architecture
[Diagram: input data is split into DFS blocks that are replicated across the Hadoop cluster; a Map task runs on each block and a Reduce step combines the map outputs into the final results.]
Architecture of Hadoop DB
What is HDFS?
The Hadoop Distributed File System: a fault-tolerant file system that stores large files as replicated blocks across the nodes of a cluster.
HDFS Architecture
Architecture
Hadoop is based on a master-slave architecture.
An HDFS cluster consists of a single NameNode (the master) and a number of DataNodes (the slaves).
[Diagram: an application on the HDFS client sees a local-file-system-style interface with small blocks (e.g. 2K), while HDFS itself stores data in large blocks (128M) that are replicated across DataNodes.]
HDFS Architecture
[Diagram: HDFS architecture. A client issues metadata operations (file name, replica count, e.g. /home/foo/data, 6, ...) to the NameNode, then reads blocks (B) directly from the DataNodes and writes blocks to them; blocks are replicated across DataNodes on different racks (Rack 1, Rack 2).]
MapReduce
MapReduce Architecture
Parallel Execution
Map
The master node takes the input, chops it up into
smaller sub-problems, and distributes those to
worker nodes.
A worker node may do this again in turn, leading
to a multi-level tree structure.
The worker node processes that smaller problem,
and passes the answer back to its master node.
Reduce
The master node then takes the answers to
all the sub-problems and combines them
in a way to get the output - the answer to
the problem it was originally trying to
solve.
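The master/worker flow described above can be sketched as a minimal word-count job in plain Python. The function names here are made up for illustration; real Hadoop jobs implement Mapper and Reducer classes against the Java API.

```python
from collections import defaultdict

def map_phase(chunk):
    """Worker step: turn one input chunk into (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(key, values):
    """Combine all values collected for one key into a final answer."""
    return key, sum(values)

def run_job(chunks):
    # Master: distribute chunks to "workers" and gather intermediate pairs.
    intermediates = defaultdict(list)
    for chunk in chunks:                 # each chunk goes to a worker node
        for key, value in map_phase(chunk):
            intermediates[key].append(value)
    # Master: combine the sub-answers into the final output.
    return dict(reduce_phase(k, v) for k, v in intermediates.items())

result = run_job(["big data big", "data cluster"])
# result == {"big": 2, "data": 2, "cluster": 1}
```

Here the map and reduce steps run sequentially in one process; in Hadoop each chunk would be processed by a separate worker node in parallel.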
[Diagram: the JobTracker on the master node distributes work to TaskTracker daemons on the slave nodes; each TaskTracker runs task instances. In the example, parse-hash and count tasks emit partitioned outputs P-0000 (count1), P-0001 (count2), and P-0002 (count3).]
[Diagram: MapReduce dataflow. Map tasks read input key/value pairs from data stores 1..n and emit (key, values) pairs; the pairs are grouped so that each key's intermediate values are collected together; reduce tasks then turn each key's intermediate values into the final values for that key.]
Creating mapper
[Diagram: creating mappers. The InputFormat divides the input files into InputSplits; a RecordReader turns each split into key/value records, which are fed to a Mapper instance; each Mapper emits intermediate pairs.]
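The split-and-read pipeline can be imitated in a few lines of Python. These are illustrative stand-ins only; Hadoop's real InputFormat, RecordReader, and Mapper classes live in the Java API.

```python
def input_splits(text, split_size):
    """InputFormat: chop the input into fixed-size splits."""
    return [text[i:i + split_size] for i in range(0, len(text), split_size)]

def record_reader(split):
    """RecordReader: present a split as (offset, record) key/value pairs.

    Real record readers realign to record boundaries; here we simply
    drop the empty fragment an arbitrary cut can leave behind.
    """
    return list(enumerate(r for r in split.split(";") if r))

def mapper(key, value):
    """Mapper: emit an intermediate pair, here (record, its length)."""
    return (value, len(value))

splits = input_splits("aa;bb;cc;dd", 6)          # ["aa;bb;", "cc;dd"]
intermediates = [mapper(k, v) for s in splits for k, v in record_reader(s)]
# intermediates == [("aa", 2), ("bb", 2), ("cc", 2), ("dd", 2)]
```

Note how the record separator, not the split boundary, decides where records begin and end; this is why Hadoop can split files at arbitrary byte offsets.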
Reading data
File input format and friends
Filtering file inputs
Record readers
Input split size
Sending data to reducers
Writable comparator
Sending data to client
Shuffling
[Diagram: each Mapper's intermediate pairs pass through a Partitioner, which assigns every pair to one Reducer; the per-reducer intermediates are then merged and handed to the Reducers.]
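The partitioning step can be sketched as follows. This mirrors the hash-partitioning idea behind Hadoop's default partitioner (the key's hash modulo the number of reducers); the function names are invented for the sketch.

```python
def partition(key, num_reducers):
    """Assign an intermediate key to one reducer, deterministically."""
    return hash(key) % num_reducers

def shuffle(intermediates, num_reducers):
    """Group intermediate pairs into per-reducer buckets before the reduce phase."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in intermediates:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

buckets = shuffle([("big", 1), ("data", 1), ("big", 1)], 2)
# Equal keys always land in the same bucket, so one reducer
# sees all of a key's values.
```

The essential property is not which bucket a key lands in, but that every occurrence of a key lands in the same bucket.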
Reduction
Output format
[Diagram: each Reducer writes its results through a RecordWriter supplied by the OutputFormat, producing one output file per reducer.]
HBase
A distributed, column-oriented database. Supports both batch-style computations using MapReduce and point queries (random reads).
Pig
A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
ZooKeeper
A distributed, highly available coordination service for building distributed applications.
Hive
Manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
Hive
A database/data warehouse built on top of Hadoop
Rich data types (structs, lists, and maps)
Efficient implementations of SQL filters, joins, and group-bys on top of MapReduce
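As a rough illustration of how a SQL group-by rides on MapReduce (the table and column names here are hypothetical), a query like SELECT dept, SUM(salary) ... GROUP BY dept boils down to mapping each row to a (dept, salary) pair and summing the values per key:

```python
from collections import defaultdict

rows = [("sales", 100), ("eng", 200), ("sales", 50)]

# Map: emit (group key, aggregated column) for every row.
pairs = [(dept, salary) for dept, salary in rows]

# Reduce: sum the values per key, as the generated MapReduce job would.
totals = defaultdict(int)
for dept, salary in pairs:
    totals[dept] += salary
# totals == {"sales": 150, "eng": 200}
```

Hive's planner generates jobs of essentially this shape, with the group-by column as the intermediate key.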
Hive Architecture
[Diagram: Hive architecture. Clients (the Hive CLI, a web UI, and the Thrift API) submit HiveQL for browsing, queries, and DDL; the Parser and Planner compile queries, the Execution layer runs them as MapReduce jobs over HDFS, the MetaStore holds table metadata, and SerDe libraries (Thrift, Jute, JSON) serialize and deserialize the data.]
Data Warehousing
Stream Computing
Spark
Developed at the University of California, Berkeley, as part of the Berkeley Data Analytics Stack.
Reduces disk I/O limitations through in-memory computation (claimed up to 100x faster than MapReduce for some workloads).
APIs in Java, Scala, and Python.
INSTALLATION OF PACKAGES
Linux packages can be installed two ways:
  Binary installation: manual, or automated via a package manager
  Source installation: manual
Either route covers the main package and its configuration.
Installing Hadoop
http://www.motorlogy.com/apachemirror//hadoop/core/hadoop0.20.2/hadoop-0.20.2.tar.gz
Prerequisites
JDK 1.5 or above
SSH
Configuring Hadoop
Steps to configure
Map the IP address of each node to its host name in the /etc/hosts file.
Add the host names of the master node and the data nodes to the slaves file.
Create an SSH RSA key on the master node and copy it to all the data nodes, so the master can log in to them without a password.
Configuring core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:8020</value>
  </property>
</configuration>
Configuring hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
Configuring mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:8021</value>
  </property>
</configuration>
Starting Hadoop
Format the NameNode with hadoop namenode -format, then start the daemons with start-all.sh. Basic HDFS shell commands:
hadoop fs -ls
Lists files on HDFS.
hadoop fs -mkdir abcd
Creates a directory on HDFS.