
Hadoop Project

Hardware Specifics:

The number of machines, and the specs of those machines, depend on a few factors:

 The volume of existing data and its monthly/yearly growth
 The data retention policy (how much data you can afford to keep before discarding it)
 The type of workload you have (data science/CPU-driven vs. "vanilla"/IO-bound use cases)
 The data storage mechanism (data container, type of compression used, if any)

Hardware Requirements

Network Considerations: Hadoop is very bandwidth-intensive! Often, all nodes are communicating with each other at the same time.

 Use dedicated switches for your Hadoop cluster
 Nodes are connected to a top-of-rack switch
 Nodes should be connected at a minimum speed of 1 Gb/sec
 For clusters where large amounts of intermediate data are generated, consider 10 Gb/sec connections; these are expensive, so an alternative is to bond two 1 Gb/sec connections to each node
 Racks are interconnected via core switches
 Core switches should connect to top-of-rack switches at 10 Gb/sec or faster
 Beware of oversubscription in top-of-rack and core switches (a quick check is sketched after this list)
 Consider bonded Ethernet to mitigate against failure
 Consider redundant top-of-rack and core switches
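To make the oversubscription warning concrete, here is a minimal Python sketch of the downlink-to-uplink bandwidth ratio for a top-of-rack switch; every number in it is an illustrative assumption, not a recommendation:

# Oversubscription = total node-facing bandwidth / total core-facing bandwidth.
# All values are illustrative assumptions -- substitute your own rack design.
nodes_per_rack = 40   # nodes attached to the top-of-rack switch (assumed)
node_link_gbps = 1    # per-node link speed (the 1 Gb/sec minimum above)
uplinks = 2           # uplinks from the rack to the core switches (assumed)
uplink_gbps = 10      # speed of each core uplink

ratio = (nodes_per_rack * node_link_gbps) / (uplinks * uplink_gbps)
print(f"Oversubscription ratio: {ratio:.1f}:1")  # 2.0:1 here; above 1:1 means contention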

Example hardware sizing:

Daily data input: 100 GB
HDFS replication factor: 3
Monthly growth: 5%
Intermediate MapReduce data: 25% per disk
Non-HDFS reserved space per disk: 30%
Size of a hard drive disk: 4 TB

Storage space used by daily data input = daily data input * replication factor = 300 GB
Monthly volume = (300 * 30) + 5% = 9450 GB
After one year = 9450 * (1 + 0.05)^12 = 16971 GB
Dedicated space per disk = HDD size * (1 – (Intermediate MapReduce data + Non-HDFS reserved space per disk)) = 4 * (1 – (0.25 + 0.30)) = 1.8 TB (which is the node capacity)

Number of DataNodes needed to process:

 Whole first month of data = 6 nodes
 The 12th month of data = 10 nodes
 Whole year of data = 88 nodes
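These node counts can be sanity-checked with a few lines of Python. The sketch below follows the document's own formulas; the convention of summing twelve compounded monthly volumes for the whole-year figure is inferred so that the result matches the 88 nodes quoted:

import math

daily_input_gb = 100                               # daily data input
replication = 3                                    # HDFS replication factor
growth = 0.05                                      # monthly growth
disk_gb = 4000                                     # one 4 TB hard drive
node_capacity_gb = disk_gb * (1 - (0.25 + 0.30))   # 1800 GB usable per node

month_gb = daily_input_gb * replication * 30 * (1 + growth)   # 9450 GB
month12_gb = month_gb * (1 + growth) ** 12                    # ~16971 GB
year_gb = sum(month_gb * (1 + growth) ** m for m in range(1, 13))

print(math.ceil(month_gb / node_capacity_gb))     # 6  (first month)
print(math.ceil(month12_gb / node_capacity_gb))   # 10 (12th month)
print(math.ceil(year_gb / node_capacity_gb))      # 88 (whole year)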

NameNode memory: 8 GB – 16 GB
Secondary NameNode memory: 8 GB – 16 GB
OS memory: 4 GB – 8 GB
HDFS memory: 4 GB – 8 GB

Memory amount = HDFS cluster management memory + NameNode memory + OS memory
At least, NameNode (or Secondary NameNode) memory = 8 + 8 + 4 = 20 GB
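The 20 GB floor is just the formula above filled in with the lower-bound figures (with the HDFS cluster management memory taken at the top of its range, as the document's arithmetic implies):

hdfs_mgmt_gb = 8   # HDFS cluster management memory (top of the 4-8 GB range)
namenode_gb = 8    # NameNode memory (bottom of the 8-16 GB range)
os_gb = 4          # OS memory (bottom of the 4-8 GB range)
print(hdfs_mgmt_gb + namenode_gb + os_gb)   # 20 GB minimum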

DataNode process memory: 8 GB – 16 GB
DataNode TaskTracker memory: 8 GB – 16 GB
OS memory: 8 GB – 16 GB
Number of CPU cores: 4+
Memory per CPU core: 4 GB – 8 GB

Memory amount = Memory per CPU core * number of CPU cores + DataNode process memory + DataNode TaskTracker memory + OS memory
At least, DataNode memory = 4 * 4 + 8 + 8 + 8 = 40 GB
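Likewise, the 40 GB DataNode floor plugs the minimum values into the formula above:

cores = 4             # minimum number of CPU cores
mem_per_core_gb = 4   # bottom of the 4-8 GB per-core range
datanode_gb = 8       # DataNode process memory
tasktracker_gb = 8    # DataNode TaskTracker memory
os_gb = 8             # OS memory
print(cores * mem_per_core_gb + datanode_gb + tasktracker_gb + os_gb)   # 40 GB minimum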

Operating System Recommendations

 RHEL or CentOS

Architecture review

Discuss key points that will dictate deployment decisions, including:

 Cluster redundancy
 Data storage formats (HDFS, Hive, HBase, other)
 Multitenancy
 Job scheduler (Is the Capacity Scheduler right for your needs?)
 File compression codec
 Schema design (structure of data stored in Hadoop: HDFS directory structures, output of data processing and analysis, schemas of objects stored in systems such as HBase and Hive)
 File format (plain text, SequenceFile, Avro, Parquet, ORC, etc.)
 Security model

> Confirm Hadoop projects to be configured

> Determine software layout on each server

> Confirm sizing and data requirements

Pre-installation

 > Determine installation type
 > Validate environment readiness
 > Install a Hadoop management tool (HMT) for provisioning, managing, and monitoring Hadoop clusters (e.g. Ambari and its agents)

Installation and deployment

 > Use the HMT to deploy to the agreed-upon architecture

Hadoop high-level overview

 > Provide an overview of each subsystem
 > Ensure the smoke test passes on each subsystem
 > Shut down and restart all services

Hadoop cluster tuning


 Review cluster tuning parameters
 Confirm benchmarking approach
 Generate test data using Teragen
 Run initial benchmarks for a baseline: TestDFSIO, HiBench (a sketch follows this list)
 Tune YARN
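As a rough sketch of the baseline run, the standard Teragen and TestDFSIO jobs can be driven from Python. The jar paths below are typical defaults and are assumptions; they vary by distribution:

import subprocess

# Assumed jar locations -- adjust to your distribution's layout.
EXAMPLES_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"
JOBCLIENT_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar"

# Teragen: 10 million 100-byte rows, i.e. roughly 1 GB of test data.
subprocess.run(["hadoop", "jar", EXAMPLES_JAR, "teragen",
                "10000000", "/benchmarks/teragen"], check=True)

# TestDFSIO write baseline: 10 files of 1000 MB each.
subprocess.run(["hadoop", "jar", JOBCLIENT_JAR, "TestDFSIO",
                "-write", "-nrFiles", "10", "-fileSize", "1000"], check=True)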

Hue installation

 > Install and configure Hue
 > Test Hue

High availability

 > Configure the HA NameNode pair
 > Test HA NameNode failover (a sketch follows this list)
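One way to exercise the HA pair is with hdfs haadmin. A minimal sketch, assuming the NameNode service IDs are nn1 and nn2 (check dfs.ha.namenodes.* in your hdfs-site.xml for the real IDs):

import subprocess

# Report which NameNode is active and which is standby.
for nn in ("nn1", "nn2"):
    subprocess.run(["hdfs", "haadmin", "-getServiceState", nn], check=True)

# Manually fail over from nn1 to nn2, then re-check the states above.
subprocess.run(["hdfs", "haadmin", "-failover", "nn1", "nn2"], check=True)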

Cluster management

 > Review adding/removing nodes and services
 Hadoop configuration backups:
 > Back up important site XML files
 > Back up key server configurations (e.g. the Ambari server)
 > Back up key agent configurations (e.g. Ambari agents)
 > Back up the Postgres database
 > Back up the Hive metastore (a sketch follows this list)
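A minimal backup sketch in Python; every path and database name here is an assumption (typical defaults for an Ambari-managed cluster) and should be checked against your own layout:

import glob
import os
import shutil
import subprocess
import time

dest = "/backup/hadoop-conf-" + time.strftime("%Y%m%d")   # assumed backup root
os.makedirs(dest, exist_ok=True)

# Important site XML files (core-site.xml, hdfs-site.xml, yarn-site.xml, ...).
for path in glob.glob("/etc/hadoop/conf/*-site.xml"):
    shutil.copy2(path, dest)

# Key Ambari server/agent configurations (typical default paths, assumed).
for path in ("/etc/ambari-server/conf/ambari.properties",
             "/etc/ambari-agent/conf/ambari-agent.ini"):
    if os.path.exists(path):
        shutil.copy2(path, dest)

# Postgres dump of the Ambari database; a Hive metastore dump is analogous.
subprocess.run(["pg_dump", "-U", "ambari", "-f", os.path.join(dest, "ambari.sql"),
                "ambari"], check=True)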

Cluster monitoring

 > Provide HMT alert overview (e.g. Ambari alerts)
 > Provide HMT metrics overview (e.g. Ambari metrics)

Documentation

 Complete build document (i.e. the architecture decisions made)
 Complete operational runbook
