Big Data
[Figure: sources of Big Data — web audio, logs, images, videos, and sensor data.]
Veracity
Big Data veracity refers to the biases, noise, and abnormality in data: is the data being stored and mined meaningful to the problem being analyzed? In scoping out your Big Data strategy, you need to keep the data clean and put processes in place to keep 'dirty data' from accumulating in your systems.
Common Big Data Customer Scenarios
Web and e-tailing
• Recommendation Engines
• Ad Targeting
• Search Quality
• Abuse and Click Fraud Detection
Telecommunications
• Customer Churn Prevention
• Network Performance Optimization
• Calling Data Record (CDR) Analysis
• Analyzing Network to Predict Failure
Government
• Fraud Detection and Cyber Security
• Welfare Schemes
• Justice
Healthcare & Life Sciences
• Health Information Exchange
• Gene Sequencing
• Serialization
• Healthcare Service Quality Improvements
• Drug Safety
Banks and Financial Services
• Modeling True Risk
• Threat Analysis
• Fraud Detection
• Trade Surveillance
• Credit Scoring and Analysis
Retail
• Point of Sales Transaction Analysis
• Customer Churn Analysis
• Sentiment Analysis
Hidden Treasure
Case Study: Sears Holding Corporation
• Insight into data can provide business advantage.
• Some key early indicators can mean fortunes to a business.
• More precise analysis is possible with more data.
• Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process its customer activity and sales data.
Limitations of Existing Data Analytics Architecture
[Figure: the existing (pre-Hadoop) analytics architecture, built up from instrumentation, and its limitations.]
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Solution: A Combined Storage & Compute Layer
[Figure: BI reports and interactive apps on top of an RDBMS holding aggregated data, with Hadoop underneath as a combined storage + compute grid fed by instrumentation and collection; the entire ~2 PB of data stays available for processing, writes are mostly append, and no data is archived.]
1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% that was available with the existing non-Hadoop solutions.
Hadoop Differentiating Factors
Accessible
Scalable
Why DFS?
Read 1 TB of data:
• 1 machine: 4 I/O channels, each channel 100 MB/s
• 10 machines: 4 I/O channels each, each channel 100 MB/s
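As a rough back-of-the-envelope calculation (assuming all channels can be driven in parallel at full speed and ignoring any overheads, which the original slide does not spell out):
1 machine: 1 TB ÷ (4 channels × 100 MB/s) = 1,000,000 MB ÷ 400 MB/s ≈ 2,500 s ≈ 42 minutes
10 machines: 2,500 s ÷ 10 ≈ 250 s ≈ 4 minutes
Spreading the same read across ten machines cuts the scan time by roughly a factor of ten, which is the motivation for a distributed file system.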
Hadoop is an open-source data management framework with scale-out storage and distributed processing.
Hadoop Eco-System
[Figure: the Hadoop ecosystem — the MapReduce framework and HBase sit on top of HDFS (Hadoop Distributed File System); Flume ingests unstructured or semi-structured data, while Sqoop imports or exports structured data.]
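As an illustration of the Sqoop import path mentioned above, a minimal sketch of pulling a relational table into HDFS is shown below; the JDBC URL, credentials, table name, and target directory are assumptions, not values from the original slides.
# Sketch: import the 'orders' table from a MySQL database into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username training --password training \
  --table orders \
  --target-dir /user/training/orders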
Hadoop Core Components
HDFS Admin Node Data Node Data Node Data Node Data Node
Cluster Name node
Hadoop Core Components
• Hadoop is a system for large-scale data processing.
• MapReduce (processing): splits a task across processors "near" the data and assembles the results; the JobTracker manages the TaskTrackers.
• HDFS (storage): self-healing, high-bandwidth clustered storage.
HDFS Definition
• HDFS is a distributed, scalable, and portable filesystem written in Java for
the Hadoop framework.
• HDFS creates multiple replicas of data blocks and distributes them on
compute nodes throughout a cluster to enable reliable, extremely rapid
computations.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
• HDFS provides high-throughput access to application data and is suitable
for applications that have large data sets.
• HDFS consists of the following components (daemons):
– Name Node
– Data Node
– Secondary Name Node
HDFS Components
NameNode:
• The NameNode, a master server, manages the file system namespace and regulates access to files by clients.
• Maintains and manages the blocks that are present on the DataNodes.
Metadata in memory
• – The entire metadata is kept in main memory
Types of metadata
• – List of files
• – List of blocks for each file
• – List of DataNodes for each block
• – File attributes, e.g. creation time, replication factor
A transaction log
• – Records file creations, file deletions, etc.
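To see the kind of metadata described above (the blocks of a file and the DataNodes holding them), the fsck utility can be pointed at a path; the file path used here is only an illustrative assumption.
# Show a file, its blocks, and the DataNodes each block lives on
hadoop fsck /user/training/hadoop/sample.txt -files -blocks -locations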
HDFS Components
DataNode:
• DataNodes, one per node in the cluster, manage the storage attached to the nodes that they run on.
A block server
• Stores data in the local file system (e.g. ext3)
• Stores metadata of a block (e.g. CRC)
• Serves data and metadata to clients
Block report
• Periodically sends a report of all existing blocks to the NameNode
Facilitates pipelining of data
• Forwards data to other specified DataNodes
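The "block server" role is visible directly on a DataNode's local file system: each block is stored as a plain file alongside a small .meta file holding its CRC checksums. The directory below is only the Hadoop 1.x default layout (it depends on dfs.data.dir and hadoop.tmp.dir), so treat the path as an assumption.
# List block files and their checksum (.meta) files on a DataNode
ls /tmp/hadoop-training/dfs/data/current
# blk_<id>   blk_<id>_<genstamp>.meta   VERSION   ...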
Understanding the File System
Block placement
• Current strategy:
  – One replica on the local node
  – Second replica on a remote rack
  – Third replica on the same remote rack (different node)
  – Additional replicas are placed randomly
• Clients read from the nearest replica
Data correctness
• Checksums (CRC32) are used to validate data
• File creation:
  – The client computes a checksum per 512 bytes
  – The DataNode stores the checksums
• File access:
  – The client retrieves the data and checksums from the DataNode
  – If validation fails, the client tries other replicas
Understanding the File System
Data pipelining
• The client retrieves a list of DataNodes on which to place replicas of a block
• The client writes the block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the pipeline
• When all replicas are written, the client moves on to write the next block in the file
Rebalancer
• Goal: the percentage of disk occupied should be similar across DataNodes
  – Usually run when new DataNodes are added
  – The cluster stays online while the Rebalancer is active
  – The Rebalancer is throttled to avoid network congestion
  – It is a command-line tool (see the example below)
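A minimal invocation of the balancer from the command line might look like this; the 10% threshold is just an illustrative choice, not a value from the original slides.
# Rebalance until every DataNode's utilization is within 10% of the cluster average
hadoop balancer -threshold 10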
Secondary NameNode
Secondary NameNode:
• Not a hot standby for the NameNode
• Connects to the NameNode every hour*
• Performs housekeeping and keeps a backup of the NameNode metadata
• The saved metadata can be used to rebuild a failed NameNode
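The "every hour" above comes from the checkpoint interval, which is configurable. A sketch of the relevant Hadoop 1.x setting (its default is 3600 seconds) is shown below; treat it as illustrative rather than a required change.
<!-- core-site.xml (sketch): how often the Secondary NameNode checkpoints, in seconds -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
</property>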
HDFS Architecture
[Figure: HDFS architecture — clients issue reads and writes to DataNodes spread across racks (Rack 1, Rack 2); block operations flow between the NameNode and the DataNodes, and blocks are replicated between DataNodes.]
Anatomy of A File Write
[Figure: anatomy of a file write — 1. create (HDFS client → DistributedFileSystem), 2. create (NameNode), 3. write, 6. close, 7. complete.]
Replication and Rack Awareness
Hadoop Cluster Architecture (Contd.)
Configuration files used by the client and the cluster:
• Core – core-site.xml
• HDFS – hdfs-site.xml
• MapReduce – mapred-site.xml
Hadoop Configuration Files
Configuration filename – Description
• hadoop-env.sh – Environment variables that are used in the scripts to run Hadoop.
• core-site.xml – Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
• hdfs-site.xml – Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.
• mapred-site.xml – Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.
• masters – A list of machines (one per line) that each run a secondary namenode.
• slaves – A list of machines (one per line) that each run a datanode and a tasktracker.
core-site.xml and hdfs-site.xml
<!--core-site.xml-->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>
<!--hdfs-site.xml-->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Defining HDFS Details in hdfs-site.xml
[Table fragment: HDFS storage directories default to subdirectories of ${hadoop.tmp.dir}, e.g. ${hadoop.tmp.dir}/dfs/namesecondary for the Secondary NameNode checkpoint data.]
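The original table did not survive extraction, but a sketch of the kind of hdfs-site.xml properties it likely covered is shown below; the property names are real Hadoop 1.x settings, while the concrete paths are only illustrative assumptions.
<!-- hdfs-site.xml (sketch, illustrative paths) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>   <!-- where the NameNode stores its metadata -->
    <value>/var/lib/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>   <!-- where DataNodes store block data -->
    <value>/var/lib/hadoop/dfs/data</value>
  </property>
</configuration>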
Defining mapred-site.xml
Property: mapred.job.tracker
Value: localhost:8021
Description: The hostname and the port that the jobtracker RPC server runs on. If set to the default value of "local", the jobtracker runs in-process on demand when you run a MapReduce job.
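Expressed in the same format as the core-site.xml and hdfs-site.xml examples above, that single property would look roughly like this:
<!--mapred-site.xml-->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>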
Slaves
Contains a list of hosts, one per line, that are to host DataNode and
TaskTracker servers.
Masters
Contains a list of hosts, one per line, that are to host Secondary
NameNode servers.
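For illustration, both files are plain host lists, one hostname per line; the hostnames below are made up.
slaves (hosts that run a DataNode and a TaskTracker):
worker01.example.com
worker02.example.com
worker03.example.com
masters (host that runs the Secondary NameNode):
snn01.example.com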
All Properties
1. http://hadoop.apache.org/docs/r1.1.2/core-default.html
2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html
Web UI URLs
NameNode status:
http://localhost:50070/dfshealth.jsp
JobTracker status:
http://localhost:50030/jobtracker.jsp
TaskTracker status:
http://localhost:50060/tasktracker.jsp
Shared Storage HA (Hardware resources)
• NameNode machines - the machines on which you run the Active and Standby
NameNodes should have equivalent hardware to each other, and equivalent
hardware to what would be used in a non-HA cluster.
• Shared storage - you will need to have a shared directory which both NameNode
machines can have read/write access to. Typically this is a remote filer which
supports NFS and is mounted on each of the NameNode machines.
• Currently only a single shared edits directory is supported. Because of this, it is
recommended that the shared storage server be a high-quality dedicated NAS
appliance rather than a simple Linux server.
• Note that, in an HA cluster, the Standby NameNode also performs checkpoints of
the namespace state, and thus it is not necessary to run a Secondary NameNode,
CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an
error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to
be HA-enabled to reuse the hardware which they had previously dedicated to the
Secondary NameNode.
QJM-HA (Architecture)
• In a typical HA cluster, two separate machines are configured as NameNodes. At
any point in time, exactly one of the NameNodes is in an Active state, and the
other is in a Standby state. The Active NameNode is responsible for all client
operations in the cluster, while the Standby is simply acting as a slave, maintaining
enough state to provide a fast failover if necessary.
• In order for the Standby node to keep its state synchronized with the Active node,
both nodes communicate with a group of separate daemons called
“JournalNodes” (JNs). When any namespace modification is performed by the
Active node, it durably logs a record of the modification to a majority of these JNs.
The Standby node is capable of reading the edits from the JNs, and is constantly
watching them for changes to the edit log. As the Standby Node sees the edits, it
applies them to its own namespace. In the event of a failover, the Standby will
ensure that it has read all of the edits from the JournalNodes before promoting
itself to the Active state. This ensures that the namespace state is fully
synchronized before a failover occurs.
• In order to provide a fast failover, it is also necessary that the Standby node have
up-to-date information regarding the location of blocks in the cluster. In order to
achieve this, the DataNodes are configured with the location of both NameNodes,
and send block location information and heartbeats to both.
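As a rough illustration of how this is wired up, the shared edits directory points at the quorum of JournalNodes; the nameservice ID "mycluster" and the JournalNode hostnames below are assumptions, not values from the original slides.
<!-- hdfs-site.xml (sketch): edits are written to a quorum of JournalNodes -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>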
QJM (Hardware resources)
• NameNode machines - the machines on which you run the Active and
Standby NameNodes should have equivalent hardware to each other, and
equivalent hardware to what would be used in a non-HA cluster.
• JournalNode machines - the machines on which you run the
JournalNodes. The JournalNode daemon is relatively lightweight, so these
daemons may reasonably be collocated on machines with other Hadoop
daemons, for example NameNodes, the JobTracker, or the YARN
ResourceManager. Note: There must be at least 3 JournalNode daemons,
since edit log modifications must be written to a majority of JNs. This will
allow the system to tolerate the failure of a single machine. You may also
run more than 3 JournalNodes, but in order to actually increase the
number of failures the system can tolerate, you should run an odd number
of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the
system can tolerate at most (N - 1) / 2 failures and continue to function
normally.
Automatic failover
• Without automatic failover configured, the system will not trigger a failover from the
active to the standby NameNode on its own, even if the active node has failed; an
administrator has to initiate the transition manually.
• Automatic failover adds two new components to an HDFS deployment: a
ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
• Apache ZooKeeper is a highly available service for maintaining small amounts of
coordination data, notifying clients of changes in that data, and monitoring clients
for failures. The implementation of automatic HDFS failover relies on ZooKeeper
for the following things:
• Failure detection - each of the NameNode machines in the cluster maintains a
persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will
expire, notifying the other NameNode that a failover should be triggered.
• Active NameNode election - ZooKeeper provides a simple mechanism to
exclusively elect a node as active. If the current active NameNode crashes, another
node may take a special exclusive lock in ZooKeeper indicating that it should
become the next active.
• The ZKFailoverController (ZKFC) is a new component: a ZooKeeper client that also
monitors and manages the state of the NameNode.
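Configuring automatic failover boils down to two settings, shown here as a sketch; the ZooKeeper hostnames are assumptions.
<!-- hdfs-site.xml: turn on automatic failover -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<!-- core-site.xml: the ZooKeeper quorum used by the ZKFCs -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>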
Practice Session
HDFS Commands
• Print the Hadoop version
hadoop version
• List the contents of the root directory in HDFS
hadoop fs -ls /
• Report the amount of space used and
available on the currently mounted filesystem
hadoop fs -df hdfs:/
HDFS Commands
• Count number of directories, files and bytes
for paths that match specified file pattern
hadoop fs -count hdfs:/
• Run the DFS filesystem checking utility
hadoop fsck /
• Run a cluster balancing utility
hadoop balancer
HDFS Commands
• Create a new directory named “hadoop”
below the /user/training directory in HDFS.
hadoop fs -mkdir /user/training/hadoop
• Add a sample text file from the local directory
named "data" to the new directory created in
HDFS
hadoop fs -put data/sample.txt
/user/training/hadoop
HDFS Commands
• List the contents of this new directory in HDFS.
hadoop fs -ls /user/training/hadoop
• Add the entire local directory called “retail” to
the hadoop directory under /user/training in HDFS.
hadoop fs -put data/retail /user/training/hadoop
• See how much space this directory occupies in
HDFS.
hadoop fs -du -s -h hadoop/retail
HDFS Commands
• Move a directory from one location to other
hadoop fs -mv hadoop apache_hadoop
• Delete a file ‘customers’ from the “retail”
directory.
hadoop fs -rm hadoop/retail/customers
• Ensure this file is no longer in HDFS.
hadoop fs -ls hadoop/retail/customers
• Delete all files from the “retail” directory using a
wildcard.
hadoop fs -rm hadoop/retail/*
HDFS Commands
• To empty the trash
hadoop fs -expunge
• Finally, remove the entire retail directory and all
of its contents in HDFS.
hadoop fs -rm -r hadoop/retail
• Add the purchases.txt file from the local directory
named “/home/training/” to the hadoop
directory you created in HDFS
hadoop fs -copyFromLocal
/home/training/purchases.txt hadoop/
HDFS Commands
• To view contents of your text file purchases.txt
which is present in your hadoop directory.
hadoop fs -cat hadoop/purchases.txt
• Add purchases.txt file from HDFS directory to the
local directory
hadoop fs -copyToLocal hadoop/purchases.txt
/home/training/data
• Use ‘-chown’ to change owner name and group
name simultaneously
sudo -u hdfs hadoop fs -chown root:root
hadoop/purchases.txt
HDFS Commands
• Use the ‘-setrep’ command to change the replication
factor of a file (the default is 3)
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
• Command to make the NameNode leave safe
mode
sudo -u hdfs hdfs dfsadmin -safemode leave
HDFS Commands
Parallel Copying with distcp
• The HDFS access patterns that we have seen so far focus on single-
threaded access.
• Hadoop comes with a useful program called distcp for copying data to and
from Hadoop filesystems in parallel.
• distcp is an efficient replacement for hadoop fs -cp.
• Example,
hadoop distcp file1 file2
• You can also copy directories:
hadoop distcp dir1 dir2
(If dir2 does not exist, it will be created, and the contents of the dir1
directory will be copied there. You can specify multiple source paths, and all
will be copied to the destination. If dir2 already exists, then dir1 will be
copied under it, creating the directory structure dir2/dir1.)
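distcp is most often used to copy between clusters or between large directory trees. A sketch of an inter-cluster copy is shown below; the NameNode hostnames and paths are assumptions.
# Copy /source/logs from one cluster to /backup/logs on another, in parallel via MapReduce
hadoop distcp hdfs://namenode1:8020/source/logs hdfs://namenode2:8020/backup/logs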
Thank You
• Questions?
• Feedback?