Big Data

Big data refers to large, complex datasets that are difficult to process using traditional database management tools or data processing applications. The challenges include capturing, storing, searching, sharing, transferring, analyzing, and visualizing large amounts of data. Hadoop is an open-source framework that allows distributed processing of large datasets across clusters of computers. It uses HDFS for storage and MapReduce as its processing layer. HDFS stores data across clusters as blocks and uses namenodes and datanodes to manage the file system and storage.


What is Big Data?

Lots of Data (Terabytes or Petabytes)


Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management
tools or traditional data processing applications.
The challenges include capture, storage, search, sharing, transfer,
analysis, and visualization.
NYSE generates about one terabyte of new trade data per day to perform
stock trading analytics to determine trends for optimal trades.
3 V's of Big Data

Volume, Velocity, and Variety. Examples of variety: web logs, audio, images, videos, and sensor data.

Veracity
Big data veracity refers to the biases, noise, and abnormality in data: is the data being stored and mined actually meaningful to the problem being analyzed? When scoping out a big data strategy, one needs to keep the data clean and put processes in place to keep ‘dirty data’ from accumulating in the systems.
Common Big Data Customer Scenarios

Web and e-tailing
  Recommendation Engines
  Ad Targeting
  Search Quality
  Abuse and Click Fraud Detection
Telecommunications
  Customer Churn Prevention
  Network Performance Optimization
  Calling Data Record (CDR) Analysis
  Analyzing Network to Predict Failure
Government
  Fraud Detection and Cyber Security
  Welfare Schemes
  Justice
Healthcare & Life Sciences
  Health Information Exchange
  Gene Sequencing
  Serialization
  Healthcare Service Quality Improvements
  Drug Safety
Banks and Financial Services
  Modeling True Risk
  Threat Analysis
  Fraud Detection
  Trade Surveillance
  Credit Scoring and Analysis
Retail
  Point of Sales Transaction Analysis
  Customer Churn Analysis
  Sentiment Analysis
Hidden Treasure
Insight into data can provide • Case Study: Sears Holding Corporation

Business Advantage.
Some key early indicators can
mean Fortunes to Business.
More Precise Analysis with
more data. X
Sears was using traditional
systems such as Oracle
Exadata, Teradata and SAS
etc. to store and process the
customer activity and sales
data.
Limitations of Existing Data Analytics Architecture

In the pre-Hadoop pipeline, data flows from Instrumentation to Collection, into a storage-only grid holding the original raw data (mostly append), through an ETL compute grid, and into an RDBMS holding only aggregated data that feeds BI reports and interactive apps. About 90% of the ~2 PB ends up in archived storage, and only a meagre 10% is available for BI.

Problems with this architecture:
1. Can't explore the original high-fidelity raw data.
2. Moving data to compute doesn't scale.
3. Premature data death: data is archived before it can be fully analyzed.

http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Solution: A Combined Storage and Compute Layer

With Hadoop as a combined storage + compute grid, data still flows from Instrumentation to Collection (mostly append), but the entire ~2 PB is kept in Hadoop for both storage and processing; nothing is archived away. Aggregated data can still be pushed to the RDBMS for BI reports and interactive apps.

Benefits:
1. Data exploration and advanced analytics on the full raw data.
2. Scalable throughput for ETL and aggregation.
3. Data is kept alive forever.

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the earlier non-Hadoop solution.
Hadoop Differentiating Factors

Accessible
Simple
Robust
Scalable
Why DFS?

Read 1 TB of data:

1 machine: 4 I/O channels, each channel 100 MB/s -> 45 minutes
10 machines: 4 I/O channels each, 100 MB/s per channel -> 4.5 minutes
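As a quick sanity check on these figures: 1 TB is roughly 1,000,000 MB, and one machine reads at 4 x 100 MB/s = 400 MB/s, so a full scan takes about 1,000,000 / 400 = 2,500 seconds, i.e. roughly 42 minutes (the ~45 minutes quoted above). Ten machines reading in parallel provide about 4,000 MB/s in aggregate, cutting the scan to about 250 seconds, i.e. roughly 4.2 minutes (~4.5 minutes).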


What is Hadoop?
Apache Hadoop is a
framework that allows
for the distributed
processing of large data
sets across clusters of
commodity computers
using a simple
programming model.

It is an open-source data management framework with scale-out storage and distributed processing.
Hadoop Eco-System

Apache Oozie (Workflow)
Hive - DW System
Pig Latin - Data Analysis
Mahout - Machine Learning
MapReduce Framework
HBase
HDFS (Hadoop Distributed File System)
Flume - import of unstructured or semi-structured data
Sqoop - import or export of structured data
Hadoop Core Components

MapReduce Engine: a Job Tracker that coordinates multiple Task Trackers.
HDFS Cluster: a Name Node (admin node) that manages multiple Data Nodes.
Hadoop Core Components
• Hadoop is a system for large-scale data processing. It has two main components:

HDFS - Hadoop Distributed File System (Storage)
  Distributed across “nodes”
  Natively redundant
  NameNode tracks locations

MapReduce (Processing)
  Splits a task across processors “near” the data and assembles results
  Self-healing, high bandwidth
  Clustered storage
  JobTracker manages the TaskTrackers
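A minimal sketch of bringing these two layers up on a Hadoop 1.x installation (assuming the start scripts are on the PATH and the cluster is already configured):

# Start the HDFS daemons (NameNode, DataNodes, SecondaryNameNode)
start-dfs.sh

# Start the MapReduce daemons (JobTracker, TaskTrackers)
start-mapred.sh

# List the Hadoop Java daemons running on this machine
jps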
HDFS Definition
• HDFS is a distributed, scalable, and portable filesystem written in Java for
the Hadoop framework.
• HDFS creates multiple replicas of data blocks and distributes them on
compute nodes throughout a cluster to enable reliable, extremely rapid
computations.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
• HDFS provides high throughput access to application data and is suitable
for applications that have large data sets
• HDFS consists of following components (daemons)
– Name Node
– Data Node
– Secondary Name Node
HDFS Components
Namenode:
•The NameNode, a master server, manages the file system namespace and regulates access to files by clients.
•Maintains and manages the blocks that are present on the datanodes.

Meta-data in Memory
  The entire metadata is kept in main memory
Types of Metadata
  List of files
  List of blocks for each file
  List of DataNodes for each block
  File attributes, e.g. creation time, replication factor
A Transaction Log
  Records file creations, file deletions, etc.
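The block metadata the NameNode maintains can be inspected from the shell; a hedged example (the path is illustrative, and the command may need HDFS superuser privileges):

# Show files, blocks, and block locations for everything under /user/training
hadoop fsck /user/training -files -blocks -locations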
HDFS Components
Data Node:
•DataNodes, one per node in the cluster, manage the storage attached to the nodes they run on.

A Block Server
  Stores data in the local file system (e.g. ext3)
  Stores meta-data of a block (e.g. CRC)
  Serves data and meta-data to clients
Block Report
  Periodically sends a report of all existing blocks to the NameNode
Facilitates Pipelining of Data
  Forwards data to other specified DataNodes
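To see what the DataNodes are reporting back to the NameNode, a hedged example (on Hadoop 2.x the equivalent command is hdfs dfsadmin -report):

# Print capacity, used space, and last-contact time for every DataNode in the cluster
hadoop dfsadmin -report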
Understanding the File system
Block placement
• Current Strategy
One replica on local node
Second replica on a remote rack
Third replica on same remote rack
Additional replicas are randomly placed
• Clients read from nearest replica

Data Correctness
• Use Checksums to validate data
Use CRC32
• File Creation
Client computes a checksum for every 512 bytes
DataNode stores the checksum
• File access
Client retrieves the data and checksum from DataNode
If Validation fails, Client tries other replicas
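A hedged illustration of checksums from the shell (the -checksum option is available on Hadoop 2.x and later; the path is illustrative):

# Print the file-level checksum HDFS derives from the per-block CRC32 checksums
hadoop fs -checksum /user/training/hadoop/sample.txt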
Understanding the File system
Data pipelining
Client retrieves a list of DataNodes on which to place replicas of a block
Client writes block to the first DataNode
The first DataNode forwards the data to the next DataNode in the Pipeline
When all replicas are written, the Client moves on to write the next block in file

Rebalancer
• Goal: % of disk occupied on Datanodes should be similar
Usually run when new Datanodes are added
Cluster is online when Rebalancer is active
Rebalancer is throttled to avoid network congestion
Command line tool
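A hedged example of invoking the rebalancer (the -threshold flag sets how far, in percentage points, a DataNode's utilization may deviate from the cluster average):

# Rebalance until every DataNode is within 5 percentage points of the cluster average
hadoop balancer -threshold 5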
Secondary NameNode
Secondary NameNode:
Not a hot standby for the NameNode
Connects to the NameNode periodically (every hour by default)
Housekeeping, backup of NameNode metadata
Saved metadata can build a failed NameNode
HDFS Architecture

[Architecture diagram] The NameNode holds the filesystem metadata (name, replicas, ...). Clients send metadata operations to the NameNode and perform block read/write operations directly against DataNodes. DataNodes are distributed across racks (Rack 1, Rack 2) and replicate blocks among themselves.
Anatomy of a File Write

1. The HDFS client calls create on the DistributedFileSystem.
2. The DistributedFileSystem asks the NameNode to create the file.
3. The client writes data to the output stream.
4. Data packets are written along a pipeline of DataNodes.
5. Acknowledgement packets flow back up the pipeline.
6. The client closes the stream.
7. The DistributedFileSystem notifies the NameNode that the write is complete.
Anatomy of a File Read

1. The HDFS client calls open on the DistributedFileSystem.
2. The DistributedFileSystem gets the block locations from the NameNode.
3. The client reads from the FSDataInputStream.
4, 5. Data is read directly from the DataNodes that hold the blocks, nearest replica first.
6. The client closes the stream.
Replication and Rack Awareness
Blocks are replicated across DataNodes on different racks, following the block placement strategy described earlier.

Hadoop Cluster Architecture (Contd.)

The client talks to two master services: the Name Node (HDFS) and the Job Tracker (MapReduce). Each slave machine runs both a Data Node and a Task Tracker, so tasks can be scheduled close to the data they need.
Hadoop Cluster: A Typical Use Case

Name Node: RAM 64 GB, hard disk 1 TB, Xeon processor with 8 cores, Ethernet 3 x 10 Gb/s, 64-bit CentOS, redundant power supply.

Secondary Name Node: RAM 32 GB, hard disk 1 TB, Xeon processor with 4 cores, Ethernet 3 x 10 Gb/s, 64-bit CentOS, redundant power supply.

Data Nodes: RAM 16 GB, hard disks 6 x 2 TB, Xeon processor with 2 cores, Ethernet 3 x 10 Gb/s, 64-bit CentOS.
Hadoop Cluster: Facebook
Hadoop Cluster Modes
Hadoop can run in three modes: standalone (local) mode, pseudo-distributed mode (all daemons on one machine), and fully distributed mode (daemons spread across the cluster).
Hadoop 1.x: Core Configuration Files

Core: core-site.xml
HDFS: hdfs-site.xml
MapReduce: mapred-site.xml
Hadoop Configuration Files

Configuration Filename - Description

hadoop-env.sh - Environment variables that are used in the scripts to run Hadoop.

core-site.xml - Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

hdfs-site.xml - Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.

mapred-site.xml - Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.

masters - A list of machines (one per line) that each run a secondary namenode.

slaves - A list of machines (one per line) that each run a datanode and a tasktracker.
core-site.xml and hdfs-site.xml

core-site.xml:

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>

hdfs-site.xml:

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
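A hedged way to confirm which values are actually in effect at runtime (the hdfs getconf tool ships with Hadoop 2.x and later):

# Print the effective default filesystem and replication factor
hdfs getconf -confKey fs.default.name
hdfs getconf -confKey dfs.replication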
Defining HDFS Details in hdfs-site.xml

dfs.data.dir
  Example value: /disk1/hdfs/data, /disk2/hdfs/data
  Default: ${hadoop.tmp.dir}/dfs/data
  A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.

fs.checkpoint.dir
  Example value: /disk1/hdfs/namesecondary, /disk2/hdfs/namesecondary
  Default: ${hadoop.tmp.dir}/dfs/namesecondary
  A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
Defining mapred-site.xml

mapred.job.tracker
  Example value: localhost:8021
  The hostname and port that the jobtracker RPC server runs on. If set to the default value of "local", the jobtracker runs in-process on demand when you run a MapReduce job.

mapred.local.dir
  Default: ${hadoop.tmp.dir}/mapred/local
  A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

mapred.system.dir
  Default: ${hadoop.tmp.dir}/mapred/system
  The directory, relative to fs.default.name, where shared files are stored during a job run.

mapred.tasktracker.map.tasks.maximum
  Default: 2
  The number of map tasks that may be run on a tasktracker at any one time.

mapred.tasktracker.reduce.tasks.maximum
  Default: 2
  The number of reduce tasks that may be run on a tasktracker at any one time.
Slaves and Masters
Two files are used by the startup and shutdown commands:

Slaves

Contains a list of hosts, one per line, that are to host DataNode and
TaskTracker servers.

Masters

Contains a list of hosts, one per line, that are to host Secondary
NameNode servers.
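A hedged illustration of what these files look like, using hypothetical hostnames and the conventional conf/ directory:

# masters - one host per line; runs the Secondary NameNode
cat conf/masters
snn-host-01

# slaves - one host per line; each runs a DataNode and a TaskTracker
cat conf/slaves
slave-host-01
slave-host-02
slave-host-03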

All Properties

1. http://hadoop.apache.org/docs/r1.1.2/core-default.html
2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html
Web UI URLs
NameNode status:
http://localhost:50070/dfshealth.jsp

JobTracker status:
http://localhost:50030/jobtracker.jsp

TaskTracker status:
http://localhost:50060/tasktracker.jsp

DataBlock Scanner Report:
http://localhost:50075/blockScannerReport
Coherency Model
• HDFS supports a coherency model for a filesystem that describes the
data visibility of reads and writes for a file.
• After creating a file, it is visible in the filesystem namespace, as
expected. However, any content written to the file is not guaranteed to be
visible, even if the stream is flushed. So the file appears to have a length
of zero until more than the first block’s worth of data has been written.
• In HDFS, once more than a block’s worth of data has been written, the
first block will be visible to new readers. This is true of subsequent
blocks, too: it is always the current block being written that is not
visible to other readers.
• HDFS provides a way to force all buffers to be flushed to the datanodes
via the hflush() method on FSDataOutputStream. After a successful return
from hflush(), HDFS guarantees that the data written up to that point in
the file has reached all the datanodes in the write pipeline and is visible
to all new readers.
HDFS High Availability
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in
an HDFS cluster. Each cluster had a single NameNode, and if that machine or
process became unavailable, the cluster as a whole would be unavailable until
the NameNode was either restarted or brought up on a separate machine.
• This impacted the total availability of the HDFS cluster in two major ways:
In the case of an unplanned event such as a machine crash, the cluster would
be unavailable until an operator restarted the NameNode.
Planned maintenance events such as software or hardware upgrades on the
NameNode machine would result in windows of cluster downtime.
The HDFS High Availability feature addresses the above problems by providing
the option of running two redundant NameNodes in the same cluster in an
Active/Passive configuration with a hot standby.
• HDFS HA is possible by using:
NFS for shared storage
Quorum Journal Manager (QJM)
NFS-HA (Architecture)
• In a typical HA cluster, two separate machines are configured as NameNodes. At
any point in time, exactly one of the NameNodes is in an Active state, and the
other is in a Standby state. The Active NameNode is responsible for all client
operations in the cluster, while the Standby is simply acting as a slave, maintaining
enough state to provide a fast failover if necessary.
• When any namespace modification is performed by the Active node, it durably
logs a record of the modification to an edit log file stored in the shared directory.
The Standby node is constantly watching this directory for edits, and as it sees the
edits, it applies them to its own namespace. In the event of a failover, the Standby
will ensure that it has read all of the edits from the shared storage before
promoting itself to the Active state. This ensures that the namespace state is fully
synchronized before a failover occurs.
• In order to provide a fast failover, it is also necessary that the Standby node have
up-to-date information regarding the location of blocks in the cluster. In order to
achieve this, the DataNodes are configured with the location of both NameNodes,
and send block location information and heartbeats to both.
NFS for shared storage (Hardware resources)

• In order to deploy an HA cluster, you should prepare the following:

• NameNode machines - the machines on which you run the Active and Standby
NameNodes should have equivalent hardware to each other, and equivalent
hardware to what would be used in a non-HA cluster.
• Shared storage - you will need to have a shared directory which both NameNode
machines can have read/write access to. Typically this is a remote filer which
supports NFS and is mounted on each of the NameNode machines.
• Currently only a single shared edits directory is supported. Because of this, it is
recommended that the shared storage server be a high-quality dedicated NAS
appliance rather than a simple Linux server.
• Note that, in an HA cluster, the Standby NameNode also performs checkpoints of
the namespace state, and thus it is not necessary to run a Secondary NameNode,
CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an
error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to
be HA-enabled to reuse the hardware which they had previously dedicated to the
Secondary NameNode.
QJM-HA (Architecture)
• In a typical HA cluster, two separate machines are configured as NameNodes. At
any point in time, exactly one of the NameNodes is in an Active state, and the
other is in a Standby state. The Active NameNode is responsible for all client
operations in the cluster, while the Standby is simply acting as a slave, maintaining
enough state to provide a fast failover if necessary.
• In order for the Standby node to keep its state synchronized with the Active node,
both nodes communicate with a group of separate daemons called
“JournalNodes” (JNs). When any namespace modification is performed by the
Active node, it durably logs a record of the modification to a majority of these JNs.
The Standby node is capable of reading the edits from the JNs, and is constantly
watching them for changes to the edit log. As the Standby Node sees the edits, it
applies them to its own namespace. In the event of a failover, the Standby will
ensure that it has read all of the edits from the JournalNodes before promoting
itself to the Active state. This ensures that the namespace state is fully
synchronized before a failover occurs.
• In order to provide a fast failover, it is also necessary that the Standby node have
up-to-date information regarding the location of blocks in the cluster. In order to
achieve this, the DataNodes are configured with the location of both NameNodes,
and send block location information and heartbeats to both.
QJM (Hardware resources)
• NameNode machines - the machines on which you run the Active and
Standby NameNodes should have equivalent hardware to each other, and
equivalent hardware to what would be used in a non-HA cluster.
• JournalNode machines - the machines on which you run the
JournalNodes. The JournalNode daemon is relatively lightweight, so these
daemons may reasonably be collocated on machines with other Hadoop
daemons, for example NameNodes, the JobTracker, or the YARN
ResourceManager. Note: There must be at least 3 JournalNode daemons,
since edit log modifications must be written to a majority of JNs. This will
allow the system to tolerate the failure of a single machine. You may also
run more than 3 JournalNodes, but in order to actually increase the
number of failures the system can tolerate, you should run an odd number
of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the
system can tolerate at most (N - 1) / 2 failures and continue to function
normally.
Automatic failover
• With manual failover (the default HA configuration), the system will not automatically
trigger a failover from the active to the standby NameNode, even if the active node has failed.
• Automatic failover adds two new components to an HDFS deployment: a
ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
• Apache ZooKeeper is a highly available service for maintaining small amounts of
coordination data, notifying clients of changes in that data, and monitoring clients
for failures. The implementation of automatic HDFS failover relies on ZooKeeper
for the following things:
• Failure detection - each of the NameNode machines in the cluster maintains a
persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will
expire, notifying the other NameNode that a failover should be triggered.
• Active NameNode election - ZooKeeper provides a simple mechanism to
exclusively elect a node as active. If the current active NameNode crashes, another
node may take a special exclusive lock in ZooKeeper indicating that it should
become the next active.
• The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client
which also monitors and manages the state of the NameNode.
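A hedged sketch of administering HA from the shell on Hadoop 2.x (nn1 and nn2 are hypothetical NameNode IDs as configured in hdfs-site.xml):

# Ask each NameNode whether it is currently active or standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Initiate a graceful failover from nn1 to nn2 (manual-failover deployments)
hdfs haadmin -failover nn1 nn2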
Practice Session
HDFS Commands
• Print the Hadoop version
hadoop version
• List the contents of the root directory in HDFS
hadoop fs -ls /
• Report the amount of space used and
available on currently mounted filesystem
hadoop fs -df hdfs:/
HDFS Commands
• Count number of directories, files and bytes
for paths that match specified file pattern
hadoop fs -count hdfs:/
• Run a DFS filesystem checking utility
hadoop fsck /
• Run a cluster balancing utility
hadoop balancer
HDFS Commands
• Create a new directory named “hadoop”
below the /user/training directory in HDFS.
hadoop fs -mkdir /user/training/hadoop
• Add sample text file from local directory
named “data” to new directory created in
HDFS
hadoop fs -put data/sample.txt
/user/training/hadoop
HDFS Commands
• List the contents of this new directory in HDFS.
hadoop fs -ls /user/training/hadoop
• Add the entire local directory called “retail” to
the /user/training/hadoop directory in HDFS.
hadoop fs -put data/retail /user/training/hadoop
• See how much space this directory occupies in
HDFS.
hadoop fs -du -s -h hadoop/retail
HDFS Commands
• Move a directory from one location to other
hadoop fs -mv hadoop apache_hadoop
• Delete a file ‘customers’ from the “retail”
directory.
hadoop fs -rm hadoop/retail/customers
• Ensure this file is no longer in HDFS.
hadoop fs -ls hadoop/retail/customers
• Delete all files from the “retail” directory using a
wildcard.
hadoop fs -rm hadoop/retail/*
HDFS Commands
• To empty the trash
hadoop fs -expunge
• Finally, remove the entire retail directory and all
of its contents in HDFS.
hadoop fs -rm -r hadoop/retail
• Add the purchases.txt file from the local directory
named “/home/training/” to the hadoop
directory you created in HDFS
hadoop fs -copyFromLocal
/home/training/purchases.txt hadoop/
HDFS Commands
• To view contents of your text file purchases.txt
which is present in your hadoop directory.
hadoop fs -cat hadoop/purchases.txt
• Add purchases.txt file from HDFS directory to the
local directory
hadoop fs -copyToLocal hadoop/purchases.txt
/home/training/data
• Use ‘-chown’ to change owner name and group
name simultaneously
sudo -u hdfs hadoop fs -chown root:root
hadoop/purchases.txt
HDFS Commands
• Use ‘-setrep’ to change the replication
factor of a file (the default is 3)
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
• Command to make the name node leave safe
mode
sudo -u hdfs hdfs dfsadmin -safemode leave
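A hedged companion check before or after leaving safe mode (run as the HDFS superuser on most installations):

# Report whether the NameNode is currently in safe mode
sudo -u hdfs hdfs dfsadmin -safemode get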
HDFS Commands
Parallel Copying with distcp
• The HDFS access patterns that we have seen so far focus on single-
threaded access.
• Hadoop comes with a useful program called distcp for copying data to and
from Hadoop filesystems in parallel.
• distcp is an efficient replacement for hadoop fs -cp.
• Example,
hadoop distcp file1 file2
• You can also copy directories:
hadoop distcp dir1 dir2
(If dir2 does not exist, it will be created, and the contents of the dir1
directory will be copied there. You can specify multiple source paths, and all
will be copied to the destination. If dir2 already exists, then dir1 will be
copied under it, creating the directory structure dir2/dir1.)
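distcp is most useful for copying between clusters; a hedged example, with illustrative hostnames, ports, and paths:

# Copy a directory from one cluster's HDFS to another, using a parallel MapReduce job
hadoop distcp hdfs://namenode1:8020/user/training/retail hdfs://namenode2:8020/user/training/retail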
Thank You

• Question?
• Feedback?

[email protected]
