Big Data

Big data refers to large, complex datasets that are difficult to process using traditional database management tools or data processing applications. The challenges include capturing, storing, searching, sharing, transferring, analyzing, and visualizing large amounts of data. Hadoop is an open-source framework that allows distributed processing of large datasets across clusters of computers. It uses HDFS for storage and MapReduce as its processing layer. HDFS stores data across clusters as blocks and uses namenodes and datanodes to manage the file system and storage.


What is Big Data?

Lots of Data (Terabytes or Petabytes)


Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management
tools or traditional data processing applications.
The challenges include capture, storage, search, sharing, transfer,
analysis, and visualization.
NYSE generates about one terabyte of new trade data per day to perform
stock trading analytics to determine trends for optimal trades.
3 V's of Big Data

Volume, Velocity, and Variety. Examples of variety: web logs, audio, images, videos, and sensor data.

Veracity
Big data veracity refers to the biases, noise, and abnormality in data: is the data being stored and mined actually meaningful to the problem being analyzed? When scoping out a big data strategy, one needs to keep the data clean and put processes in place to keep ‘dirty data’ from accumulating in the systems.
Common Big Data Customer Scenarios

Web and e-tailing
  Recommendation Engines
  Ad Targeting
  Search Quality
  Abuse and Click Fraud Detection
Telecommunications
  Customer Churn Prevention
  Network Performance Optimization
  Calling Data Record (CDR) Analysis
  Analyzing Network to Predict Failure
Government
  Fraud Detection and Cyber Security
  Welfare Schemes
  Justice
Healthcare & Life Sciences
  Health Information Exchange
  Gene Sequencing
  Serialization
  Healthcare Service Quality Improvements
  Drug Safety
Banks and Financial Services
  Modeling True Risk
  Threat Analysis
  Fraud Detection
  Trade Surveillance
  Credit Scoring and Analysis
Retail
  Point of Sales Transaction Analysis
  Customer Churn Analysis
  Sentiment Analysis
Hidden Treasure
Insight into data can provide • Case Study: Sears Holding Corporation

Business Advantage.
Some key early indicators can
mean Fortunes to Business.
More Precise Analysis with
more data. X
Sears was using traditional
systems such as Oracle
Exadata, Teradata and SAS
etc. to store and process the
customer activity and sales
data.
Limitations of Existing Data Analytics Architecture

In the pre-Hadoop pipeline, data flows from Instrumentation to Collection, into a storage-only grid holding the original raw data (mostly append), through an ETL compute grid, and into an RDBMS holding only aggregated data that feeds BI reports and interactive apps. About 90% of the ~2 PB ends up in archived storage, and only a meagre 10% is available for BI.

Problems with this architecture:
1. Can't explore the original high-fidelity raw data.
2. Moving data to compute doesn't scale.
3. Premature data death: data is archived before it can be fully analyzed.

http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Solution: A Combined Storage and Compute Layer

With Hadoop as a combined storage + compute grid, data still flows from Instrumentation to Collection (mostly append), but the entire ~2 PB is kept in Hadoop for both storage and processing; nothing is archived away. Aggregated data can still be pushed to the RDBMS for BI reports and interactive apps.

Benefits:
1. Data exploration and advanced analytics on the full raw data.
2. Scalable throughput for ETL and aggregation.
3. Data is kept alive forever.

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the earlier non-Hadoop solution.
Hadoop Differentiating Factors

Accessible
Simple
Robust
Scalable
Why DFS?

Read 1 TB of data:

1 machine: 4 I/O channels, each channel 100 MB/s -> 45 minutes
10 machines: 4 I/O channels each, 100 MB/s per channel -> 4.5 minutes
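As a quick sanity check on these figures: 1 TB is roughly 1,000,000 MB, and one machine reads at 4 x 100 MB/s = 400 MB/s, so a full scan takes about 1,000,000 / 400 = 2,500 seconds, i.e. roughly 42 minutes (the ~45 minutes quoted above). Ten machines reading in parallel provide about 4,000 MB/s in aggregate, cutting the scan to about 250 seconds, i.e. roughly 4.2 minutes (~4.5 minutes).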


What is Hadoop?
Apache Hadoop is a
framework that allows
for the distributed
processing of large data
sets across clusters of
commodity computers
using a simple
programming model.

It is an open-source data management framework with scale-out storage and distributed processing.
Hadoop Eco-System

Apache Oozie (Workflow)
Hive - DW System
Pig Latin - Data Analysis
Mahout - Machine Learning
MapReduce Framework
HBase
HDFS (Hadoop Distributed File System)
Flume - import of unstructured or semi-structured data
Sqoop - import or export of structured data
Hadoop Core Components

MapReduce Engine: a Job Tracker that coordinates multiple Task Trackers.
HDFS Cluster: a Name Node (admin node) that manages multiple Data Nodes.
Hadoop Core Components
• Hadoop is a system for large-scale data processing. It has two main components:

HDFS - Hadoop Distributed File System (Storage)
  Distributed across “nodes”
  Natively redundant
  NameNode tracks locations

MapReduce (Processing)
  Splits a task across processors “near” the data and assembles results
  Self-healing, high bandwidth
  Clustered storage
  JobTracker manages the TaskTrackers
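A minimal sketch of bringing these two layers up on a Hadoop 1.x installation (assuming the start scripts are on the PATH and the cluster is already configured):

# Start the HDFS daemons (NameNode, DataNodes, SecondaryNameNode)
start-dfs.sh

# Start the MapReduce daemons (JobTracker, TaskTrackers)
start-mapred.sh

# List the Hadoop Java daemons running on this machine
jps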
HDFS Definition
• HDFS is a distributed, scalable, and portable filesystem written in Java for
the Hadoop framework.
• HDFS creates multiple replicas of data blocks and distributes them on
compute nodes throughout a cluster to enable reliable, extremely rapid
computations.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
• HDFS provides high throughput access to application data and is suitable
for applications that have large data sets
• HDFS consists of following components (daemons)
– Name Node
– Data Node
– Secondary Name Node
HDFS Components
Namenode:
•The NameNode, a master server, manages the file system namespace and regulates access to files by clients.
•Maintains and manages the blocks that are present on the datanodes.

Meta-data in Memory
  The entire metadata is kept in main memory
Types of Metadata
  List of files
  List of blocks for each file
  List of DataNodes for each block
  File attributes, e.g. creation time, replication factor
A Transaction Log
  Records file creations, file deletions, etc.
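The block metadata the NameNode maintains can be inspected from the shell; a hedged example (the path is illustrative, and the command may need HDFS superuser privileges):

# Show files, blocks, and block locations for everything under /user/training
hadoop fsck /user/training -files -blocks -locations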
HDFS Components
Data Node:
•DataNodes, one per node in the cluster, manage the storage attached to the nodes they run on.

A Block Server
  Stores data in the local file system (e.g. ext3)
  Stores meta-data of a block (e.g. CRC)
  Serves data and meta-data to clients
Block Report
  Periodically sends a report of all existing blocks to the NameNode
Facilitates Pipelining of Data
  Forwards data to other specified DataNodes
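To see what the DataNodes are reporting back to the NameNode, a hedged example (on Hadoop 2.x the equivalent command is hdfs dfsadmin -report):

# Print capacity, used space, and last-contact time for every DataNode in the cluster
hadoop dfsadmin -report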
Understanding the File system
Block placement
• Current Strategy
One replica on local node
Second replica on a remote rack
Third replica on same remote rack
Additional replicas are randomly placed
• Clients read from nearest replica

Data Correctness
• Use Checksums to validate data
Use CRC32
• File Creation
Client computes a checksum for every 512 bytes
DataNode stores the checksum
• File access
Client retrieves the data and checksum from DataNode
If Validation fails, Client tries other replicas
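A hedged illustration of checksums from the shell (the -checksum option is available on Hadoop 2.x and later; the path is illustrative):

# Print the file-level checksum HDFS derives from the per-block CRC32 checksums
hadoop fs -checksum /user/training/hadoop/sample.txt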
Understanding the File system
Data pipelining
Client retrieves a list of DataNodes on which to place replicas of a block
Client writes block to the first DataNode
The first DataNode forwards the data to the next DataNode in the Pipeline
When all replicas are written, the Client moves on to write the next block in file

Rebalancer
• Goal: % of disk occupied on Datanodes should be similar
Usually run when new Datanodes are added
Cluster is online when Rebalancer is active
Rebalancer is throttled to avoid network congestion
Command line tool
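A hedged example of invoking the rebalancer (the -threshold flag sets how far, in percentage points, a DataNode's utilization may deviate from the cluster average):

# Rebalance until every DataNode is within 5 percentage points of the cluster average
hadoop balancer -threshold 5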
Secondary NameNode
Secondary NameNode:
Not a hot standby for the NameNode
Connects to the NameNode periodically (every hour by default)
Housekeeping, backup of NameNode metadata
Saved metadata can build a failed NameNode
HDFS Architecture

[Architecture diagram] The NameNode holds the filesystem metadata (name, replicas, ...). Clients send metadata operations to the NameNode and perform block read/write operations directly against DataNodes. DataNodes are distributed across racks (Rack 1, Rack 2) and replicate blocks among themselves.
Anatomy of a File Write

1. The HDFS client calls create on the DistributedFileSystem.
2. The DistributedFileSystem asks the NameNode to create the file.
3. The client writes data to the output stream.
4. Data packets are written along a pipeline of DataNodes.
5. Acknowledgement packets flow back up the pipeline.
6. The client closes the stream.
7. The DistributedFileSystem notifies the NameNode that the write is complete.
Anatomy of a File Read

1. The HDFS client calls open on the DistributedFileSystem.
2. The DistributedFileSystem gets the block locations from the NameNode.
3. The client reads from the FSDataInputStream.
4, 5. Data is read directly from the DataNodes that hold the blocks, nearest replica first.
6. The client closes the stream.
Replication and Rack Awareness
Blocks are replicated across DataNodes on different racks, following the block placement strategy described earlier.

Hadoop Cluster Architecture (Contd.)

The client talks to two master services: the Name Node (HDFS) and the Job Tracker (MapReduce). Each slave machine runs both a Data Node and a Task Tracker, so tasks can be scheduled close to the data they need.
Hadoop Cluster: A Typical Use Case

Name Node: RAM 64 GB, hard disk 1 TB, Xeon processor with 8 cores, Ethernet 3 x 10 Gb/s, 64-bit CentOS, redundant power supply.

Secondary Name Node: RAM 32 GB, hard disk 1 TB, Xeon processor with 4 cores, Ethernet 3 x 10 Gb/s, 64-bit CentOS, redundant power supply.

Data Nodes: RAM 16 GB, hard disks 6 x 2 TB, Xeon processor with 2 cores, Ethernet 3 x 10 Gb/s, 64-bit CentOS.
Hadoop Cluster: Facebook
Hadoop Cluster Modes
Hadoop can run in three modes: standalone (local) mode, pseudo-distributed mode (all daemons on one machine), and fully distributed mode (daemons spread across the cluster).
Hadoop 1.x: Core Configuration Files

Core: core-site.xml
HDFS: hdfs-site.xml
MapReduce: mapred-site.xml
Hadoop Configuration Files

Configuration Filename - Description

hadoop-env.sh - Environment variables that are used in the scripts to run Hadoop.

core-site.xml - Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

hdfs-site.xml - Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.

mapred-site.xml - Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.

masters - A list of machines (one per line) that each run a secondary namenode.

slaves - A list of machines (one per line) that each run a datanode and a tasktracker.
core-site.xml and hdfs-site.xml

core-site.xml:

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>

hdfs-site.xml:

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
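A hedged way to confirm which values are actually in effect at runtime (the hdfs getconf tool ships with Hadoop 2.x and later):

# Print the effective default filesystem and replication factor
hdfs getconf -confKey fs.default.name
hdfs getconf -confKey dfs.replication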
Defining HDFS Details in hdfs-site.xml

dfs.data.dir
  Example value: /disk1/hdfs/data, /disk2/hdfs/data
  Default: ${hadoop.tmp.dir}/dfs/data
  A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.

fs.checkpoint.dir
  Example value: /disk1/hdfs/namesecondary, /disk2/hdfs/namesecondary
  Default: ${hadoop.tmp.dir}/dfs/namesecondary
  A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
Defining mapred-site.xml

mapred.job.tracker
  Example value: localhost:8021
  The hostname and port that the jobtracker RPC server runs on. If set to the default value of "local", the jobtracker runs in-process on demand when you run a MapReduce job.

mapred.local.dir
  Default: ${hadoop.tmp.dir}/mapred/local
  A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

mapred.system.dir
  Default: ${hadoop.tmp.dir}/mapred/system
  The directory, relative to fs.default.name, where shared files are stored during a job run.

mapred.tasktracker.map.tasks.maximum
  Default: 2
  The number of map tasks that may be run on a tasktracker at any one time.

mapred.tasktracker.reduce.tasks.maximum
  Default: 2
  The number of reduce tasks that may be run on a tasktracker at any one time.
Slaves and Masters
Two files are used by the startup and shutdown commands:

Slaves

Contains a list of hosts, one per line, that are to host DataNode and
TaskTracker servers.

Masters

Contains a list of hosts, one per line, that are to host Secondary
NameNode servers.
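A hedged illustration of what these files look like, using hypothetical hostnames and the conventional conf/ directory:

# masters - one host per line; runs the Secondary NameNode
cat conf/masters
snn-host-01

# slaves - one host per line; each runs a DataNode and a TaskTracker
cat conf/slaves
slave-host-01
slave-host-02
slave-host-03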

All Properties

1. http://hadoop.apache.org/docs/r1.1.2/core-default.html
2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html
Web UI URLs
NameNode status:
http://localhost:50070/dfshealth.jsp

JobTracker status:
http://localhost:50030/jobtracker.jsp

TaskTracker status:
http://localhost:50060/tasktracker.jsp

DataBlock Scanner Report:
http://localhost:50075/blockScannerReport
Coherency Model
• HDFS supports a coherency model for a filesystem that describes the
data visibility of reads and writes for a file.
• After creating a file, it is visible in the filesystem namespace, as
expected. However, any content written to the file is not guaranteed to be
visible, even if the stream is flushed. So the file appears to have a length
of zero until more than the first block’s worth of data has been written.
• In HDFS, once more than a block’s worth of data has been written, the
first block will be visible to new readers. This is true of subsequent
blocks, too: it is always the current block being written that is not
visible to other readers.
• HDFS provides a way to force all buffers to be flushed to the datanodes
via the hflush() method on FSDataOutputStream. After a successful return
from hflush(), HDFS guarantees that the data written up to that point in
the file has reached all the datanodes in the write pipeline and is visible
to all new readers.
HDFS High Availability
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in
an HDFS cluster. Each cluster had a single NameNode, and if that machine or
process became unavailable, the cluster as a whole would be unavailable until
the NameNode was either restarted or brought up on a separate machine.
• This impacted the total availability of the HDFS cluster in two major ways:
In the case of an unplanned event such as a machine crash, the cluster would
be unavailable until an operator restarted the NameNode.
Planned maintenance events such as software or hardware upgrades on the
NameNode machine would result in windows of cluster downtime.
The HDFS High Availability feature addresses the above problems by providing
the option of running two redundant NameNodes in the same cluster in an
Active/Passive configuration with a hot standby.
• HDFS HA is possible by using:
NFS for shared storage
Quorum Journal Manager (QJM)
NFS-HA (Architecture)
• In a typical HA cluster, two separate machines are configured as NameNodes. At
any point in time, exactly one of the NameNodes is in an Active state, and the
other is in a Standby state. The Active NameNode is responsible for all client
operations in the cluster, while the Standby is simply acting as a slave, maintaining
enough state to provide a fast failover if necessary.
• When any namespace modification is performed by the Active node, it durably
logs a record of the modification to an edit log file stored in the shared directory.
The Standby node is constantly watching this directory for edits, and as it sees the
edits, it applies them to its own namespace. In the event of a failover, the Standby
will ensure that it has read all of the edits from the shared storage before
promoting itself to the Active state. This ensures that the namespace state is fully
synchronized before a failover occurs.
• In order to provide a fast failover, it is also necessary that the Standby node have
up-to-date information regarding the location of blocks in the cluster. In order to
achieve this, the DataNodes are configured with the location of both NameNodes,
and send block location information and heartbeats to both.
NFS for shared storage (Hardware resources)

• In order to deploy an HA cluster, you should prepare the following:

• NameNode machines - the machines on which you run the Active and Standby
NameNodes should have equivalent hardware to each other, and equivalent
hardware to what would be used in a non-HA cluster.
• Shared storage - you will need to have a shared directory which both NameNode
machines can have read/write access to. Typically this is a remote filer which
supports NFS and is mounted on each of the NameNode machines.
• Currently only a single shared edits directory is supported. Because of this, it is
recommended that the shared storage server be a high-quality dedicated NAS
appliance rather than a simple Linux server.
• Note that, in an HA cluster, the Standby NameNode also performs checkpoints of
the namespace state, and thus it is not necessary to run a Secondary NameNode,
CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an
error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to
be HA-enabled to reuse the hardware which they had previously dedicated to the
Secondary NameNode.
QJM-HA (Architecture)
• In a typical HA cluster, two separate machines are configured as NameNodes. At
any point in time, exactly one of the NameNodes is in an Active state, and the
other is in a Standby state. The Active NameNode is responsible for all client
operations in the cluster, while the Standby is simply acting as a slave, maintaining
enough state to provide a fast failover if necessary.
• In order for the Standby node to keep its state synchronized with the Active node,
both nodes communicate with a group of separate daemons called
“JournalNodes” (JNs). When any namespace modification is performed by the
Active node, it durably logs a record of the modification to a majority of these JNs.
The Standby node is capable of reading the edits from the JNs, and is constantly
watching them for changes to the edit log. As the Standby Node sees the edits, it
applies them to its own namespace. In the event of a failover, the Standby will
ensure that it has read all of the edits from the JournalNodes before promoting
itself to the Active state. This ensures that the namespace state is fully
synchronized before a failover occurs.
• In order to provide a fast failover, it is also necessary that the Standby node have
up-to-date information regarding the location of blocks in the cluster. In order to
achieve this, the DataNodes are configured with the location of both NameNodes,
and send block location information and heartbeats to both.
QJM (Hardware resources)
• NameNode machines - the machines on which you run the Active and
Standby NameNodes should have equivalent hardware to each other, and
equivalent hardware to what would be used in a non-HA cluster.
• JournalNode machines - the machines on which you run the
JournalNodes. The JournalNode daemon is relatively lightweight, so these
daemons may reasonably be collocated on machines with other Hadoop
daemons, for example NameNodes, the JobTracker, or the YARN
ResourceManager. Note: There must be at least 3 JournalNode daemons,
since edit log modifications must be written to a majority of JNs. This will
allow the system to tolerate the failure of a single machine. You may also
run more than 3 JournalNodes, but in order to actually increase the
number of failures the system can tolerate, you should run an odd number
of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the
system can tolerate at most (N - 1) / 2 failures and continue to function
normally.
Automatic failover
• With manual failover (the default HA configuration), the system will not automatically
trigger a failover from the active to the standby NameNode, even if the active node has failed.
• Automatic failover adds two new components to an HDFS deployment: a
ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
• Apache ZooKeeper is a highly available service for maintaining small amounts of
coordination data, notifying clients of changes in that data, and monitoring clients
for failures. The implementation of automatic HDFS failover relies on ZooKeeper
for the following things:
• Failure detection - each of the NameNode machines in the cluster maintains a
persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will
expire, notifying the other NameNode that a failover should be triggered.
• Active NameNode election - ZooKeeper provides a simple mechanism to
exclusively elect a node as active. If the current active NameNode crashes, another
node may take a special exclusive lock in ZooKeeper indicating that it should
become the next active.
• The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client
which also monitors and manages the state of the NameNode.
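A hedged sketch of administering HA from the shell on Hadoop 2.x (nn1 and nn2 are hypothetical NameNode IDs as configured in hdfs-site.xml):

# Ask each NameNode whether it is currently active or standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Initiate a graceful failover from nn1 to nn2 (manual-failover deployments)
hdfs haadmin -failover nn1 nn2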
Practice Session
HDFS Commands
• Print the Hadoop version
hadoop version
• List the contents of the root directory in HDFS
hadoop fs -ls /
• Report the amount of space used and
available on currently mounted filesystem
hadoop fs -df hdfs:/
HDFS Commands
• Count number of directories, files and bytes
for paths that match specified file pattern
hadoop fs -count hdfs:/
• Run a DFS filesystem checking utility
hadoop fsck /
• Run a cluster balancing utility
hadoop balancer
HDFS Commands
• Create a new directory named “hadoop”
below the /user/training directory in HDFS.
hadoop fs -mkdir /user/training/hadoop
• Add sample text file from local directory
named “data” to new directory created in
HDFS
hadoop fs -put data/sample.txt
/user/training/hadoop
HDFS Commands
• List the contents of this new directory in HDFS.
hadoop fs -ls /user/training/hadoop
• Add the entire local directory called “retail” to
the /user/training/hadoop directory in HDFS.
hadoop fs -put data/retail /user/training/hadoop
• See how much space this directory occupies in
HDFS.
hadoop fs -du -s -h hadoop/retail
HDFS Commands
• Move a directory from one location to other
hadoop fs -mv hadoop apache_hadoop
• Delete a file ‘customers’ from the “retail”
directory.
hadoop fs -rm hadoop/retail/customers
• Ensure this file is no longer in HDFS.
hadoop fs -ls hadoop/retail/customers
• Delete all files from the “retail” directory using a
wildcard.
hadoop fs -rm hadoop/retail/*
HDFS Commands
• To empty the trash
hadoop fs -expunge
• Finally, remove the entire retail directory and all
of its contents in HDFS.
hadoop fs -rm -r hadoop/retail
• Add the purchases.txt file from the local directory
named “/home/training/” to the hadoop
directory you created in HDFS
hadoop fs -copyFromLocal
/home/training/purchases.txt hadoop/
HDFS Commands
• To view contents of your text file purchases.txt
which is present in your hadoop directory.
hadoop fs -cat hadoop/purchases.txt
• Add purchases.txt file from HDFS directory to the
local directory
hadoop fs -copyToLocal hadoop/purchases.txt
/home/training/data
• Use ‘-chown’ to change owner name and group
name simultaneously
sudo -u hdfs hadoop fs -chown root:root
hadoop/purchases.txt
HDFS Commands
• Use ‘-setrep’ to change the replication
factor of a file (the default is 3)
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
• Command to make the name node leave safe
mode
sudo -u hdfs hdfs dfsadmin -safemode leave
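A hedged companion check before or after leaving safe mode (run as the HDFS superuser on most installations):

# Report whether the NameNode is currently in safe mode
sudo -u hdfs hdfs dfsadmin -safemode get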
HDFS Commands
Parallel Copying with distcp
• The HDFS access patterns that we have seen so far focus on single-
threaded access.
• Hadoop comes with a useful program called distcp for copying data to and
from Hadoop filesystems in parallel.
• distcp is an efficient replacement for hadoop fs -cp.
• Example,
hadoop distcp file1 file2
• You can also copy directories:
hadoop distcp dir1 dir2
(If dir2 does not exist, it will be created, and the contents of the dir1
directory will be copied there. You can specify multiple source paths, and all
will be copied to the destination. If dir2 already exists, then dir1 will be
copied under it, creating the directory structure dir2/dir1.)
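distcp is most useful for copying between clusters; a hedged example, with illustrative hostnames, ports, and paths:

# Copy a directory from one cluster's HDFS to another, using a parallel MapReduce job
hadoop distcp hdfs://namenode1:8020/user/training/retail hdfs://namenode2:8020/user/training/retail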
Thank You

• Question?
• Feedback?

[email protected]
