
Working with Big Data [UNIT-II]

 In a way, big data is exactly what it sounds like -- a lot of data. Since the advent of the Internet, we've been
producing data in staggering amounts. It's been estimated that in all the time leading up to the year 2003,
only 5 exabytes of data were generated -- that's equal to 5 billion gigabytes.
 But from 2003 to 2012, the amount reached around 2.7 zettabytes (or 2,700 exabytes, or 2.7 trillion gigabytes).
 According to Berkeley researchers, we are now producing roughly 5 quintillion bytes (about 5 exabytes) of data every two days.
 The term 'big data' is usually used to refer to massive, rapidly expanding, varied and often unstructured sets
of digitized data that are difficult to maintain using traditional databases.
 A distributed file system is a client/server-based application that allows clients to access and process data
stored on the server as if it were on their own computer. When a user accesses a file on the server, the
server sends the user a copy of the file, which is cached on the user's computer while the data is being
processed and is then returned to the server.
 Because more than one client may access the same data simultaneously, the server must have a mechanism in place (such as maintaining information about the times of access) to organize updates, so that the client always receives the most current version of the data and data conflicts do not arise. Distributed file systems typically use file or database replication (distributing copies of data on multiple servers) to protect against data access failures.
 Sun Microsystems' Network File System (NFS), Novell NetWare, Microsoft's Distributed File System, and
IBM/Transarc's DFS are some examples of distributed file systems.
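To make the caching idea concrete, here is a minimal Python sketch of a client that keeps a local copy of a server file and refreshes it only when the server reports a newer modification time. All class and method names are illustrative; this is not the API of any of the systems named above.

    # Minimal sketch of client-side caching in a distributed file system.
    # Names are illustrative, not a real DFS API.

    class FileServer:
        """Stands in for the remote server; maps name -> (mtime, data)."""
        def __init__(self):
            self.files = {}

        def stat(self, name):
            return self.files[name][0]       # cheap metadata-only call

        def fetch(self, name):
            return self.files[name]          # full (mtime, data) transfer

    class DFSClient:
        def __init__(self, server):
            self.server = server
            self.cache = {}                  # name -> (mtime, data)

        def read(self, name):
            mtime = self.server.stat(name)
            cached = self.cache.get(name)
            if cached and cached[0] == mtime:
                return cached[1]             # cached copy is still current
            self.cache[name] = self.server.fetch(name)
            return self.cache[name][1]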
 Google is a multi-billion dollar company. It's one of the big power players on the World Wide Web and
beyond. The company relies on a distributed computing system to provide users with the infrastructure
they need to access, create and alter data.
 Google uses the GFS to organize and manipulate huge files and to allow application developers the
research and development resources they require. The GFS is unique to Google and isn't for sale.

The Google File System (One master and multiple chunkservers)

 The challenge for the GFS team was to not only create an automatic monitoring system, but also to design
it so that it could work across a huge network of computers.
 The GFS team decided that users would have access to basic file commands. These include commands like
open, create, read, write and close files. The team also included a couple of specialized commands:
append and snapshot.
 A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients.
 Files on the GFS tend to be very large, usually in the multi-gigabyte (GB) range. Accessing and
manipulating files that large would take up a lot of the network's bandwidth.
 Bandwidth is the capacity of a system to move data from one location to another.
 The GFS addresses this problem by breaking files up into chunks of 64 megabytes (MB) each. Every chunk
receives a unique 64-bit identification number called a chunk handle assigned by the master at the time of
chunk creation.
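A rough Python sketch of this chunking scheme (illustrative only; GFS itself is proprietary, and its internals are not public beyond the published paper):

    import secrets

    CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the GFS chunk size

    def split_into_chunks(data: bytes):
        """Break a file's bytes into 64 MB chunks, each tagged with a
        64-bit chunk handle. Random bits stand in for the master's
        guarantee that every handle is unique."""
        chunks = {}
        for offset in range(0, len(data), CHUNK_SIZE):
            handle = secrets.randbits(64)
            chunks[handle] = data[offset:offset + CHUNK_SIZE]
        return chunks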
 While the GFS can process smaller files, its developers didn't optimize the system for those kinds of tasks.
 Google organized the GFS into clusters of computers. A cluster is simply a network of computers. Each
cluster might contain hundreds or even thousands of machines. Within GFS clusters there are three kinds of
entities: clients, master servers and chunkservers.
 In the world of GFS, the term "client" refers to any entity that makes a file request. Requests can range
from retrieving and manipulating existing files to creating new files on the system. Clients can be other
computers or computer applications. You can think of clients as the customers of the GFS.
 The master server acts as the coordinator for the cluster. The master's duties include maintaining an
operation log, which keeps track of the activities of the master's cluster. The operation log helps keep
service interruptions to a minimum -- if the master server crashes, a replacement server that has monitored
the operation log can take its place. The master server also keeps track of metadata, which is the
information that describes chunks. The metadata tells the master server to which files the chunks belong
and where they fit within the overall file. Upon startup, the master polls all the chunkservers in its cluster.
The chunkservers respond by telling the master server the contents of their inventories. From that moment
on, the master server keeps track of the location of chunks within the cluster.
 There's only one active master server per cluster at any one time (though each cluster has multiple copies of
the master server in case of a hardware failure). That might sound like a good recipe for a bottleneck --
after all, if there's only one machine coordinating a cluster of thousands of computers, wouldn't that cause
data traffic jams? The GFS gets around this sticky situation by keeping the messages the master server sends and receives very small. The master server doesn't actually handle file data at all; it leaves that up to the chunkservers.
 Chunkservers are the workhorses of the GFS. They're responsible for storing the 64-MB file chunks. The
chunkservers don't send chunks to the master server. Instead, they send requested chunks directly to the
client. The GFS copies every chunk multiple times and stores it on different chunkservers. Each copy is
called a replica. By default, the GFS makes three replicas per chunk, but users can change the setting and
make more or fewer replicas if desired.
 How do these elements work together during a routine process? Find out in the next section.

 Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range.
 For reliability, each chunk is replicated on multiple chunkservers.
 By default, we store three replicas, though users can designate different replication levels for different
regions of the file namespace.
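A toy sketch of replica placement (the real policy also weighs rack topology, disk utilization and recent creation activity, all ignored here):

    import random

    def place_replicas(chunk_handle, chunkservers, replication=3):
        """Pick `replication` distinct chunkservers to hold copies of
        the chunk; GFS defaults to three replicas per chunk."""
        return random.sample(chunkservers, replication)

    # Example: place_replicas("chunk-42", ["cs1", "cs2", "cs3", "cs4", "cs5"])
    # might return ["cs3", "cs1", "cs5"]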
Single Master
 The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
 GFS client code linked into each application implements the file system API and communicates with the
master and chunkservers to read or write data on behalf of the application. Clients interact with the master
for metadata operations, but all data-bearing communication goes directly to the chunkservers.
 Neither the client nor the chunkserver caches file data.
 Clients never read and write file data through the master. Instead, a client asks the master which
chunkservers it should contact. It caches this information for a limited time and interacts with the
chunkservers directly for many subsequent operations.
 The GFS separates replicas into two categories: primary replicas and secondary replicas. A primary
replica is the chunk that a chunkserver sends to a client. Secondary replicas serve as backups on other
chunkservers. The master server decides which chunks will act as primary or secondary. If the client makes
changes to the data in the chunk, then the master server lets the chunkservers with secondary replicas know
they have to copy the new chunk off the primary chunkserver to stay current.
 While there's only one active master server per GFS cluster, copies of the master server exist on other machines. Some copies, called shadow masters, provide limited read-only services even when the primary master server is down.
 The shadow master servers always lag a little behind the primary master server, but it's usually only a matter
of fractions of a second. If the primary master server fails and cannot restart, a secondary master server can
take its place.
 The GFS uses the unique chunk identifier to verify that each replica is valid. If one of the replica's handles
doesn't match the chunk handle, the master server creates a new replica and assigns it to a chunkserver.
 The master server also monitors the cluster as a whole and periodically rebalances the workload by shifting
chunks from one chunkserver to another. All chunkservers run at near capacity, but never at full capacity.
The master server also monitors chunks and verifies that each replica is current.
 If a replica doesn't match the chunk's identification number, the master server designates it as a stale replica.
The stale replica becomes garbage. After three days, the master server can delete a garbage chunk.
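A simplified sketch of this verification-and-garbage cycle. One hedge: the GFS paper actually detects stale replicas with per-chunk version numbers; the handle comparison below follows the description in the text above.

    import time

    GRACE_PERIOD = 3 * 24 * 3600    # three days, in seconds

    def verify_replicas(expected_handle, replicas, garbage):
        """Keep replicas whose handle matches; move mismatches to the
        garbage list, and delete garbage older than the grace period."""
        valid = []
        now = time.time()
        for replica in replicas:
            if replica["handle"] == expected_handle:
                valid.append(replica)
            else:
                garbage.append((now, replica))          # stale replica
        garbage[:] = [(t, r) for t, r in garbage
                      if now - t < GRACE_PERIOD]        # reap after 3 days
        return valid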
Client:
 First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
 Then, it sends the master a request containing the file name and chunk index.
 The master replies with the corresponding chunk handle and locations of the replicas.
 The client caches this information using the file name and chunk index as the key.
 The client then sends a request to one of the replicas, most likely the closest one.
 The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened.
 In fact, the client typically asks for multiple chunks in the same request and the master can also include the
information for chunks immediately following those requested. This extra information sidesteps several
future client-master interactions at practically no extra cost.
 If a client creates a write request that affects multiple chunks of a particularly large file, the GFS breaks the
overall write request up into an individual request for each chunk. The rest of the process is the same as a
normal write request.
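The read path described above can be condensed into a Python sketch like the following. The master.lookup and replica read calls are hypothetical stand-ins for the RPCs, and the sketch assumes the requested range stays within one chunk.

    CHUNK_SIZE = 64 * 1024 * 1024

    # (file_name, chunk_index) -> (chunk_handle, replica_locations)
    cache = {}

    def gfs_read(master, file_name, offset, length):
        # Translate the byte offset into a chunk index via the fixed chunk size.
        chunk_index = offset // CHUNK_SIZE
        key = (file_name, chunk_index)
        # Contact the master only on a cache miss, then cache its reply.
        if key not in cache:
            cache[key] = master.lookup(file_name, chunk_index)
        handle, replicas = cache[key]
        # Read the byte range directly from one replica (ideally the closest).
        return replicas[0].read(handle, offset % CHUNK_SIZE, length)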
Chunk Size
 Chunk size is one of the key design parameters.
 We have chosen 64 MB, which is much larger than typical file system block sizes.
 A large chunk size offers several important advantages:
 First, it reduces clients' need to interact with the master, because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially.
 Second, since a large chunk makes a client more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time.
 Third, it reduces the size of the metadata stored on the master.
 To detect data corruption, the GFS uses a system called checksumming. The system breaks each 64 MB chunk into blocks of 64 kilobytes (KB). Each block within a chunk has its own 32-bit checksum, which is sort of like a fingerprint.
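A short sketch of per-block checksumming. The GFS paper does not name the exact checksum function, so CRC-32 is assumed here simply because it is a common 32-bit choice:

    import zlib

    BLOCK_SIZE = 64 * 1024          # 64 KB blocks within each 64 MB chunk

    def block_checksums(chunk: bytes):
        """Compute a 32-bit checksum for every 64 KB block of a chunk."""
        return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk), BLOCK_SIZE)]

    def verify(chunk: bytes, checksums):
        """Re-compute and compare; any mismatch signals corruption."""
        return block_checksums(chunk) == checksums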
Metadata
 The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory.
 The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines.
 Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the
event of a master crash.
 The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
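A sketch of that startup poll (report_chunks is a hypothetical stand-in for the chunkserver's inventory reply):

    def rebuild_chunk_locations(chunkservers):
        """Ask every chunkserver which chunks it holds and build the
        handle -> servers map. Locations are never persisted, because
        each chunkserver has the final word on what is on its own disks."""
        locations = {}
        for server in chunkservers:
            for handle in server.report_chunks():
                locations.setdefault(handle, []).append(server)
        return locations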

Control Flow - Write

Step by step, how a lease is granted and a write on a chunk is performed:

1. The application sends the file name and data to the GFS client.
2. The GFS client sends the file name and chunk index to the master.
3. The master sends the identity of the primary and the other secondary replicas to the client. The client caches this information and contacts the master again only when the primary is unreachable or replies that it no longer holds the lease.
4. Considering the network topology, the client sends the data to all the replicas. This improves performance; GFS separates the data flow from the control flow. Replicas store the data in their LRU buffers until it is used.
5. After all replicas have received the data, the client sends a write request to the primary. The primary decides the mutation order and applies that order to its local copy.
6. The primary forwards the write request to all the secondary replicas, which perform the write according to the serial order decided by the primary.
7. After completing the operation, all secondaries acknowledge the primary.
8. The primary replies to the client about completion of the operation. If some of the secondaries fail to write, the client request is considered to have failed, which leaves the modified chunk inconsistent. The client handles this by retrying the failed mutation: it retries steps 3 to 7 before starting again from the beginning.

For a large write that spans multiple chunks, the client breaks the write into multiple write requests, one per chunk.
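The whole write flow can be compressed into a toy simulation like this, with leases and error handling reduced to a single boolean; all method names are invented for the sketch:

    def gfs_write(data, primary, secondaries):
        # Step 4: push the data to every replica first; it waits in each
        # chunkserver's LRU buffer until the write is committed.
        for replica in [primary] + secondaries:
            replica.buffer(data)
        # Step 5: the primary picks a serial mutation order and applies
        # it to its own local copy.
        order = primary.assign_mutation_order()
        primary.apply(order)
        # Steps 6-7: secondaries apply the same order and acknowledge.
        acks = [s.apply(order) for s in secondaries]
        # Step 8: one failed secondary fails the whole write; the client
        # would then retry steps 3-7.
        return all(acks)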

Control flow - Read

1. The application gives a file name and byte range to the GFS client.
2. The GFS client passes the file name and chunk index to the master.
3. The master sends the chunk handle and replica locations to the client.
4. The client sends a request specifying the chunk handle and byte range to the closest replica, which returns the data.

What is Big Data

Data that is very large in size is called big data. Normally we work on data of size MB (Word docs, Excel sheets) or at most GB (movies, code), but data in petabytes, i.e. 10^15 bytes in size, is called big data. It is stated that almost 90% of today's data has been generated in the past 3 years.

Sources of Big Data

This data comes from many sources, such as:

o Social networking sites: Facebook, Google and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites give very large volumes of data, which are stored and manipulated to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.

5V's of Big Data

1. Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data will double every 2 years.
2. Variety: It refers to the nature of the data: structured, semi-structured and unstructured. It also refers to heterogeneous sources; variety is basically the arrival of data from new sources, both inside and outside an enterprise.
   1. Structured data: Organized data whose length and format are defined, such as the rows of a relational table.
   2. Semi-structured data: Partially organized data that does not conform to a formal structure. Log files are an example of this type of data.
   3. Unstructured data: Unorganized data that doesn't fit neatly into the traditional row-and-column structure of a relational database. Texts, pictures, videos etc. are examples of unstructured data, which can't be stored in the form of rows and columns.
3. Volume: The amount of data we deal with is of very large size, on the order of petabytes.
4. Veracity: It refers to inconsistencies and uncertainty in the data; available data can sometimes get messy, and its quality and accuracy are difficult to control.
5. Value: Bulk data has no value on its own and is of no good to the company unless you turn it into something useful.

Hadoop is an open-source framework provided by Apache to process and analyze very huge volumes of data. It is written in Java and is used by companies such as Facebook, LinkedIn, Yahoo and Twitter.

The Hadoop ecosystem covers HDFS, MapReduce, Yarn, Hive, HBase, Pig, Sqoop and more.

Modules of Hadoop

1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. Files are broken into blocks and stored on nodes across the distributed architecture.
2. Yarn: Yet Another Resource Negotiator is used for job scheduling and managing the cluster.
3. Map Reduce: This is a framework that helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result. An example follows this list.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop modules.
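As a concrete taste of the Map/Reduce model, here is the classic word count written as two Hadoop Streaming scripts in Python. Hadoop Streaming is a standard part of Hadoop that lets any executable act as mapper or reducer; the file names mapper.py and reducer.py are our own choice.

    # mapper.py -- emit "word<TAB>1" for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- Hadoop delivers mapper output sorted by key,
    # so the count for each word can be summed in a single pass
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

A job would then be launched with something like hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /in -output /out (the exact jar path varies by installation).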

Advantages of Hadoop

o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node is down or some other network failure happens, Hadoop uses another copy of the data. Normally, data is replicated three times, but the replication factor is configurable (see the example below).
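For example, the default replication factor is set in hdfs-site.xml (3 is the default HDFS ships with), and an existing file's replication can also be changed at runtime with hdfs dfs -setrep:

    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>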

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.

Let's focus on the history of Hadoop in the following steps: -

o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch, an open-source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing that data was becoming very expensive, and this problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google File System), a proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released that year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo was running two clusters of 1,000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data, on a 900-node cluster within 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.

Hadoop Distributed File System: Building Blocks Of hadoop

The Hadoop Distributed File System (HDFS) is the distributed file system of Hadoop. It has a master/slave architecture, consisting of a single NameNode performing the role of master and multiple DataNodes performing the role of slaves.

Both the NameNode and DataNode are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the NameNode and DataNode software.

NameNode

o It is the single master server in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations like opening, renaming and closing files.
o Its single-master design simplifies the architecture of the system.

Secondary Name Node

It is not a failover node for the NameNode.

It is responsible for performing periodic housekeeping functions for the NameNode.

It only creates checkpoints of the file system metadata present in the NameNode.

DataNode

o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks, which are used to store the actual data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion and replication upon instruction from the NameNode, as the sketch after this list illustrates.
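A toy Python sketch of this division of labor (purely illustrative; real HDFS is written in Java and is far more involved). The key point it shows: the NameNode stores only the mapping, never the bytes.

    class NameNode:
        """Master: holds only metadata, the file -> blocks mapping."""
        def __init__(self):
            self.namespace = {}        # file name -> [(block_id, datanode)]

    class DataNode:
        """Slave: holds the actual block data."""
        def __init__(self):
            self.blocks = {}           # block_id -> bytes

    def write_file(namenode, datanodes, name, data,
                   block_size=128 * 1024 * 1024):   # 128 MB, recent HDFS default
        """Split a file into blocks, place each block on a DataNode
        (naive round-robin), and record only the mapping on the NameNode."""
        placed = []
        for i in range(0, len(data), block_size):
            block_id = f"{name}#{i // block_size}"
            node = datanodes[len(placed) % len(datanodes)]
            node.blocks[block_id] = data[i:i + block_size]
            placed.append((block_id, node))
        namenode.namespace[name] = placed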

Job Tracker

o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.

Task Tracker

o It works as a slave node for the Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
Hadoop Operation Modes
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of three supported modes −

 Local/Standalone Mode − After downloading Hadoop to your system, it is configured in standalone mode by default and can be run as a single Java process.
 Pseudo-Distributed Mode − A distributed simulation on a single machine. Each Hadoop daemon, such as HDFS, YARN and MapReduce, runs as a separate Java process. This mode is useful for development.
 Fully Distributed Mode − This mode is fully distributed, with a minimum of two or more machines as a cluster. We will come across this mode in detail in the coming chapters.
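For pseudo-distributed mode, the standard single-node setup from the Hadoop documentation points the default file system at localhost and drops the replication factor to 1, roughly as follows (port 9000 is the conventional choice):

    <!-- core-site.xml -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>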
