
Working with Big Data [UNIT-II]

 In a way, big data is exactly what it sounds like -- a lot of data. Since the advent of the Internet, we've been
producing data in staggering amounts. It's been estimated that in all the time leading up to the year 2003,
only 5 exabytes of data were generated -- that's equal to 5 billion gigabytes.
 But from 2003 to 2012, the amount reached around 2.7 zettabytes (or 2,700 exabytes, or 2.7 trillion gigabytes).
 According to Berkeley researchers, we are now producing roughly 5 quintillion bytes (about 5 exabytes) of data every two days.
 The term 'big data' is usually used to refer to massive, rapidly expanding, varied and often unstructured sets
of digitized data that are difficult to maintain using traditional databases.
 A distributed file system is a client/server-based application that allows clients to access and process data
stored on the server as if it were on their own computer. When a user accesses a file on the server, the
server sends the user a copy of the file, which is cached on the user's computer while the data is being
processed and is then returned to the server.
 Because more than one client may access the same data simultaneously, the server must have a mechanism in place (such as maintaining information about the times of access) to organize updates, so that the client always receives the most current version of the data and data conflicts do not arise. Distributed file systems typically use file or database replication (distributing copies of data on multiple servers) to protect against data access failures.
 Sun Microsystems' Network File System (NFS), Novell NetWare, Microsoft's Distributed File System, and
IBM/Transarc's DFS are some examples of distributed file systems.
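To make the caching idea concrete, here is a minimal Python sketch of a client that keeps a local copy of a server file and refreshes it only when the server reports a newer modification time. All class and method names are illustrative; this is not the API of any of the systems named above.

    # Minimal sketch of client-side caching in a distributed file system.
    # Names are illustrative, not a real DFS API.

    class FileServer:
        """Stands in for the remote server; maps name -> (mtime, data)."""
        def __init__(self):
            self.files = {}

        def stat(self, name):
            return self.files[name][0]       # cheap metadata-only call

        def fetch(self, name):
            return self.files[name]          # full (mtime, data) transfer

    class DFSClient:
        def __init__(self, server):
            self.server = server
            self.cache = {}                  # name -> (mtime, data)

        def read(self, name):
            mtime = self.server.stat(name)
            cached = self.cache.get(name)
            if cached and cached[0] == mtime:
                return cached[1]             # cached copy is still current
            self.cache[name] = self.server.fetch(name)
            return self.cache[name][1]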
 Google is a multi-billion dollar company. It's one of the big power players on the World Wide Web and
beyond. The company relies on a distributed computing system to provide users with the infrastructure
they need to access, create and alter data.
 Google uses the GFS to organize and manipulate huge files and to allow application developers the
research and development resources they require. The GFS is unique to Google and isn't for sale.

The Google File System (One master and multiple chunkservers)

 The challenge for the GFS team was to not only create an automatic monitoring system, but also to design
it so that it could work across a huge network of computers.
 The GFS team decided that users would have access to basic file commands. These include commands like
open, create, read, write and close files. The team also included a couple of specialized commands:
append and snapshot.
 A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients.
 Files on the GFS tend to be very large, usually in the multi-gigabyte (GB) range. Accessing and
manipulating files that large would take up a lot of the network's bandwidth.
 Bandwidth is the capacity of a system to move data from one location to another.
 The GFS addresses this problem by breaking files up into chunks of 64 megabytes (MB) each. Every chunk
receives a unique 64-bit identification number called a chunk handle assigned by the master at the time of
chunk creation.
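A rough Python sketch of this chunking scheme (illustrative only; GFS itself is proprietary, and its internals are not public beyond the published paper):

    import secrets

    CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the GFS chunk size

    def split_into_chunks(data: bytes):
        """Break a file's bytes into 64 MB chunks, each tagged with a
        64-bit chunk handle. Random bits stand in for the master's
        guarantee that every handle is unique."""
        chunks = {}
        for offset in range(0, len(data), CHUNK_SIZE):
            handle = secrets.randbits(64)
            chunks[handle] = data[offset:offset + CHUNK_SIZE]
        return chunks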
 While the GFS can process smaller files, its developers didn't optimize the system for those kinds of tasks.
 Google organized the GFS into clusters of computers. A cluster is simply a network of computers. Each
cluster might contain hundreds or even thousands of machines. Within GFS clusters there are three kinds of
entities: clients, master servers and chunkservers.
 In the world of GFS, the term "client" refers to any entity that makes a file request. Requests can range
from retrieving and manipulating existing files to creating new files on the system. Clients can be other
computers or computer applications. You can think of clients as the customers of the GFS.
 The master server acts as the coordinator for the cluster. The master's duties include maintaining an
operation log, which keeps track of the activities of the master's cluster. The operation log helps keep
service interruptions to a minimum -- if the master server crashes, a replacement server that has monitored
the operation log can take its place. The master server also keeps track of metadata, which is the
information that describes chunks. The metadata tells the master server to which files the chunks belong
and where they fit within the overall file. Upon startup, the master polls all the chunkservers in its cluster.
The chunkservers respond by telling the master server the contents of their inventories. From that moment
on, the master server keeps track of the location of chunks within the cluster.
 There's only one active master server per cluster at any one time (though each cluster has multiple copies of
the master server in case of a hardware failure). That might sound like a good recipe for a bottleneck --
after all, if there's only one machine coordinating a cluster of thousands of computers, wouldn't that cause
data traffic jams? The GFS gets around this sticky situation by keeping the messages the master server sends and receives very small. The master server doesn't actually handle file data at all; it leaves that up to the chunkservers.
 Chunkservers are the workhorses of the GFS. They're responsible for storing the 64-MB file chunks. The
chunkservers don't send chunks to the master server. Instead, they send requested chunks directly to the
client. The GFS copies every chunk multiple times and stores it on different chunkservers. Each copy is
called a replica. By default, the GFS makes three replicas per chunk, but users can change the setting and
make more or fewer replicas if desired.
 How do these elements work together during a routine process? Find out in the next section.

 Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range.
 For reliability, each chunk is replicated on multiple chunkservers.
 By default, we store three replicas, though users can designate different replication levels for different
regions of the file namespace.
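A toy sketch of replica placement (the real policy also weighs rack topology, disk utilization and recent creation activity, all ignored here):

    import random

    def place_replicas(chunk_handle, chunkservers, replication=3):
        """Pick `replication` distinct chunkservers to hold copies of
        the chunk; GFS defaults to three replicas per chunk."""
        return random.sample(chunkservers, replication)

    # Example: place_replicas("chunk-42", ["cs1", "cs2", "cs3", "cs4", "cs5"])
    # might return ["cs3", "cs1", "cs5"]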
Single Master
 The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
 GFS client code linked into each application implements the file system API and communicates with the
master and chunkservers to read or write data on behalf of the application. Clients interact with the master
for metadata operations, but all data-bearing communication goes directly to the chunkservers.
 Neither the client nor the chunkserver caches file data.
 Clients never read and write file data through the master. Instead, a client asks the master which
chunkservers it should contact. It caches this information for a limited time and interacts with the
chunkservers directly for many subsequent operations.
 The GFS separates replicas into two categories: primary replicas and secondary replicas. A primary
replica is the chunk that a chunkserver sends to a client. Secondary replicas serve as backups on other
chunkservers. The master server decides which chunks will act as primary or secondary. If the client makes
changes to the data in the chunk, then the master server lets the chunkservers with secondary replicas know
they have to copy the new chunk off the primary chunkserver to stay current.
 While there's only one active master server per GFS cluster, copies of the master server exist on other machines. Some copies, called shadow masters, provide limited read-only services even when the primary master server is down.
 The shadow master servers always lag a little behind the primary master server, but it's usually only a matter
of fractions of a second. If the primary master server fails and cannot restart, a secondary master server can
take its place.
 The GFS uses the unique chunk identifier to verify that each replica is valid. If one of the replica's handles
doesn't match the chunk handle, the master server creates a new replica and assigns it to a chunkserver.
 The master server also monitors the cluster as a whole and periodically rebalances the workload by shifting
chunks from one chunkserver to another. All chunkservers run at near capacity, but never at full capacity.
The master server also monitors chunks and verifies that each replica is current.
 If a replica doesn't match the chunk's identification number, the master server designates it as a stale replica.
The stale replica becomes garbage. After three days, the master server can delete a garbage chunk.
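A simplified sketch of this verification-and-garbage cycle. One hedge: the GFS paper actually detects stale replicas with per-chunk version numbers; the handle comparison below follows the description in the text above.

    import time

    GRACE_PERIOD = 3 * 24 * 3600    # three days, in seconds

    def verify_replicas(expected_handle, replicas, garbage):
        """Keep replicas whose handle matches; move mismatches to the
        garbage list, and delete garbage older than the grace period."""
        valid = []
        now = time.time()
        for replica in replicas:
            if replica["handle"] == expected_handle:
                valid.append(replica)
            else:
                garbage.append((now, replica))          # stale replica
        garbage[:] = [(t, r) for t, r in garbage
                      if now - t < GRACE_PERIOD]        # reap after 3 days
        return valid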
Client:
 First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
 Then, it sends the master a request containing the file name and chunk index.
 The master replies with the corresponding chunk handle and locations of the replicas.
 The client caches this information using the file name and chunk index as the key.
 The client then sends a request to one of the replicas, most likely the closest one.
 The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened.
 In fact, the client typically asks for multiple chunks in the same request and the master can also include the
information for chunks immediately following those requested. This extra information sidesteps several
future client-master interactions at practically no extra cost.
 If a client creates a write request that affects multiple chunks of a particularly large file, the GFS breaks the
overall write request up into an individual request for each chunk. The rest of the process is the same as a
normal write request.
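The read path described above can be condensed into a Python sketch like the following. The master.lookup and replica read calls are hypothetical stand-ins for the RPCs, and the sketch assumes the requested range stays within one chunk.

    CHUNK_SIZE = 64 * 1024 * 1024

    # (file_name, chunk_index) -> (chunk_handle, replica_locations)
    cache = {}

    def gfs_read(master, file_name, offset, length):
        # Translate the byte offset into a chunk index via the fixed chunk size.
        chunk_index = offset // CHUNK_SIZE
        key = (file_name, chunk_index)
        # Contact the master only on a cache miss, then cache its reply.
        if key not in cache:
            cache[key] = master.lookup(file_name, chunk_index)
        handle, replicas = cache[key]
        # Read the byte range directly from one replica (ideally the closest).
        return replicas[0].read(handle, offset % CHUNK_SIZE, length)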
Chunk Size
 Chunk size is one of the key design parameters.
 We have chosen 64 MB, which is much larger than typical file system block sizes.
 A large chunk size offers several important advantages:
 First, it reduces clients' need to interact with the master, because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially.
 Second, since a large chunk makes a client more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time.
 Third, it reduces the size of the metadata stored on the master.
 To detect data corruption, the GFS uses a system called checksumming. The system breaks each 64 MB chunk into blocks of 64 kilobytes (KB). Each block within a chunk has its own 32-bit checksum, which is sort of like a fingerprint.
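A short sketch of per-block checksumming. The GFS paper does not name the exact checksum function, so CRC-32 is assumed here simply because it is a common 32-bit choice:

    import zlib

    BLOCK_SIZE = 64 * 1024          # 64 KB blocks within each 64 MB chunk

    def block_checksums(chunk: bytes):
        """Compute a 32-bit checksum for every 64 KB block of a chunk."""
        return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk), BLOCK_SIZE)]

    def verify(chunk: bytes, checksums):
        """Re-compute and compare; any mismatch signals corruption."""
        return block_checksums(chunk) == checksums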
Metadata
 The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory.
 The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines.
 Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the
event of a master crash.
 The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
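A sketch of that startup poll (report_chunks is a hypothetical stand-in for the chunkserver's inventory reply):

    def rebuild_chunk_locations(chunkservers):
        """Ask every chunkserver which chunks it holds and build the
        handle -> servers map. Locations are never persisted, because
        each chunkserver has the final word on what is on its own disks."""
        locations = {}
        for server in chunkservers:
            for handle in server.report_chunks():
                locations.setdefault(handle, []).append(server)
        return locations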

Control Flow - Write

Step by step, how a lease is granted and a write on a chunk is performed:

1. The application sends the file name and data to the GFS client.
2. The GFS client sends the file name and chunk index to the master.
3. The master sends the identity of the primary and the other secondary replicas to the client. The client caches this information and contacts the master again only when the primary is unreachable or replies that it no longer holds the lease.
4. Considering the network topology, the client sends the data to all the replicas. This improves performance; GFS separates the data flow from the control flow. Replicas store the data in their LRU buffers until it is used.
5. After all replicas have received the data, the client sends a write request to the primary. The primary decides the mutation order and applies that order to its local copy.
6. The primary forwards the write request to all the secondary replicas, which perform the write according to the serial order decided by the primary.
7. After completing the operation, all secondaries acknowledge the primary.
8. The primary replies to the client about completion of the operation. If some of the secondaries fail to write, the client request is considered to have failed, which leaves the modified chunk inconsistent. The client handles this by retrying the failed mutation: it retries steps 3 to 7 before starting again from the beginning.

For a large write that spans multiple chunks, the client breaks the write into multiple write requests, one per chunk.
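The whole write flow can be compressed into a toy simulation like this, with leases and error handling reduced to a single boolean; all method names are invented for the sketch:

    def gfs_write(data, primary, secondaries):
        # Step 4: push the data to every replica first; it waits in each
        # chunkserver's LRU buffer until the write is committed.
        for replica in [primary] + secondaries:
            replica.buffer(data)
        # Step 5: the primary picks a serial mutation order and applies
        # it to its own local copy.
        order = primary.assign_mutation_order()
        primary.apply(order)
        # Steps 6-7: secondaries apply the same order and acknowledge.
        acks = [s.apply(order) for s in secondaries]
        # Step 8: one failed secondary fails the whole write; the client
        # would then retry steps 3-7.
        return all(acks)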

Control flow - Read

1. The application gives a file name and byte range to the GFS client.
2. The GFS client passes the file name and chunk index to the master.
3. The master sends the chunk handle and replica locations to the client.
4. The client sends a request specifying the chunk handle and byte range to the closest replica, which returns the data.

What is Big Data

Data that is very large in size is called big data. Normally we work on data of size MB (Word docs, Excel sheets) or at most GB (movies, code), but data in petabytes, i.e. 10^15 bytes in size, is called big data. It is stated that almost 90% of today's data has been generated in the past 3 years.

Sources of Big Data

This data comes from many sources, such as:

o Social networking sites: Facebook, Google and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites give very large volumes of data, which are stored and manipulated to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.

5V's of Big Data

1. Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data will double every 2 years.
2. Variety: It refers to the nature of the data: structured, semi-structured and unstructured. It also refers to heterogeneous sources; variety is basically the arrival of data from new sources, both inside and outside an enterprise.
   1. Structured data: Organized data whose length and format are defined, such as the rows of a relational table.
   2. Semi-structured data: Partially organized data that does not conform to a formal structure. Log files are an example of this type of data.
   3. Unstructured data: Unorganized data that doesn't fit neatly into the traditional row-and-column structure of a relational database. Texts, pictures, videos etc. are examples of unstructured data, which can't be stored in the form of rows and columns.
3. Volume: The amount of data we deal with is of very large size, on the order of petabytes.
4. Veracity: It refers to inconsistencies and uncertainty in the data; available data can sometimes get messy, and its quality and accuracy are difficult to control.
5. Value: Bulk data has no value on its own and is of no good to the company unless you turn it into something useful.

Hadoop is an open-source framework provided by Apache to process and analyze very huge volumes of data. It is written in Java and is used by companies such as Facebook, LinkedIn, Yahoo and Twitter.

The Hadoop ecosystem covers HDFS, MapReduce, Yarn, Hive, HBase, Pig, Sqoop and more.

Modules of Hadoop

1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. Files are broken into blocks and stored on nodes across the distributed architecture.
2. Yarn: Yet Another Resource Negotiator is used for job scheduling and managing the cluster.
3. Map Reduce: This is a framework that helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result. An example follows this list.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop modules.
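As a concrete taste of the Map/Reduce model, here is the classic word count written as two Hadoop Streaming scripts in Python. Hadoop Streaming is a standard part of Hadoop that lets any executable act as mapper or reducer; the file names mapper.py and reducer.py are our own choice.

    # mapper.py -- emit "word<TAB>1" for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- Hadoop delivers mapper output sorted by key,
    # so the count for each word can be summed in a single pass
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

A job would then be launched with something like hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /in -output /out (the exact jar path varies by installation).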

Advantages of Hadoop

o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node is down or some other network failure happens, Hadoop uses another copy of the data. Normally, data is replicated three times, but the replication factor is configurable (see the example below).
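For example, the default replication factor is set in hdfs-site.xml (3 is the default HDFS ships with), and an existing file's replication can also be changed at runtime with hdfs dfs -setrep:

    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>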

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.

Let's focus on the history of Hadoop in the following steps: -

o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch, an open-source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing that data was becoming very expensive, and this problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google File System), a proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released that year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo was running two clusters of 1,000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data, on a 900-node cluster within 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.

Hadoop Distributed File System: Building Blocks Of hadoop

The Hadoop Distributed File System (HDFS) is the distributed file system of Hadoop. It has a master/slave architecture, consisting of a single NameNode performing the role of master and multiple DataNodes performing the role of slaves.

Both the NameNode and DataNode are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the NameNode and DataNode software.

NameNode

o It is the single master server in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations like opening, renaming and closing files.
o Its single-master design simplifies the architecture of the system.

Secondary Name Node

It is not a failover node for the NameNode.

It is responsible for performing periodic housekeeping functions for the NameNode.

It only creates checkpoints of the file system metadata present in the NameNode.

DataNode

o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks, which are used to store the actual data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion and replication upon instruction from the NameNode, as the sketch after this list illustrates.
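A toy Python sketch of this division of labor (purely illustrative; real HDFS is written in Java and is far more involved). The key point it shows: the NameNode stores only the mapping, never the bytes.

    class NameNode:
        """Master: holds only metadata, the file -> blocks mapping."""
        def __init__(self):
            self.namespace = {}        # file name -> [(block_id, datanode)]

    class DataNode:
        """Slave: holds the actual block data."""
        def __init__(self):
            self.blocks = {}           # block_id -> bytes

    def write_file(namenode, datanodes, name, data,
                   block_size=128 * 1024 * 1024):   # 128 MB, recent HDFS default
        """Split a file into blocks, place each block on a DataNode
        (naive round-robin), and record only the mapping on the NameNode."""
        placed = []
        for i in range(0, len(data), block_size):
            block_id = f"{name}#{i // block_size}"
            node = datanodes[len(placed) % len(datanodes)]
            node.blocks[block_id] = data[i:i + block_size]
            placed.append((block_id, node))
        namenode.namespace[name] = placed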

Job Tracker

o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.

Task Tracker

o It works as a slave node for the Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
Hadoop Operation Modes
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of three supported modes −

 Local/Standalone Mode − After downloading Hadoop to your system, it is configured in standalone mode by default and can be run as a single Java process.
 Pseudo-Distributed Mode − A distributed simulation on a single machine. Each Hadoop daemon, such as HDFS, YARN and MapReduce, runs as a separate Java process. This mode is useful for development.
 Fully Distributed Mode − This mode is fully distributed, with a minimum of two or more machines as a cluster. We will come across this mode in detail in the coming chapters.
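For pseudo-distributed mode, the standard single-node setup from the Hadoop documentation points the default file system at localhost and drops the replication factor to 1, roughly as follows (port 9000 is the conventional choice):

    <!-- core-site.xml -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>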
