Introduction to Big Data (21CS753)

Introduction
What is Data?
Data refers to raw facts and figures. Examples include numbers, text, images, and sounds.
Types of Data
• Structured Data:
• Organized in rows and columns (e.g., databases, spreadsheets).
• Easily searchable and analyzable.
• Unstructured Data:
• Unstructured data is information that doesn't reside in a traditional row-column
database.
• Examples: social media comments, tweets, shares, posts, the YouTube videos users view,
and WhatsApp text messages.
• Semi-Structured Data:
• Contains tags or markers to separate data elements (e.g., XML, JSON).


What is Big Data?

• Definition: Big Data refers to large, complex datasets that traditional data processing
software can't handle.
• Example 1: Social media data, sensor data, e-mails, zipped files, web pages, etc.
• Example 2: Facebook. According to Facebook, its data system processes 500+ terabytes
of data daily; it generates 2.7 billion Like actions per day, receives 300 million new
photo uploads daily, and has 2.38 billion users. This data supports searching and recommendations.

Characteristics of Big Data (The 4 Vs)


• Volume: The amount of data generated.
• Velocity: The speed at which data is generated and processed.
• Variety: Different types of data (structured, unstructured, semi-structured).
• Veracity: The trustworthiness and quality of data.

Challenges of Big Data


• Data Storage: Big Data involves a huge amount of information that traditional storage
systems can't handle. We need advanced storage solutions.
• Data Processing: The speed at which data is created is so fast that normal computers
struggle to process it. We use special tools like Hadoop or Spark to process this data
quickly and efficiently.
• Data Analysis: Analyzing vast amounts of structured and unstructured data can be
complex. Advanced algorithms, machine learning models, and powerful computational
resources are essential to extract meaningful insights from Big Data.
• Data Security and Privacy: With big data, keeping it safe from hackers and making sure
personal information is protected is a big challenge. We need strong security measures.

Apache Hadoop
• Hadoop is one of the solutions to the Big Data problem.
• Hadoop is an open-source, Java-based software framework developed by the
Apache Software Foundation.
• Hadoop is used for distributed storage and processing of large datasets using a network
of computers.
• Started in 2005, developed by Doug Cutting and Mike Cafarella.
Hadoop Framework
• The Hadoop framework refers to the core components used for distributed storage and
processing of large data sets across clusters of computers. The core components are:
HDFS (Hadoop Distributed File System): used for storing data in the Hadoop cluster.
MapReduce: used for parallel processing of large data sets.
YARN (Yet Another Resource Negotiator): used for resource management.
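On a running installation, one quick way to see these components in action is to list the Java daemons with the jps tool. The sketch below is only illustrative; the process IDs and the exact set of daemons depend on the cluster setup:

$ jps
2112 NameNode
2245 DataNode
2378 SecondaryNameNode
2503 ResourceManager
2630 NodeManager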


Hadoop Ecosystem
• Hadoop Ecosystem includes various other tools that enhance data processing capabilities.
Example: Hive, Pig, HBase, Sqoop, Flume.

Module 1 Hadoop Distributed File System

Hadoop Distributed File System design features:


• The Hadoop Distributed File System (HDFS) was designed for Big Data processing and is
capable of supporting many users simultaneously. The design assumes a large file write-
once/read-many model. HDFS restricts data writing to one user at a time. All additional
writes are “append-only,” and there is no random writing to HDFS files.
• HDFS is designed for data streaming, where large amounts of data are read from disk in
bulk. The HDFS block size is typically 64MB or 128MB.
• There is no local caching mechanism. The large block and file sizes make it more
efficient to reread data from HDFS than to try to cache the data.
• Hadoop MapReduce moves the computation to the data rather than moving the data to the
computation. That is, converged data storage and processing happen on the same servers
or data nodes.
• A reliable file system maintains multiple copies of data across the cluster. Consequently,
failure of a single node will not bring down the file system.
• A specialized file system is used, which is not designed for general use.
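As a quick check on a running cluster, the configured block size and replication factor can be read back with the hdfs getconf command; the values shown below are only the common defaults and will vary by installation:

$ hdfs getconf -confKey dfs.blocksize
134217728
$ hdfs getconf -confKey dfs.replication
3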

HDFS components
The design of HDFS is based on two types of nodes: NameNode and multiple DataNodes.

In a basic design, NameNode manages all the metadata needed to store and retrieve the actual
data from the DataNodes. The NameNode stores all metadata in memory. No data is actually
stored on the NameNode.

The design is a master/slave architecture in which the master (NameNode) regulates access to files
by clients. File system operations such as opening, closing, and renaming files and directories are
all managed by the NameNode. The NameNode also determines the mapping of blocks to
DataNodes and handles DataNode failures. The NameNode manages block creation, deletion, and
replication.


The slaves (DataNodes) are responsible for serving read and write requests from the file system's
clients.

An example of the client/NameNode/DataNode interaction is provided in the figure.

When a client writes data, it first communicates with the NameNode and requests to create a file.
The NameNode determines how many blocks are needed and provides the client with the
DataNodes that will store the data.

As part of the storage process, the data blocks are replicated after they are written to the assigned
node. The NameNode will attempt to write replicas of the data blocks on nodes that are in other
separate racks. After the DataNode acknowledges that the file block replication is complete, the
client closes the file and informs the NameNode that the operation is complete.

The client requests a file from the NameNode, which returns the best DataNodes from which to
read the data. The client then accesses the data directly from the DataNodes. Thus, once the
metadata has been delivered to the client, the NameNode steps back and lets the conversation
between the client and the DataNodes proceed.
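The block-to-DataNode mapping that the NameNode hands out can also be inspected from the command line with the fsck tool; a minimal sketch, assuming a file /user/hdfs/stuff/test already exists in HDFS:

$ hdfs fsck /user/hdfs/stuff/test -files -blocks -locations

The report lists each block of the file along with the DataNodes holding its replicas.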

While data transfer is progressing, the NameNode also monitors the DataNodes by listening for
heartbeats sent from DataNodes. The lack of a heartbeat signal indicates a node failure. When a
DataNode fails, the NameNode will begin re-replicating the missing blocks.

In almost all Hadoop deployments, there is a SecondaryNameNode (CheckpointNode). The
purpose of this node is to perform periodic checkpoints that evaluate the status of the
NameNode. The NameNode keeps two disk files that track changes to the metadata:


– An image of the file system state when the NameNode was started. This file is
fsimage.
– A series of modifications made to the file system after the NameNode was started. This is the
edits file, and it reflects the changes made after the fsimage file was read.

The SecondaryNameNode periodically downloads the fsimage and edits files, merges them into a new
fsimage, and uploads the new fsimage file to the NameNode. Thus, when the NameNode restarts,
the fsimage file is up to date.
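Both files can be examined offline with Hadoop's image and edits viewers (hdfs oiv and hdfs oev); a minimal sketch, where the file names under the NameNode's current directory are only examples:

$ hdfs oiv -p XML -i /var/hadoop/dfs/name/current/fsimage_0000000000000001234 -o /tmp/fsimage.xml
$ hdfs oev -p XML -i /var/hadoop/dfs/name/current/edits_0000000000000001235-0000000000000001300 -o /tmp/edits.xml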

Finally, the various roles in HDFS can be summarized as follows:

• HDFS uses a master/slave architecture designed for large file reading and streaming.


• The NameNode is a metadata server or “data traffic cop.”
• HDFS provides a single namespace that is managed by the NameNode.
• Data is redundantly stored on DataNodes; there is no data on the NameNode.
• The SecondaryNameNode performs checkpoints of the NameNode file system’s state
but is not a failover node.

HDFS Block Replication

• When HDFS writes a file, it is replicated across the cluster. For Hadoop clusters
containing more than eight DataNodes, the replication value is usually set to 3.
• The HDFS default block size is often 64MB. If a file of size 80MB is written to HDFS, a
64MB block and a 16MB block will be created.
• Figure provides an example of how a file is broken into blocks and replicated across the
cluster. In this case, a replication factor of 3 ensures that any one DataNode can fail and
the replicated blocks will be available on other nodes—and then subsequently re-
replicated on other DataNodes.

Figure: HDFS block replication example
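The replication factor can also be changed for existing files or directories with the setrep option; a minimal sketch, where the path is only an example (-w waits until the new replication level is reached):

$ hdfs dfs -setrep -w 2 /user/hdfs/stuff/test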


HDFS Safe Mode


• When the NameNode starts, it enters a read-only safe mode where blocks cannot be
replicated or deleted. Safe Mode enables the NameNode to perform two important
processes:
• The previous file system state is reconstructed by loading the fsimage file into memory
and replaying the edit log.
• The mapping between blocks and data nodes is created by waiting for enough of the
DataNodes to register so that at least one copy of the data is available.
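Safe mode can also be queried and controlled by an administrator with the dfsadmin command:

$ hdfs dfsadmin -safemode get
Safe mode is OFF
$ hdfs dfsadmin -safemode enter
Safe mode is ON
$ hdfs dfsadmin -safemode leave
Safe mode is OFF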

Rack Awareness

• Rack awareness is about knowing where data is stored in a Hadoop cluster. It deals with
data locality, which means moving the computation to the node where the data resides.
• A Hadoop cluster will exhibit three levels of data locality:
• Data resides on the local machine.
• Data resides in the same rack.
• Data resides in a different rack.
• To protect against failures, the system makes copies of data and stores them across
different racks. So, if one rack fails, the data is still safe and available from another rack,
keeping the system running without losing data.
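Hadoop does not discover rack locations on its own; they are usually supplied through a site-specific topology script referenced by the net.topology.script.file.name property. A minimal sketch of such a script, with made-up subnet-to-rack assignments:

#!/bin/bash
# topology.sh - print one rack name for every host/IP Hadoop passes as an argument
for host in "$@"; do
  case $host in
    10.0.1.*) echo "/rack1" ;;
    10.0.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done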

NameNode High Availability

As shown in Figure, an HA Hadoop cluster has two (or more) separate NameNode machines.
Each machine is configured with exactly the same software.

One of the NameNode machines is in the Active state, and the other is in the Standby state. The
Active NameNode is responsible for all client HDFS operations in the cluster. The Standby
NameNode maintains enough state to provide a fast failover (if required).


Both the Active and Standby NameNodes receive block reports from the DataNodes. The Active
node also sends all file system edits to a quorum of JournalNodes. JournalNodes are systems for
storing the edits. At least three physically separate JournalNode machines are required. This design
enables the system to tolerate the failure of a single JournalNode machine.

The Standby node continuously reads the edits from the JournalNodes to ensure its namespace is
synchronized with that of the Active node. In the event of an Active NameNode failure, the
Standby node reads all remaining edits from the JournalNodes before promoting itself to the
Active state.
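The state of each NameNode in an HA pair can be checked, and a manual failover triggered, with the haadmin tool; a minimal sketch, assuming the NameNodes are registered under the service IDs nn1 and nn2 (these IDs are site-specific configuration):

$ hdfs haadmin -getServiceState nn1
active
$ hdfs haadmin -getServiceState nn2
standby
# manually make nn2 the Active NameNode
$ hdfs haadmin -failover nn1 nn2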

HDFS NameNode Federation


Another important feature of HDFS is NameNode Federation. Federation adds support for multiple
NameNodes/namespaces in the HDFS file system. The key benefits are as follows:
• Namespace scalability. HDFS cluster storage scales horizontally without placing a
burden on the NameNode.
• Better performance. Adding more NameNodes to the cluster scales the file system
read/write operations throughput by separating the total namespace.
• System isolation. Multiple NameNodes enable different categories of applications to be
distinguished, and users can be isolated to different namespaces.
Figure illustrates how HDFS NameNode Federation is accomplished.
• NameNode1 manages the research and marketing namespaces, and NameNode2 manages
the data and project namespaces. The NameNodes do not communicate with each other,
and the DataNodes “just store data blocks” as directed by either NameNode.
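The NameNodes configured for a cluster (federated or HA) can be listed from any node with the getconf option; the host names below are only illustrative:

$ hdfs getconf -namenodes
nn1.example.com nn2.example.com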

HDFS Checkpoints
• The NameNode stores the metadata of the HDFS file system in a file called fsimage.
• File system modifications are written to an edits log file, and at startup the NameNode
merges the edits into a new fsimage.


• The SecondaryNameNode or CheckpointNode periodically fetches the edits from the
NameNode, merges them, and returns an updated fsimage to the NameNode.

HDFS Backups
• An HDFS BackupNode maintains an up-to-date copy of the metadata both in memory
and on disk.
• The BackupNode does not need to download the fsimage and edits files from the active
NameNode because it already has an up-to-date metadata state in memory.
• A NameNode supports one BackupNode at a time. No CheckpointNodes may be
registered if a BackupNode is in use.
HDFS Snapshots
• HDFS snapshots are similar to backups, but are created by administrators using the hdfs
dfs -createSnapshot command.
• HDFS snapshots are read-only copies of the file system.
• Snapshots can be taken of a sub-tree of the file system or of the entire file system.
• Snapshots can be used for data backup. Snapshot creation is instantaneous.
• Blocks on the DataNodes are not copied, because the snapshot files record only the block list
and the file size.
• There is no data copying.
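A minimal sketch of the snapshot workflow, assuming a directory /user/hdfs/stuff (the path and snapshot name are only examples):

# allow snapshots on the directory (administrator action)
$ hdfs dfsadmin -allowSnapshot /user/hdfs/stuff
# create a read-only snapshot named snap1
$ hdfs dfs -createSnapshot /user/hdfs/stuff snap1
# snapshots appear under the hidden .snapshot directory
$ hdfs dfs -ls /user/hdfs/stuff/.snapshot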
******
HDFS user commands
• The preferred way to interact with HDFS is through the hdfs command, which facilitates
navigation within HDFS.
• The hdfs command offers a wide range of options; in the following sections, only portions of
the dfs and dfsadmin options are covered.

General HDFS Commands


The version of HDFS can be found from the version option.
$ hdfs version
Hadoop 2.6.0.2.2.4.2-2


List Files in HDFS


To list the files in the root HDFS directory, enter the following command:
$ hdfs dfs -ls /
Output:
Found 2 items
drwxrwxrwx - yarn hadoop 0 2015-04-29 16:52 /app-logs
drwxr-xr-x - hdfs hdfs 0 2015-04-21 14:28 /apps

To list files in our home directory, enter the following command:


Syntax: $ hdfs dfs -ls
Output:
Found 2 items
drwxr-xr-x - hdfs hdfs 0 2015-05-24 20:06 bin
drwxr-xr-x - hdfs hdfs 0 2015-04-29 16:52

Make a Directory in HDFS


To make a directory in HDFS, use the following command.
$ hdfs dfs -mkdir stuff
Copy Files to HDFS
• To copy a file from your current local directory into HDFS, use the following command.
• If a full path is not supplied, your home directory is assumed.
• In this case, the file test is placed in the directory stuff that was created previously.
$ hdfs dfs -put test stuff
• The file transfer can be confirmed by using the -ls command:
$ hdfs dfs -ls stuff
Found 1 items
-rw-r--r-- 2 hdfs hdfs 12857 2015-05-29 13:12 stuff/test
Copy Files from HDFS
• Files can be copied back to your local file system using the following command.
• In this case, the file test from HDFS will be copied back to the current local directory
with the name test-local.
$ hdfs dfs -get stuff/test test-local
Copy Files within HDFS
The following command will copy a file in HDFS
$ hdfs dfs -cp stuff/test test.hdfs
Delete a File within HDFS
The following command will delete the HDFS file test.hdfs
$ hdfs dfs -rm test.hdfs
Delete a Directory in HDFS
The following command will delete the HDFS directory stuff and all its contents:
$ hdfs dfs -rm -r -skipTrash stuff
Deleted stuff


Get an HDFS Status Report


An HDFS status report can be obtained using the following command. Those with HDFS administrator privileges
will get a full report. Note that this command uses dfsadmin instead of dfs to invoke administrative
commands.
$ hdfs dfsadmin -report
Configured Capacity: 1503409881088 (1.37 TB)
Present Capacity: 1407945981952 (1.28 TB)
DFS Remaining: 1255510564864 (1.14 TB)
DFS Used: 152435417088 (141.97 GB)
DFS Used%: 10.83%
Under replicated blocks: 54
Blocks with corrupt replicas: 0
Missing blocks: 0
******
Hadoop MapReduce Framework

MapReduce Model
• HDFS distributes and replicates data over multiple data nodes.
• Apache Hadoop MapReduce will try to move the mapping tasks to the data nodes that
contain the data slice. Results from each data slice are then combined in the reducer
step.
• In the MapReduce computation model, there are two stages: a mapping stage and a
reducing stage.

• In the mapping stage, a mapping procedure is applied to input data. The map is usually
some kind of filter or sorting process.
• For instance, assume we need to count how many times the name “Ram” appears in the
novel War and Peace. One solution is to gather 20 friends and give them each a section
of the book to search. This step is the map stage. The reduce phase happens when
everyone is done counting and we sum the totals reported by each friend.
• Now consider how this same process could be accomplished using two simple UNIX
shell scripts (a sketch of these scripts is given after the explanation below).

$ cat war-and-peace.txt | ./mapper.sh | ./reducer.sh


Ram,315

Here, the mapper inputs a text file and then outputs data in a (key, value) pair (token-
name, count) format. The input to the script is the file, and the key is Ram. The reducer
script takes these key–value pairs, combines the similar tokens, and counts the total
number of instances. The result is a new key–value pair (token-name, sum).
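A minimal sketch of what mapper.sh and reducer.sh might look like for this example (the grep/sed pipeline and the hard-coded token Ram are assumptions, not prescribed implementations):

mapper.sh:
#!/bin/bash
# emit a "Ram,1" pair for every occurrence of the token Ram read from stdin
grep -o 'Ram' | sed 's/$/,1/'

reducer.sh:
#!/bin/bash
# sum the counts of the (token,count) pairs read from stdin
count=0
while IFS=',' read -r token n; do
  count=$((count + n))
done
echo "Ram,$count"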
• Distributed (parallel) implementations of MapReduce enable large amounts of data to be
analyzed quickly. In general, the mapper process is fully scalable and can be applied to
any subset of the input data. Results from multiple parallel mapping functions are then
combined in the reducer phase.
• Hadoop accomplishes parallelism by using a distributed file system (HDFS) to slice and
spread data over multiple data nodes.

MapReduce Parallel Data Flow


• HDFS distributes and replicates data over multiple data nodes.

• Apache Hadoop MapReduce will try to move the mapping tasks to the data nodes that
contain the data slice. Results from each data slice are then combined in the reducer step.

• Parallel execution of MapReduce requires other steps in addition to the mapper and
reducer processes.

• The basic steps are as follows:

1. Input Splits.

• HDFS distributes and replicates data over multiple servers.

• The default data block size is 64MB. Thus, a 500MB file would be broken into 8 blocks
and written to different machines in the cluster.

• The data are also replicated on multiple machines (typically three machines).

2. Map Step.

• The user provides the specific mapping process.

• MapReduce will try to execute the mapper on the machines where the block resides.

• Because the file is replicated in HDFS, the least busy node with the data will be chosen.

• If all nodes holding the data are too busy, MapReduce will try to pick a node that is
closest to the node that hosts the data block.


3. Combiner Step.

• It is possible to provide an optimization or pre-reduction as part of the map stage where


key–value pairs are combined prior to the next stage.

• The combiner stage is optional.

4. Shuffle Step.

• Before the parallel reduction stage can complete, all similar keys must be combined and
counted by the same reducer process.

• Therefore, results of the map stage must be collected by key–value pairs and shuffled to
the same reducer process.

• If only a single reducer process is used, the shuffle stage is not needed.

5. Reduce Step.

• The final step is the actual reduction. In this stage, the data reduction is performed as per
the programmer’s design.

• The results are written to HDFS. Each reducer will write an output file. For example, a
MapReduce job running four reducers will create files called part-0000, part-0001, part-
0002, and part-0003.

The figure is an example of a simple Hadoop MapReduce data flow for a wordcount program.
The map process counts the words in each split, and the reduce process calculates the total for
each word.


The input to the MapReduce application is the following file in HDFS with three lines of text.
The goal is to count the number of times each word is used.
see spot run
run spot run
see the cat
The first thing MapReduce will do is create the data splits. In the example, each line will be one split.
Since each split requires a map task, there are three mapper processes that count the number
of words in the split. Next, similar keys need to be collected and sent to a reducer process. The
shuffle step requires data movement. Once the data have been collected and sorted by key, the
reduction step can begin.

Combiner: A combiner step enables some pre-reduction of the map output data. For instance, in
the previous example, one map produced the following counts:
(run,1)
(spot,1)
(run,1)
As shown in the figure, the count for run can be combined into (run,2) before the shuffle. This
optimization can help minimize the amount of data transfer needed for the shuffle phase.


Figure: Adding a combiner process to the map step in MapReduce

• The following figure shows a simple three-node MapReduce process. Once the mapping is
complete, the same nodes begin the reduce process. The shuffle stage makes sure the
necessary data are sent to each reducer.
• Also note that there is no requirement that all the mappers complete at the same time or
that the mapper on a specific node be complete before a reducer is started. Reducers can
be set to start shuffling based on a percentage of mappers that have finished.

Figure: Process placement during MapReduce

******


MapReduce Programming
The classic Java WordCount program for Hadoop is shown in the following listing.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional map-side pre-reduction
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Here,

Mapper Class (TokenizerMapper): The mapper reads the input text file line by line, breaks
each line into words, and assigns a count of 1 to each word.

Reducer Class (IntSumReducer): The reducer adds up the counts for each word, giving the total
number of times the word appears in the text file.

Main Method: This is the starting point of the program, where the job is configured and submitted.


How the Program Runs

Input: Provide a text file as input. For example, a file that contains sentences like:
Hadoop is powerful. Hadoop is useful.
Mapper Output: The mapper splits the text into words and pairs each word with the number 1.
So, for the above text, the output would look like this:
(Hadoop, 1), (is, 1), (powerful, 1), (Hadoop, 1), (is, 1), (useful, 1)
Reducer Output: The reducer adds up the 1s for each word. So, the final output would be:
Hadoop 2
is 2
powerful 1
useful 1
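For reference, a sketch of how such a program is typically compiled and submitted from the command line; the jar name and HDFS paths are only examples, and the reducer output file is usually named part-r-00000:

$ javac -classpath $(hadoop classpath) WordCount.java
$ jar cf wc.jar WordCount*.class
$ hadoop jar wc.jar WordCount /user/hdfs/input /user/hdfs/output
$ hdfs dfs -cat /user/hdfs/output/part-r-00000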

Assignment Questions

1. What is Big Data? Give an example.


2. Explain the design features of Hadoop Distributed File System (HDFS).
3. Describe the main components of HDFS.
4. Explain the MapReduce parallel data flow steps. / Describe the MapReduce programming
model. / Explain the MapReduce framework.
5. What are some common HDFS user commands? Explain their purposes.
6. Illustrate the process of HDFS block replication.
7. Write a note on HDFS Safe Mode and Rack Awareness
8. Explain i) NameNode High Availability
ii) HDFS NameNode Federation
iii) HDFS Checkpoints and Backups
