RTK Notes m1
Introduction
What is Data?
Data refers to raw facts and figures. Examples include numbers, text, images, and sounds.
Types of Data
• Structured Data:
• Organized in rows and columns (e.g., databases, spreadsheets).
• Easily searchable and analyzable.
• Unstructured Data:
• Information that doesn't reside in a traditional row-column database.
• Examples: social media comments, tweets, shares, posts, YouTube videos users view,
and WhatsApp text messages.
• Semi-Structured Data:
• Contains tags or markers to separate data elements (e.g., XML, JSON).
Big Data
• Definition: Big Data refers to large, complex datasets that traditional data processing
software can't handle.
• Example 1: Social media data, sensor data, e-mails, zipped files, web pages, etc.
• Example 2: Facebook. According to Facebook, its data system processes 500+ terabytes
of data daily. Facebook generates 2.7 billion Like actions per day, and 300 million new
photos are uploaded daily. It has 2.38 billion users. This data supports searching and
recommendation.
Apache Hadoop
• Hadoop is one of the solutions to the Big Data problem.
• Hadoop is an open-source, Java-based software framework developed by the
Apache Software Foundation.
• Hadoop is used for distributed storage and processing of large datasets using a network
of computers.
• Started in 2005, developed by Doug Cutting and Mike Cafarella.
Hadoop Framework
• The Hadoop framework refers to the core components used for distributed storage and
processing of large data sets across clusters of computers. The core components are:
HDFS (Hadoop Distributed File System): used for storing data in the Hadoop cluster.
MapReduce: used for parallel processing of large data sets.
YARN (Yet Another Resource Negotiator): used for resource management.
Hadoop Ecosystem
• The Hadoop Ecosystem includes various other tools that enhance data processing capabilities.
Example: Hive, Pig, HBase, Sqoop, Flume.
HDFS components
The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes.
In a basic design, the NameNode manages all the metadata needed to store and retrieve the actual
data from the DataNodes. The NameNode stores all metadata in memory. No user data is actually
stored on the NameNode.
The slave nodes (DataNodes) are responsible for serving read and write requests from the file
system's clients.
When a client writes data, it first communicates with the NameNode and requests to create a file.
The NameNode determines how many blocks are needed and provides the client with the
DataNodes that will store the data.
As part of the storage process, the data blocks are replicated after they are written to the assigned
node. The NameNode will attempt to write replicas of the data blocks on nodes that are in other
separate racks. After the DataNode acknowledges that the file block replication is complete, the
client closes the file and informs the NameNode that the operation is complete.
The client requests a file from the NameNode, which returns the best DataNodes from which to
read the data. The client then accesses the data directly from the DataNodes. Thus, once the
metadata has been delivered to the client, the NameNode steps back and lets the conversation
between the client and the DataNodes proceed.
While data transfer is progressing, the NameNode also monitors the DataNodes by listening for
heartbeats sent from DataNodes. The lack of a heartbeat signal indicates a node failure. When a
DataNode fails, the NameNode will begin re-replicating the missing blocks.
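The write and read conversations described above are handled for the client by the HDFS client
library. The following is a minimal sketch (not from the original notes) using the Hadoop
FileSystem Java API; the path /user/demo/hello.txt is a made-up example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle that talks to the NameNode

        Path file = new Path("/user/demo/hello.txt");   // example path

        // Write: the client asks the NameNode for target DataNodes, then streams the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.writeUTF("hello hdfs");
        }

        // Read: the NameNode returns block locations; data is fetched directly from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
          System.out.println(in.readUTF());
        }

        fs.close();
      }
    }

Block placement, replication, and heartbeat handling all happen behind these two calls.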
The NameNode keeps its file system metadata in two persistent files:
– fsimage: an image of the file system state when the NameNode was started.
– edits: a series of modifications done to the file system after starting the NameNode; this file
reflects the changes made after the fsimage file was read.
The Secondary NameNode periodically downloads the fsimage and edits files, merges them into
a new fsimage, and uploads the new fsimage file to the NameNode. Thus, when the NameNode restarts,
the fsimage file is up-to-date.
• When HDFS writes a file, it is replicated across the cluster. For Hadoop clusters
containing more than eight DataNodes, the replication value is usually set to 3.
• The HDFS default block size is often 64MB. If a file of size 80MB is written to HDFS, a
64MB block and a 16MB block will be created (a short worked calculation follows this list).
• Figure provides an example of how a file is broken into blocks and replicated across the
cluster. In this case, a replication factor of 3 ensures that any one DataNode can fail and
the replicated blocks will be available on other nodes—and then subsequently re-
replicated on other DataNodes.
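As a quick check of the 80MB example above, the following stand-alone Java sketch (not Hadoop
code) computes the block split and the total number of block replicas, assuming the 64MB block
size and replication factor of 3 mentioned in this list:

    public class BlockMath {
      public static void main(String[] args) {
        long fileSize = 80L * 1024 * 1024;    // 80 MB file
        long blockSize = 64L * 1024 * 1024;   // 64 MB default block size
        int replication = 3;

        long fullBlocks = fileSize / blockSize;              // 1 full 64 MB block
        long remainder = fileSize % blockSize;               // 16 MB left over
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

        System.out.println("Blocks: " + totalBlocks);                               // 2
        System.out.println("Last block (MB): " + remainder / (1024 * 1024));        // 16
        System.out.println("Block replicas stored: " + totalBlocks * replication);  // 6
      }
    }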
Rack Awareness
• Rack awareness is about knowing where data is stored in a Hadoop cluster. It deals with
data locality, which means moving computation to the node where the data resides.
• A Hadoop cluster will exhibit three levels of data locality:
• Data resides on the local machine .
• Data resides in the same rack.
• Data resides in a different rack.
• To protect against failures, the system makes copies of data and stores them across
different racks. So, if one rack fails, the data is still safe and available from another rack,
keeping the system running without losing data.
NameNode High Availability
As shown in Figure, an HA (High Availability) Hadoop cluster has two (or more) separate
NameNode machines. Each machine is configured with exactly the same software.
One of the NameNode machines is in the Active state, and the other is in the Standby state. The
Active NameNode is responsible for all client HDFS operations in the cluster. The Standby
NameNode maintains enough state to provide a fast failover (if required).
Both the Active and Standby NameNodes receive block reports from the DataNodes. The Active
node also sends all file system edits to a quorum of JournalNodes. JournalNodes are systems for
storing edits. At least three physically separate JournalNode machines are required. This design
enables the system to tolerate the failure of a single JournalNode machine.
The Standby node continuously reads the edits from the JournalNodes to ensure its namespace is
synchronized with that of the Active node. In the event of an Active NameNode failure, the
Standby node reads all remaining edits from the JournalNodes before promoting itself to the
Active state.
HDFS Checkpoints
• The NameNode stores the metadata of the HDFS file system in a file called fsimage.
• File system modifications are written to an edits log file, and at startup the NameNode
merges the edits into a new fsimage.
HDFS Backups
• An HDFS BackupNode maintains an up-to-date copy of the metadata both in memory
and on disk.
• The BackupNode does not need to download the fsimage and edits files from the active
NameNode because it already has an up-to-date metadata state in memory.
• A NameNode supports one BackupNode at a time. No CheckpointNodes may be
registered if a BackupNode is in use.
HDFS Snapshots
• HDFS snapshots are similar to backups, but are created with the hdfs dfs -createSnapshot
command (on directories that an administrator has first made snapshottable with hdfs
dfsadmin -allowSnapshot).
• HDFS snapshots are read-only copies of the file system.
• Snapshots can be taken of a sub-tree of the file system or the entire file system.
• Snapshots can be used for data backup. Snapshot creation is instantaneous.
• Blocks on the DataNodes are not copied, because the snapshot files record the block list
and the file size.
• There is no data copying.
******
HDFS user commands
• The preferred way to interact with HDFS is through the hdfs command, which
facilitates navigation within HDFS.
• The following listing presents the full range of options that are available for the hdfs
command. In the next section, only portions of the dfs and dfsadmin options are given.
MapReduce Model
• HDFS distributes and replicates data over multiple data nodes.
• Apache Hadoop MapReduce will try to move the mapping tasks to the data nodes that
contain the data slice. Results from each data slice are then combined in the reducer
step.
• In the MapReduce computation model, there are two stages: a mapping stage and a
reducing stage.
• In the mapping stage, a mapping procedure is applied to input data. The map is usually
some kind of filter or sorting process.
• For instance, assume we need to count how many times the name “Ram" appears in the
novel War and Peace. One solution is to gather 20 friends and give them each a section
of the book to search. This step is the map stage. The reduce phase happens when
everyone is done counting: the totals are summed as each friend reports their count.
• Now consider how this same process could be accomplished using the two simple UNIX
shell scripts shown.
Ram,315
Here, the mapper inputs a text file and then outputs data in a (key, value) pair (token-
name, count) format. The input to the script is the file and the key is Ram. The reducer
script takes these key–value pairs, combines the similar tokens, and counts the total
number of instances. The result is a new key–value pair (token-name, sum).
• Distributed (parallel) implementations of MapReduce enable large amounts of data to be
analyzed quickly. In general, the mapper process is fully scalable and can be applied to
any subset of the input data. Results from multiple parallel mapping functions are then
combined in the reducer phase.
• Hadoop accomplishes parallelism by using a distributed file system (HDFS) to slice and
spread data over multiple data nodes.
• Apache Hadoop MapReduce will try to move the mapping tasks to the data nodes that
contain the data slice. Results from each data slice are then combined in the reducer step.
• Parallel execution of MapReduce requires other steps in addition to the mapper and
reducer processes.
1. Input Splits.
• The default data block size is 64MB. Thus, a 500MB file would be broken into 8 blocks
and written to different machines in the cluster.
• The data are also replicated on multiple machines (typically three machines).
2. Map Step.
• MapReduce will try to execute the mapper on the machines where the block resides.
• Because the file is replicated in HDFS, the least busy node with the data will be chosen.
• If all nodes holding the data are too busy, MapReduce will try to pick a node that is
closest to the node that hosts the data block.
3. Combiner Step.
• An optional combiner performs a local reduction of each mapper's output before the
shuffle, reducing the amount of data that must be transferred (see the Combiner
discussion below).
4. Shuffle Step.
• Before the parallel reduction stage can complete, all similar keys must be combined and
counted by the same reducer process.
• Therefore, results of the map stage must be collected by key–value pairs and shuffled to
the same reducer process.
• If only a single reducer process is used, the shuffle stage is not needed.
5. Reduce Step.
• The final step is the actual reduction. In this stage, the data reduction is performed as per
the programmer’s design.
• The results are written to HDFS. Each reducer will write an output file. For example, a
MapReduce job running four reducers will create files called part-0000, part-0001, part-
0002, and part-0003.
Figure is an example of a simple Hadoop MapReduce data flow for a wordcount program.
The map process counts the words in the split, and the reduce process calculates the total for
each word.
The input to the MapReduce application is the following file in HDFS with three lines of text.
The goal is to count the number of times each word is used.
see spot run
run spot run
see the cat
The first thing MapReduce will do is create the data splits. In this example, each line will be one
split. Since each split requires a map task, there are three mapper processes, each counting the
number of words in its split. Next, similar keys need to be collected and sent to a reducer process.
The shuffle step requires data movement. Once the data have been collected and sorted by key,
the reduction step can begin.
Combiner
A combiner step enables some pre-reduction of the map output data. For instance, in the
previous example, one map produced the following counts:
(run,1)
(spot,1)
(run,1)
As shown in Figure, the count for run can be combined into (run,2) before the shuffle. This
optimization can help minimize the amount of data transfer needed for the shuffle phase.
******
MapReduce Programming
The classic Java WordCount program for Hadoop is shown in the following listing.
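The listing below is a sketch based on the standard Apache Hadoop WordCount example (not
necessarily the exact listing from the original notes); it uses the TokenizerMapper and
IntSumReducer classes described next and also registers IntSumReducer as the combiner discussed
earlier.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sums the counts for each word and emits (word, total).
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner: local pre-reduction of map output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

A job built from this class is typically launched with something like
hadoop jar wordcount.jar WordCount /input /output, where the jar name and paths are only examples.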
Here,
Mapper Class (TokenizerMapper): The mapper reads the input text file line by line, breaks
each line into words, and assigns a count of 1 to each word.
Reducer Class (IntSumReducer): The reducer adds up the counts for each word, giving the total
number of times the word appears in the text file.
Input: provide a text file as input. For example, a file that contains sentences like:
Hadoop is powerful. Hadoop is useful.
Mapper Output: The mapper splits the text into words and pairs each word with the number 1.
So, for the above text, the output would look like this:
(Hadoop, 1), (is, 1), (powerful, 1), (Hadoop, 1), (is, 1), (useful, 1)
Reducer Output: The reducer adds up the 1s for each word. So, the final output would be:
Hadoop 2
is 2
powerful 1
useful 1
Assignment Questions