RTK Notes m1
Introduction
What is Data?
Data refers to raw facts and figures. Examples include numbers, text, images, and sounds.
Types of Data
• Structured Data:
• Organized in rows and columns (e.g., databases, spreadsheets).
• Easily searchable and analyzable.
• Unstructured Data:
• Information that doesn't reside in a traditional row-column database.
• Examples: social media comments, tweets, shares, posts, YouTube videos users view,
and WhatsApp text messages.
• Semi-Structured Data:
• Contains tags or markers to separate data elements (e.g., XML, JSON).
Big Data
• Definition: Big Data refers to large, complex datasets that traditional data processing
software can't handle.
• Example 1: Social media data, sensor data, e-mails, zipped files, web pages, etc.
• Example 2: Facebook. According to Facebook, its data system processes 500+ terabytes
of data daily. Facebook generates 2.7 billion Like actions per day, and 300 million new
photos are uploaded daily. It has 2.38 billion users. This data supports searching and
recommendation.
Apache Hadoop
• Hadoop is one of the solutions to the Big Data problem.
• Hadoop is an open-source, Java-based software framework developed by the
Apache Software Foundation.
• Hadoop is used for distributed storage and processing of large datasets using a network
of computers.
• Started in 2005, developed by Doug Cutting and Mike Cafarella.
Hadoop Framework
• The Hadoop framework refers to the core components used for distributed storage and
processing of large data sets across clusters of computers. The core components are:
HDFS (Hadoop Distributed File System): used for storing data in the Hadoop cluster.
MapReduce: used for parallel processing of large data sets.
YARN (Yet Another Resource Negotiator): used for resource management.
Hadoop Ecosystem
• The Hadoop Ecosystem includes various other tools that enhance data processing capabilities.
Example: Hive, Pig, HBase, Sqoop, Flume.
HDFS components
The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes.
In a basic design, the NameNode manages all the metadata needed to store and retrieve the actual
data from the DataNodes. The NameNode stores all metadata in memory. No user data is actually
stored on the NameNode.
The slave nodes (DataNodes) are responsible for serving read and write requests from the file
system's clients.
When a client writes data, it first communicates with the NameNode and requests to create a file.
The NameNode determines how many blocks are needed and provides the client with the
DataNodes that will store the data.
As part of the storage process, the data blocks are replicated after they are written to the assigned
node. The NameNode will attempt to write replicas of the data blocks on nodes that are in other
separate racks. After the DataNode acknowledges that the file block replication is complete, the
client closes the file and informs the NameNode that the operation is complete.
The client requests a file from the NameNode, which returns the best DataNodes from which to
read the data. The client then accesses the data directly from the DataNodes. Thus, once the
metadata has been delivered to the client, the NameNode steps back and lets the conversation
between the client and the DataNodes proceed.
While data transfer is progressing, the NameNode also monitors the DataNodes by listening for
heartbeats sent from DataNodes. The lack of a heartbeat signal indicates a node failure. When a
DataNode fails, the NameNode will begin re-replicating the missing blocks.
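The write and read conversations described above are handled for the client by the HDFS client
library. The following is a minimal sketch (not from the original notes) using the Hadoop
FileSystem Java API; the path /user/demo/hello.txt is a made-up example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle that talks to the NameNode

        Path file = new Path("/user/demo/hello.txt");   // example path

        // Write: the client asks the NameNode for target DataNodes, then streams the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.writeUTF("hello hdfs");
        }

        // Read: the NameNode returns block locations; data is fetched directly from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
          System.out.println(in.readUTF());
        }

        fs.close();
      }
    }

Block placement, replication, and heartbeat handling all happen behind these two calls.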
The NameNode keeps its file system metadata in two persistent files:
– fsimage: an image of the file system state when the NameNode was started.
– edits: a series of modifications done to the file system after starting the NameNode; this file
reflects the changes made after the fsimage file was read.
The Secondary NameNode periodically downloads the fsimage and edits files, merges them into
a new fsimage, and uploads the new fsimage file to the NameNode. Thus, when the NameNode restarts,
the fsimage file is up-to-date.
• When HDFS writes a file, it is replicated across the cluster. For Hadoop clusters
containing more than eight DataNodes, the replication value is usually set to 3.
• The HDFS default block size is often 64MB. If a file of size 80MB is written to HDFS, a
64MB block and a 16MB block will be created (a short worked calculation follows this list).
• Figure provides an example of how a file is broken into blocks and replicated across the
cluster. In this case, a replication factor of 3 ensures that any one DataNode can fail and
the replicated blocks will be available on other nodes—and then subsequently re-
replicated on other DataNodes.
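As a quick check of the 80MB example above, the following stand-alone Java sketch (not Hadoop
code) computes the block split and the total number of block replicas, assuming the 64MB block
size and replication factor of 3 mentioned in this list:

    public class BlockMath {
      public static void main(String[] args) {
        long fileSize = 80L * 1024 * 1024;    // 80 MB file
        long blockSize = 64L * 1024 * 1024;   // 64 MB default block size
        int replication = 3;

        long fullBlocks = fileSize / blockSize;              // 1 full 64 MB block
        long remainder = fileSize % blockSize;               // 16 MB left over
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

        System.out.println("Blocks: " + totalBlocks);                               // 2
        System.out.println("Last block (MB): " + remainder / (1024 * 1024));        // 16
        System.out.println("Block replicas stored: " + totalBlocks * replication);  // 6
      }
    }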
Rack Awareness
• Rack awareness is about knowing where data is stored in a Hadoop cluster. It deals with
data locality, which means moving computation to the node where the data resides.
• A Hadoop cluster will exhibit three levels of data locality:
• Data resides on the local machine .
• Data resides in the same rack.
• Data resides in a different rack.
• To protect against failures, the system makes copies of data and stores them across
different racks. So, if one rack fails, the data is still safe and available from another rack,
keeping the system running without losing data.
NameNode High Availability
As shown in Figure, an HA (High Availability) Hadoop cluster has two (or more) separate
NameNode machines. Each machine is configured with exactly the same software.
One of the NameNode machines is in the Active state, and the other is in the Standby state. The
Active NameNode is responsible for all client HDFS operations in the cluster. The Standby
NameNode maintains enough state to provide a fast failover (if required).
Both the Active and Standby NameNodes receive block reports from the DataNodes. The Active
node also sends all file system edits to a quorum of JournalNodes. JournalNodes are systems for
storing edits. At least three physically separate JournalNode machines are required. This design
enables the system to tolerate the failure of a single JournalNode machine.
The Standby node continuously reads the edits from the JournalNodes to ensure its namespace is
synchronized with that of the Active node. In the event of an Active NameNode failure, the
Standby node reads all remaining edits from the JournalNodes before promoting itself to the
Active state.
HDFS Checkpoints
• The NameNode stores the metadata of the HDFS file system in a file called fsimage.
• File system modifications are written to an edits log file, and at startup the NameNode
merges the edits into a new fsimage.
HDFS Backups
• An HDFS BackupNode maintains an up-to-date copy of the metadata both in memory
and on disk.
• The BackupNode does not need to download the fsimage and edits files from the active
NameNode because it already has an up-to-date metadata state in memory.
• A NameNode supports one BackupNode at a time. No CheckpointNodes may be
registered if a BackupNode is in use.
HDFS Snapshots
• HDFS snapshots are similar to backups, but are created with the hdfs dfs -createSnapshot
command (on directories that an administrator has first made snapshottable with hdfs
dfsadmin -allowSnapshot).
• HDFS snapshots are read-only copies of the file system.
• Snapshots can be taken of a sub-tree of the file system or the entire file system.
• Snapshots can be used for data backup. Snapshot creation is instantaneous.
• Blocks on the DataNodes are not copied, because the snapshot files record the block list
and the file size.
• There is no data copying.
******
HDFS user commands
• The preferred way to interact with HDFS is through the hdfs command, which
facilitates navigation within HDFS.
• The following listing presents the full range of options that are available for the hdfs
command. In the next section, only portions of the dfs and dfsadmin options are given.
MapReduce Model
• HDFS distributes and replicates data over multiple data nodes.
• Apache Hadoop MapReduce will try to move the mapping tasks to the data nodes that
contain the data slice. Results from each data slice are then combined in the reducer
step.
• In the MapReduce computation model, there are two stages: a mapping stage and a
reducing stage.
• In the mapping stage, a mapping procedure is applied to input data. The map is usually
some kind of filter or sorting process.
• For instance, assume we need to count how many times the name “Ram" appears in the
novel War and Peace. One solution is to gather 20 friends and give them each a section
of the book to search. This step is the map stage. The reduce phase happens when
everyone is done counting: the totals are summed as each friend reports their count.
• Now consider how this same process could be accomplished using the two simple UNIX
shell scripts shown.
Ram,315
Here, the mapper inputs a text file and then outputs data in a (key, value) pair (token-
name, count) format. The input to the script is the file and the key is Ram. The reducer
script takes these key–value pairs, combines the similar tokens, and counts the total
number of instances. The result is a new key–value pair (token-name, sum).
• Distributed (parallel) implementations of MapReduce enable large amounts of data to be
analyzed quickly. In general, the mapper process is fully scalable and can be applied to
any subset of the input data. Results from multiple parallel mapping functions are then
combined in the reducer phase.
• Hadoop accomplishes parallelism by using a distributed file system (HDFS) to slice and
spread data over multiple data nodes.
• Apache Hadoop MapReduce will try to move the mapping tasks to the data nodes that
contain the data slice. Results from each data slice are then combined in the reducer step.
• Parallel execution of MapReduce requires other steps in addition to the mapper and
reducer processes.
1. Input Splits.
• The default data block size is 64MB. Thus, a 500MB file would be broken into 8 blocks
and written to different machines in the cluster.
• The data are also replicated on multiple machines (typically three machines).
2. Map Step.
• MapReduce will try to execute the mapper on the machines where the block resides.
• Because the file is replicated in HDFS, the least busy node with the data will be chosen.
• If all nodes holding the data are too busy, MapReduce will try to pick a node that is
closest to the node that hosts the data block.
3. Combiner Step.
• An optional combiner performs a local reduction of each mapper's output before the
shuffle, reducing the amount of data that must be transferred (see the Combiner
discussion below).
4. Shuffle Step.
• Before the parallel reduction stage can complete, all similar keys must be combined and
counted by the same reducer process.
• Therefore, results of the map stage must be collected by key–value pairs and shuffled to
the same reducer process.
• If only a single reducer process is used, the shuffle stage is not needed.
5. Reduce Step.
• The final step is the actual reduction. In this stage, the data reduction is performed as per
the programmer’s design.
• The results are written to HDFS. Each reducer will write an output file. For example, a
MapReduce job running four reducers will create files called part-0000, part-0001, part-
0002, and part-0003.
Figure is an example of a simple Hadoop MapReduce data flow for a wordcount program.
The map process counts the words in the split, and the reduce process calculates the total for
each word.
The input to the MapReduce application is the following file in HDFS with three lines of text.
The goal is to count the number of times each word is used.
see spot run
run spot run
see the cat
The first thing MapReduce will do is create the data splits. In this example, each line will be one
split. Since each split requires a map task, there are three mapper processes, each counting the
number of words in its split. Next, similar keys need to be collected and sent to a reducer process.
The shuffle step requires data movement. Once the data have been collected and sorted by key,
the reduction step can begin.
Combiner
A combiner step enables some pre-reduction of the map output data. For instance, in the
previous example, one map produced the following counts:
(run,1)
(spot,1)
(run,1)
As shown in Figure, the count for run can be combined into (run,2) before the shuffle. This
optimization can help minimize the amount of data transfer needed for the shuffle phase.
******
MapReduce Programming
The classic Java WordCount program for Hadoop is shown in the following listing.
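The listing below is a sketch based on the standard Apache Hadoop WordCount example (not
necessarily the exact listing from the original notes); it uses the TokenizerMapper and
IntSumReducer classes described next and also registers IntSumReducer as the combiner discussed
earlier.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sums the counts for each word and emits (word, total).
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner: local pre-reduction of map output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

A job built from this class is typically launched with something like
hadoop jar wordcount.jar WordCount /input /output, where the jar name and paths are only examples.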
Here,
Mapper Class (TokenizerMapper): The mapper reads the input text file line by line, breaks
each line into words, and assigns a count of 1 to each word.
Reducer Class (IntSumReducer): The reducer adds up the counts for each word, giving the total
number of times the word appears in the text file.
Input: provide a text file as input. For example, a file that contains sentences like:
Hadoop is powerful. Hadoop is useful.
Mapper Output: The mapper splits the text into words and pairs each word with the number 1.
So, for the above text, the output would look like this:
(Hadoop, 1), (is, 1), (powerful, 1), (Hadoop, 1), (is, 1), (useful, 1)
Reducer Output: The reducer adds up the 1s for each word. So, the final output would be:
Hadoop 2
is 2
powerful 1
useful 1
Assignment Questions