Big Data Unit 2
WHAT IS HADOOP
Hadoop is an open source, Java-based programming framework that supports the
processing and storage of extremely large data sets in a distributed computing
environment. It is part of the Apache project sponsored by the Apache Software
Foundation.
It can store huge amounts of data from a variety of sources.
Larger tasks are divided into smaller tasks, which are processed and analyzed in parallel and later combined to produce the final result.
Hadoop Ecosystem
• The Hadoop Ecosystem is a collection of tools that provides services such as ingestion, storage, processing and analysis of big data. Its main components are described below.
Core components of Hadoop
The core of Apache Hadoop consists of a storage part, known as Hadoop
Distributed File System (HDFS), and a processing part which is a MapReduce
programming model.
• Hadoop Distributed File System (HDFS) – a distributed file-system that
stores data on commodity machines, providing very high aggregate
bandwidth across the cluster.
• Unlike many other distributed systems, HDFS is highly fault tolerant and designed to run on low-cost commodity hardware.
• HDFS holds very large amounts of data and provides easy access to it.
• To store such huge data, the files are stored across multiple machines.
• HDFS also makes the data available for parallel processing by applications.
HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS (a Java FileSystem sketch of the same operations follows this list).
• HDFS provides file permissions and authentication.
• HDFS has two types of nodes:
• NameNode and
• DataNode.
• The NameNode acts as the master node and maintains metadata about the DataNodes and the blocks stored in HDFS. It regularly checks on the DataNodes to verify whether they are alive or not.
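The slides mention HDFS's command interface; the snippet below is a minimal sketch of the equivalent operations through the Java FileSystem API. The NameNode address hdfs://namenode-host:8020 and the /user/demo paths are illustrative assumptions, not values from the slides.

// Hedged sketch: creating a directory and listing files in HDFS via the
// Java FileSystem API, the programmatic counterpart of the `hdfs dfs`
// command interface mentioned above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in a real cluster this normally comes
        // from core-site.xml rather than being hard-coded.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);

        // Create a directory and list the contents of the user's area.
        fs.mkdirs(new Path("/user/demo/input"));
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}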
Hadoop MapReduce
• MapReduce is a software framework in which applications can be written in different programming languages to process large data sets. It is based on the principle of Map (distribute) and Reduce (aggregate).
• It is an implementation of the MapReduce programming model for large-scale data processing (a word-count sketch is given below).
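The following is a minimal word-count sketch of the Map and Reduce principle just described, written against the standard Hadoop MapReduce Java API. The class names and the command-line input/output paths are illustrative.

// Map: emit (word, 1) for every word; Reduce: sum the counts per word.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line into words and emit (word, 1).
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Aggregate all counts emitted for the same word.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}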
YARN (Yet Another Resource Negotiator):
YARN is the resource-management and job-scheduling layer of Hadoop, introduced in Hadoop 2.0; it allocates cluster resources to applications and schedules their tasks.
PIG
Pig was initially developed by Yahoo. It performs the extraction, transformation, processing and analysis of huge data. Pig has two components:
o Pig Latin, the scripting language in which data flows are expressed.
o The Pig Runtime, which provides the execution environment for Pig Latin.
HIVE
Hive is used for the analysis of big data sets using HQL (Hive Query Language). HQL is the query language of Hive and supports the common SQL data types. Hive is primarily suited to batch-style analytical processing rather than low-latency, real-time applications. Hive has two components (a JDBC sketch follows):
o The Hive Command Line, which is used to execute Hive commands.
o The JDBC/ODBC Driver, which provides the connection for data processing.
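The snippet below is a minimal sketch of querying Hive through the JDBC driver mentioned above. The HiveServer2 host (hiveserver-host), port 10000, the default database, the sales table with region/amount columns, and the credentials are all illustrative assumptions.

// Hedged sketch: running an HQL query over JDBC against HiveServer2.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Explicitly load the Hive JDBC driver (newer drivers register themselves).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver-host:10000/default"; // assumed host/port

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // A simple HQL aggregation over an assumed `sales` table.
            ResultSet rs = stmt.executeQuery(
                "SELECT region, SUM(amount) FROM sales GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
        }
    }
}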
Other Hadoop Components
Mahout: It provides environment for creating machine learning applications for analysis using
clustering, filtering and classification.
HBase: an open-source NoSQL database that runs on top of HDFS. It is written in Java.
Zookeeper: a centralized service that provides coordination between distributed applications.
Apache Flume: a data-ingestion tool used to bring unstructured and semi-structured data, such as streaming logs and events, into HDFS.
Apache Sqoop: a data-ingestion tool used to transfer structured data between relational databases or data warehouses and HDFS, in both directions.
HDFS Architecture
HDFS follows the master–slave architecture and it has the following elements:
• Namenode
• Datanode
• Block
• Generally, the user data is stored in the files of HDFS. A file is divided into one or more segments, which are stored in individual DataNodes. These file segments are called blocks; in other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB in Hadoop 1.x (128 MB from Hadoop 2.x onwards), and it can be increased as needed by changing the HDFS configuration (see the sketch below).
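As a minimal sketch of the configuration point above: the Java FileSystem API lets a client override the block size (and replication factor) for an individual file, while the cluster-wide default lives in hdfs-site.xml. The path, the 256 MB block size, the buffer size and the sample payload below are illustrative assumptions.

// Hedged sketch: creating an HDFS file with a per-file block size.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSizeWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/user/demo/large-file.dat");
        long blockSize = 256L * 1024 * 1024;   // 256 MB blocks for this file
        short replication = 3;                 // default replication factor
        int bufferSize = 4096;

        // This FileSystem.create overload takes the replication factor and
        // block size explicitly; the cluster default comes from dfs.blocksize.
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeUTF("example payload");
        }
    }
}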
Limitations of Hadoop 1.0
• Scalability is limited.
• The utilization of computational resources is inefficient.
• The Hadoop framework is limited only to the MapReduce processing paradigm.
Hadoop version 2.0
Hadoop 2.0 addresses these limitations by introducing YARN, which separates resource management from data processing.
Q/A
1. What is Hadoop?
Q/A
4. Which component of the Hadoop ecosystem is an in-memory data processing engine?
• b) MapReduce
• c) Apache Spark
• d) Apache Hive
Answer: c) Apache Spark
Q/A
6. Which component of the Hadoop ecosystem is used for batch processing and analysis of large datasets?
• b) MapReduce
• c) Apache Spark
• d) Apache Kafka
Answer: b) MapReduce
Q/A
7. Which type of data sources does Sqoop support?
• a) Only relational databases
Q/A
8. Which type of data sources does Flume support?
• a) Only relational databases
Q/A
What is the main difference between Hadoop 1.0 and Hadoop 2.0?
• a) Hadoop 2.0 introduced the concept of data replication.
Q/A
Which version of Hadoop introduced support for Apache HBase?
• a) Hadoop 1.0
• b) Hadoop 2.0
• c) Hadoop 2.2
• d) Hadoop 3.0
Answer: b) Hadoop 2.0
HDFS
In 2010, Facebook claimed to have one of the largest HDFS cluster
storing 21 Petabytes of data.
In 2012, Facebook declared that they have the largest single HDFS cluster
with more than 100 PB of data.
And Yahoo! has more than 100,000 CPU in over 40,000 servers running
Hadoop, with its biggest Hadoop cluster running 4,500 nodes. All told,
Yahoo! stores 455 petabytes of data in HDFS.
In fact, by 2013, many of the big names in the Fortune 50 had started using Hadoop.
HDFS
• A Distributed File System (DFS) manages data, i.e. files and folders, across multiple computers or servers.
• Just as Windows has NTFS (New Technology File System) and macOS has HFS (Hierarchical File System), a distributed file system stores data, the difference being that the data is spread over multiple machines rather than kept on a single machine. Even though the files are stored across the network, DFS organizes and presents the data in such a way that a user sitting at one machine feels as if all the data were stored on that very machine.
HDFS: Scaling
Vertical scaling (scale up): you first stop your machine, then increase the RAM or CPU to make it a more robust hardware stack, and restart the machine afterwards. The downtime while the system is stopped becomes a challenge.
Horizontal scaling (scale out): instead of upgrading a single machine, you add more machines to the cluster, which can be done on the fly without downtime. HDFS is designed around horizontal scaling on commodity hardware.
Features/Advantages of HDFS
Variety and Volume of Data: HDFS is meant for storing huge data, i.e. terabytes and petabytes, and different kinds of data. You can store any type of data in HDFS, be it structured, unstructured or semi-structured.
Reliability and Fault Tolerance: when you store data in HDFS, it internally divides the data into blocks and stores them in a distributed fashion across the Hadoop cluster. The information about which block is located on which DataNode is recorded in the metadata on the NameNode.
• Data Integrity: data integrity is about whether the data stored in HDFS is correct or not. HDFS constantly checks the stored data against its checksum. If it finds any fault, it reports it to the NameNode; the NameNode then creates additional new replicas and deletes the corrupted copies. (A small checksum sketch follows this list.)
• High Throughput: throughput is the amount of work done per unit of time. It describes how fast you can access data from the file system and gives an insight into system performance. In the earlier example, using ten machines in parallel reduced the processing time from 43 minutes to a mere 4.3 minutes.
• Data Locality: data locality means moving the processing unit to the data rather than the data to the processing unit. In a traditional system, we bring the data to the application layer and then process it; HDFS instead runs the computation where the data already resides.
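As a minimal sketch of the integrity point above: HDFS verifies block checksums internally, and the Java API exposes a file-level checksum that a client can inspect or compare across copies. The path is an illustrative assumption.

// Hedged sketch: asking HDFS for a file's checksum.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileChecksum checksum = fs.getFileChecksum(new Path("/user/demo/large-file.dat"));
        // For HDFS this is typically a composite CRC/MD5-based checksum.
        System.out.println(checksum == null ? "no checksum available"
                                            : checksum.getAlgorithmName());
    }
}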
Q/A
2. What is the primary storage component in HDFS?
• a) NameNode
• b) DataNode
• c) ResourceManager
• d) NodeManager
Answer: b) DataNode
Q/A
3. What is the role of the NameNode in HDFS?
• a) Storing and managing the actual data blocks of files.
Q/A
5.Which HDFS component is responsible for ensuring data replication
and fault tolerance?
• a) NameNode
• b) DataNode
• c) SecondaryNameNode
• d) CheckpointNode
Answer: a) NameNode
HDFS Architecture: NameNode
The NameNode regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live.
It keeps a record of all the blocks in HDFS and of the DataNodes on which these blocks are located.
The NameNode is also responsible for maintaining the replication factor of all the blocks, which is discussed in detail later in this unit.
In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes.
DataNode
The DataNodes serve the low-level read and write requests from the file system's clients.
• Apart from these two daemons, there is a third daemon or process called the Secondary NameNode. The Secondary NameNode works concurrently with the primary NameNode as a helper daemon. Do not confuse the Secondary NameNode with a backup NameNode; it is not one.
Functions of Secondary NameNode:
The Secondary NameNode constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
It is responsible for combining the EditLogs with the FsImage from the NameNode.
It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage. The new FsImage is copied back to the NameNode and is used the next time the NameNode starts.
• Hence, the Secondary NameNode performs regular checkpoints in HDFS. Therefore, it is also called the CheckpointNode.
Q/A
• In HDFS, large files are split into smaller blocks of fixed size. What is
the default block size in Hadoop 3.x?
• a) 64 MB
• b) 128 MB
• c) 256 MB
• d) 512 MB
Ans: b) 128 MB
• Which component of HDFS is responsible for storing the actual data
blocks?
• a) NameNode
• b) DataNode
• c) SecondaryNameNode
• d) ResourceManager
Ans : b) DataNode
• What is the role of the Secondary NameNode in HDFS?
• d) It assists the NameNode in checkpointing and logging operations.
Ans: d) It assists the NameNode in checkpointing and logging operations.
Block
• HDFS stores each file as blocks which are scattered throughout the
Apache Hadoop cluster. The default size of each block is 128 MB in
Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x) which you can
configure as per your requirement.
Why do we need such a huge block size, i.e. 128 MB?
• Whenever we talk about HDFS, we talk about huge data sets, i.e. terabytes and petabytes of data. If the block size were, say, 4 KB, as in a typical Linux file system, we would have far too many blocks and therefore far too much metadata. Managing that number of blocks and that much metadata would create a huge overhead on the NameNode, which is something we want to avoid (see the estimate below).
Replication Management:
• HDFS provides a reliable way to store huge data in a distributed environment as data blocks.
• The blocks are also replicated to provide fault tolerance.
• The default replication factor is 3, which is again configurable (a sketch of changing it from Java follows this list).
• Each block is therefore replicated three times, and the replicas are stored on different DataNodes.
• So, if you store a file of 128 MB in HDFS using the default configuration, you end up occupying 384 MB (3 × 128 MB), because the block is replicated three times and each replica resides on a different DataNode.
• Note: the NameNode collects block reports from the DataNodes periodically to maintain the replication factor. Whenever a block is over-replicated or under-replicated, the NameNode deletes or adds replicas as needed.
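The following is a minimal sketch of inspecting and changing the replication factor of an existing HDFS file from Java; the path and the new factor of 4 are illustrative assumptions.

// Hedged sketch: reading and changing a file's replication factor.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/large-file.dat");

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Ask the NameNode to raise the replication factor to 4; the actual
        // copying to additional DataNodes happens asynchronously.
        fs.setReplication(file, (short) 4);
    }
}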
Rack Awareness:
• The Rack Awareness algorithm says that the first replica of a block is stored on the local rack, and the next two replicas are stored on a different (remote) rack, on different DataNodes within that remote rack.
• If there are more replicas, the remaining replicas are placed on random DataNodes, provided that no more than two replicas reside on the same rack.
HDFS Write Architecture:
At first, the HDFS client reaches out to the NameNode with a Write Request for two blocks, say Block A and Block B.
The NameNode then grants the client write permission and provides the IP addresses of the DataNodes to which the file blocks will eventually be copied.
The selection of DataNode IP addresses is randomized, based on availability, the replication factor and the rack awareness discussed earlier.
Let's say the replication factor is set to the default, i.e. 3. For each block, the NameNode therefore provides the client with a list of three DataNode IP addresses, and the list is unique for each block.
Now the whole data copy process will happen in three stages:
Set up of Pipeline
Data streaming and replication
Shutdown of Pipeline (Acknowledgement stage)
1. Set up of Pipeline:
Before writing the blocks, the client confirms whether the DataNodes,
present in each of the list of IPs, are ready to receive the data or not. In
doing so, the client creates a pipeline for each of the blocks by
connecting the individual DataNodes in the respective list for that block.
Let us consider Block A. From the list of DataNodes provided by the NameNode, the client chooses the first DataNode for Block A, which is DataNode 1, and establishes a TCP/IP connection with it.
2. Data streaming and replication:
• Once the pipeline has been created, the client pushes the data into the pipeline.
• Remember that in HDFS, data is replicated based on the replication factor. Here, Block A will be stored on three DataNodes, as the assumed replication factor is 3.
• The client copies the block (A) to DataNode 1 only; the replication to the remaining DataNodes is always done sequentially by the DataNodes themselves (see the client-side sketch below).
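From the client's point of view, writing a block is a single stream write; the DataNode pipeline and the sequential replication described above happen behind that call. A minimal sketch, with an assumed path and payload:

// Hedged sketch: writing a file to HDFS through the client API.
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelineWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/block-a.txt"))) {
            // The client pushes bytes to the first DataNode in the pipeline;
            // DataNode 1 forwards them to DataNode 2, which forwards to DataNode 3.
            out.write("data destined for Block A".getBytes(StandardCharsets.UTF_8));
        } // close() completes once the pipeline acknowledgements arrive
    }
}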
3. Shutdown of Pipeline (Acknowledgement stage):
• Once the block has been copied to all three DataNodes, a series of acknowledgements takes place to assure the client and the NameNode that the data has been written successfully.
• DataNode 1 pushes three acknowledgements (including its own) back through the pipeline and sends them to the client.
• The client informs the NameNode that the data has been written successfully.
• The NameNode updates its metadata, and the client shuts down the pipeline, which ends the TCP session.
Copying of Block B
• Similarly, Block B will also be copied into the DataNodes in parallel
with Block A. So, the following things are to be noticed here:
The client copies Block A and Block B to the first DataNode of the respective pipeline simultaneously.
Therefore, in our case, two pipelines are formed, one for each block, and all the steps discussed above happen in parallel in these two pipelines.
The client writes each block only to the first DataNode in its pipeline, and the DataNodes then replicate the block sequentially.
The flow of operations taking place for each block in its respective pipeline is:
For Block A: 1A -> 2A -> 3A -> 4A
For Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B
HDFS Read Architecture:
The client first contacts the NameNode, which returns the metadata about the DataNodes holding the blocks of the requested file.
The client then starts reading data in parallel from the DataNodes (Block A from DataNode 1 and Block B from DataNode 3).
Once the client has all the required file blocks, it combines these blocks to form the file.
While serving a read request, HDFS selects the replica that is closest to the client, which reduces read latency and bandwidth consumption. Therefore, where possible, the replica residing on the same rack as the reader node is selected (a small client-side sketch follows).
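A minimal client-side sketch of the read path just described: the client opens the file, and HDFS streams the blocks from the closest replicas. The path is an illustrative assumption.

// Hedged sketch: reading a file back from HDFS and copying it to stdout.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/user/demo/block-a.txt"))) {
            // Copy the streamed blocks to stdout; 4096 is the copy buffer size.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}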
Q/A
• In HDFS, how many replicas of a data block are typically
maintained?
• a) 1
• b) 2
• c) 3
Ans: c) 3
Q/A
• Which factor(s) influence(s) the replication factor for a file in HDFS?
• a) The configured replication factor setting
• b) The size of the file
Q/A
What is the purpose of the Block Placement Policy in HDFS?
• a) To determine the number of replicas for each data block
Q/A
• When a client wants to write a file to HDFS, which component does
it initially contact?
• a) NameNode
• b) DataNode
• c) ResourceManager
• d) SecondaryNameNode
Ans: NameNode
Q/A
What is the purpose of the pipeline concept in HDFS write architecture?
a) To ensure data replication across multiple DataNodes
b) To minimize data transfer latency during write operations
c) To manage metadata updates in the NameNode
d) To encrypt data during write operations
Ans: a) To ensure data replication across multiple DataNodes
Q/A
• How many DataNodes are involved in the pipeline for writing a data block in HDFS by default?
• a) 1
• b) 2
• c) 3
Ans: c) 3
Q/A
• What is the role of the NameNode in the HDFS read architecture?
• a) Storing the actual data blocks
• b) Managing metadata and file system namespace
• c) Coordinating read requests from multiple clients
• d) Distributing data across multiple DataNodes
• Ans b) Managing metadata and file system namespace
Q/A
• In HDFS, what is the purpose of the read cache?
• a) To store frequently accessed file metadata for faster retrieval
Ans: b) To buffer and temporarily store the read data blocks for improved performance
Q/A
• Which component in HDFS is responsible for maintaining the
mapping of file blocks to DataNodes?
• a) NameNode
• b) DataNode
• c) ResourceManager
• d) SecondaryNameNode
Ans: a) NameNode
Q/A
Which company originally developed the MapReduce programming model?
• a) Google
• c) Facebook
• d) Amazon
Ans: a) Google
Q/A
What is the purpose of the Map phase in MapReduce?
• a) To process and analyze input data in parallel
Ans: a) To process and analyze input data in parallel
Q/A
Which programming language is commonly used for writing
MapReduce programs?
• a) Java
• b) Python
• c) C++
• d) Ruby
Ans : a) Java
Q/A
• Which programming framework is commonly used with YARN for
distributed data processing?
• a) Hadoop MapReduce
• b) Apache Spark
• c) Apache Hive
• d) Apache Kafka
Q/A
• What is the role of the NodeManager in YARN?
• a) Managing and allocating cluster resources
Ans: a
Q/A
• Which phase in the MapReduce process shuffles and sorts the
intermediate key-value pairs?
• a) Map
• b) Reduce
• c) Shuffle
• d) Sort
Ans c) Shuffle
Q/A
• Which Hadoop component provides the runtime environment for
executing MapReduce jobs?
• a) Hadoop Distributed File System (HDFS)
• b) Hadoop YARN (Yet Another Resource Negotiator)
• c) Hadoop MapReduce Framework
• d) Hadoop Pig
Q/A
• Which programming language is commonly used to write
MapReduce jobs?
• a) Python
• b) Java
• c) C++
• d) JavaScript
Ans b) Java
Components of YARN
1. Resource Manager: it runs on the master daemon and manages the allocation of resources across the cluster.
2. Node Manager: Node Managers run on the slave daemons and are responsible for the execution of tasks on every single DataNode.
3. Application Master: it manages the user job lifecycle and the resource needs of individual applications. It works along with the Node Manager and monitors the execution of tasks. (A small sketch using the YARN client API follows.)
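The following is a minimal sketch, assuming the cluster configuration is available on the classpath, of using the YARN client API to list running applications; it illustrates that the Resource Manager tracks one Application Master per submitted application.

// Hedged sketch: querying the ResourceManager for application reports.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnApplicationsExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Each report corresponds to one application and its ApplicationMaster.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId()
                    + "  " + report.getName()
                    + "  " + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}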
Q/A
• Which command is used to submit a MapReduce job to YARN for
execution?
• a) hadoop jar
• b) yarn jar
• c) hdfs jar
• d) mapred jar
Q/A
Which of the following is a key component of YARN?
a)HDFS (Hadoop Distributed File System)
b) MapReduce
c) YARN Application Manager
d) HBase (Hadoop Database)
Ans: c) YARN Application Manager
Q/A
• Which of the following is responsible for monitoring and tracking the status of YARN applications?
• a) YARN Resource Manager
• b) YARN Node Manager
• c) HDFS NameNode
• d) YARN Application Manager
Ans: a) YARN Resource Manager
Q/A
What is the primary function of YARN in Hadoop?
a) Data storage and retrieval
b) Cluster management and resource allocation
c) Data processing and analysis
d) Data replication and fault tolerance
Ans: b) Cluster management and resource allocation