BDA Summer 2022 Solution
[Q.1]
(b) Explain 4 V’s of Big data.
The four V’s that characterize big data are:
1. Volume: the massive amount of data generated from many sources.
2. Velocity: the speed at which data is generated and must be processed.
3. Variety: the different forms of data, i.e. structured, semi-structured, and unstructured.
4. Veracity: the trustworthiness and quality of the data.
(c) What is Hadoop? Briefly explain the core components of it.
Hadoop is a framework that uses distributed storage and parallel processing to store
and manage big data. It is among the software most widely used to handle big data,
and its market size continues to grow. There are three core components of Hadoop:
1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
2. Hadoop MapReduce - Hadoop MapReduce is the processing unit.
3. Hadoop YARN - Yet Another Resource Negotiator (YARN) is a resource management
unit.
Hadoop HDFS
HDFS is the storage unit of Hadoop. It splits files into blocks and stores them, with
replication, across the DataNodes of a cluster of commodity machines.
Features of HDFS
Fault tolerance and high availability through block replication.
Designed for a small number of very large files on inexpensive commodity hardware.
High throughput by providing access to the data in parallel.
Hadoop MapReduce
In MapReduce, the code is sent to the data rather than the data to the code. The
program that processes the data is usually very small in comparison to the data
itself, so only a few kilobytes of code need to be sent to perform a heavy-duty
process on the computers where the data already resides.
The input dataset is first split into chunks of data. In this example, the
input has three lines of text: “bus car train,” “ship ship train,” and “bus ship car.”
The dataset is split into three chunks, one per line, and the chunks are processed
in parallel.
In the map phase, each word is emitted as a key with a value of 1; here that gives
pairs such as (bus, 1), (car, 1), (ship, 1), and (train, 1).
These key-value pairs are then shuffled and sorted together based on
their keys. In the reduce phase, the values for each key are aggregated and the final
word counts are obtained.
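To make this flow concrete, here is a minimal, self-contained Python sketch of the word-count example; the data, function names, and printed comments are illustrative and not part of the original answer.

```python
# word_count_local.py - a small simulation of the map, shuffle/sort, and
# reduce phases described above (illustrative names and data).
from itertools import groupby

def mapper(lines):
    """Map phase: emit (word, 1) for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(sorted_pairs):
    """Reduce phase: aggregate counts; input must already be sorted by key."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    chunks = ["bus car train", "ship ship train", "bus ship car"]
    shuffled = sorted(mapper(chunks))        # shuffle and sort by key
    for word, total in reducer(shuffled):
        print(f"{word}\t{total}")            # bus 2, car 2, ship 3, train 2
```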
Hadoop YARN
Suppose a client machine submits a job, for example a query or some code
for data analysis. The job request goes to the Resource Manager (Hadoop
YARN), which is responsible for resource allocation and management.
[Q.2]
Big data and Hadoop are two different concepts, but they are interrelated.
In simple terms, big data is a massive amount of data, and Hadoop is a framework
used to store, process, and analyze this data.
Data has evolved rapidly in the last decade. In the earlier days, there were fewer
data-generating sources, and only one type of data was being generated: structured
data.
A single traditional database was enough to store and process this data. But as
time passed, the number of sources increased and data was generated in large
amounts all across the globe.
This generated data was of three types: structured data like Excel records, semi-
structured data like emails, and unstructured data like videos and images.
It was now difficult for a traditional database to store, process, and analyze this data.
Such data is termed big data: data that is too huge to be stored, processed, or
analyzed using traditional databases.
Hadoop is a framework that manages big data storage in a distributed way and
processes it in parallel.
Hadoop has 3 components - HDFS, MapReduce, and YARN. Hadoop Distributed File
System (HDFS) is specially designed for storing huge datasets in commodity hardware.
NFS (Network File System) is one of the oldest and most popular distributed file
storage systems, whereas HDFS (Hadoop Distributed File System) is the more recent
and widely used one for handling big data.
NFS vs HDFS
1. NFS can store and process only a small amount of data; HDFS is mainly used to
store and process big data.
2. In NFS, data is stored on a single dedicated hardware; in HDFS, the data blocks are
distributed across the local drives of the cluster machines.
3. NFS offers no reliability: data is not available in the case of machine failure. HDFS
stores data reliably: data is available even after a machine failure.
4. NFS runs on a single machine, so there is no data redundancy; HDFS runs on a
cluster of different machines, and data redundancy occurs due to the replication
protocol.
5. NFS works at workgroup scale; HDFS scales even larger than AFS.
6. NFS serves a single domain; HDFS spans multiple domains.
7. In NFS, the client identity is trusted by default; in HDFS, the client identity is
whatever the OS reports, with no Kerberos authentication.
8. NFS uses the same system calls as the O/S; HDFS uses different calls and is
mainly used for non-interactive programs.
(c) Draw and Explain HDFS architecture. How can you restart NameNode and all
the daemons in Hadoop?
The Hadoop Distributed File System (HDFS) is a highly reliable storage
system, best known for its fault tolerance and high availability.
HDFS stores very large files running on a cluster of commodity hardware.
It works on the principle of storing a small number of large files rather than a
huge number of small files.
HDFS stores data reliably even in the case of hardware failure.
It provides high throughput by providing access to the data in parallel.
Let’s discuss each of the nodes in the Hadoop HDFS Architecture in detail.
HDFS NameNode
NameNode is the centerpiece of the Hadoop Distributed File System.
It maintains and manages the file system namespace and grants clients the
appropriate access permissions.
The NameNode stores information about block locations, permissions,
etc. on the local disk in the form of two files:
1. Fsimage: Fsimage stands for File System image. It contains the
complete namespace of the Hadoop file system since the
NameNode creation.
2. Edit log: It contains all the recent changes made to the file
system namespace since the most recent Fsimage.
HDFS DataNode
DataNodes are the slave nodes in Hadoop HDFS. They run on
inexpensive commodity hardware and store the blocks of each file.
Checkpoint Node
The Checkpoint node is a node that periodically creates checkpoints of the
namespace.
Checkpoint Node in Hadoop first downloads Fsimage and edits from the
Active Namenode.
Then it merges them (Fsimage and edits) locally, and at last, it uploads
the new image back to the active NameNode.
It stores the latest checkpoint in a directory that has the same structure as
the Namenode’s directory.
Backup Node
A Backup node provides the same checkpointing functionality as the
Checkpoint node.
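The second part of the question asks how to restart the NameNode and all the daemons. With the stock Hadoop control scripts this is typically done from $HADOOP_HOME/sbin: hadoop-daemon.sh stops and starts an individual daemon such as the NameNode, while stop-all.sh and start-all.sh stop and start every daemon. Below is a minimal sketch that drives those scripts from Python; the install path is an assumption, and the script names follow the classic Hadoop 2.x layout, which differs slightly in Hadoop 3.

```python
import subprocess

HADOOP_SBIN = "/usr/local/hadoop/sbin"  # assumed install path; adjust for your cluster

def run(script, *args):
    """Invoke one of the stock Hadoop control scripts and fail loudly on error."""
    subprocess.run([f"{HADOOP_SBIN}/{script}", *args], check=True)

# Restart only the NameNode.
run("hadoop-daemon.sh", "stop", "namenode")
run("hadoop-daemon.sh", "start", "namenode")

# Restart all the daemons in the cluster.
run("stop-all.sh")
run("start-all.sh")
```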
OR
From Hadoop 2 onwards, the Resource Manager and the Node Manager are the daemon
services. When the job client submits a MapReduce job, these daemons come into
action.
They are also responsible for the parallel processing and fault-tolerance features of
MapReduce jobs.
Compared to Hadoop 1 with its Job Tracker and Task Tracker, Hadoop 2 contains a
global Resource Manager (RM) and an Application Master (AM) for each application.
1. Mapper
It is the first phase of MapReduce programming and contains the coding logic of
the mapper function.
The conditional logic is applied to the ‘n’ number of data blocks spread across
the various data nodes.
The mapper function accepts key-value pairs as input, (k, v), where the key
represents the offset address of each record and the value represents the entire
record content.
The output of the Mapper phase will also be in the key-value format, as (k’, v’).
2. Shuffle and Sort
The outputs of the various mappers (k’, v’) then go into the Shuffle and Sort phase.
All the duplicate values are removed, and the values are grouped together
based on similar keys.
The output of the Shuffle and Sort phase will again be key-value pairs, with a key and
an array of values: (k, v[]).
3. Reducer
The output of the Shuffle and Sort phase (k, v[]) will be the input of the Reducer
phase.
In this phase, the reducer function’s logic is executed and all the values are
aggregated.
The Reducer consolidates the outputs of the various mappers and computes the final
job output.
The final output is then written into a single file in an output directory of HDFS.
4. Combiner
In this phase, various outputs of the mappers are locally reduced at the node
level.
For example, if different mapper outputs (k, v) coming from a single node
contain duplicates, then they get combined, i.e. locally reduced, into a single (k,
v[]) output.
This phase makes the Shuffle and Sort phase work even quicker, thereby
improving the overall performance of the job.
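As an illustration of where the combiner sits, here is a hedged sketch using the third-party mrjob library (an assumption; the original answer names no specific tool). The combiner method performs the node-local reduction described above before the data is shuffled to the reducers.

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Map phase: emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # Combiner: locally reduce the (word, 1) pairs produced on this node.
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Reduce phase: aggregate the partial counts coming from all nodes.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Run locally as, for example, python mr_word_count.py input.txt; mrjob can also submit the same job to a Hadoop cluster.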
[Q.3]
(a) What do you mean by job scheduling in Hadoop? List different schedulers in
Hadoop.
Job scheduling is the process of allocating system resources to many different tasks by an
operating system (OS).
The system handles prioritized job queues that are awaiting CPU time, and it determines
which job is to be taken from which queue and how much time is to be allocated to that job.
There are mainly 3 types of Schedulers in Hadoop:
1. FIFO (First In First Out) Scheduler.
2. Capacity Scheduler.
3. Fair Scheduler.
1. FIFO Scheduler
As the name suggests, FIFO means First In First Out, so the task or application that comes first is served
first. This is the default scheduler in Hadoop.
The tasks are placed in a queue and the tasks are performed in their submission order. In this method,
once the job is scheduled, no intervention is allowed.
So sometimes the high-priority process has to wait for a long time since the priority of the task does not
matter in this method.
2. Capacity Scheduler
In the Capacity Scheduler we have multiple job queues for scheduling our tasks. The Capacity Scheduler
allows multiple tenants to share a large Hadoop cluster.
In the Capacity Scheduler, for each job queue we provide some slots or cluster resources for
performing job operations.
3. Fair Scheduler
The Fair Scheduler is very similar to the Capacity Scheduler, and the priority of the job is taken into
consideration.
With the help of the Fair Scheduler, YARN applications can share the resources of a large Hadoop
cluster, and these resources are allocated dynamically, so no prior capacity planning is needed.
HBase has three major components: the client library, a master server, and region
servers. Region servers can be added or removed as per requirement.
MasterServer
The master server -
Assigns regions to the region servers and takes the help of Apache ZooKeeper for
this task.
Handles load balancing of the regions across region servers. It unloads the busy
servers and shifts the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Is responsible for schema changes and other metadata operations such as creation
of tables and column families.
Regions
Regions are nothing but tables that are split up and spread across the region servers.
Region server
The region servers have regions that -
Communicate with the client and handle data-related operations.
Handle read and write requests for all the regions under it.
Decide the size of the region by following the region size thresholds.
When we take a deeper look into a region server, we see that it contains regions and
stores, as shown below:
The store contains the MemStore and HFiles. The MemStore is just like a cache memory:
anything that is written to HBase is stored here initially. Later, the data is
transferred and saved in HFiles as blocks, and the MemStore is flushed.
Zookeeper
Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.
Zookeeper has ephemeral nodes representing different region servers. Master
servers use these nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures or network
partitions.
Clients communicate with region servers via zookeeper.
In pseudo and standalone modes, HBase itself will take care of zookeeper.
OR
[Q.3]
(c) What is NoSQL database? List the differences between NoSQL and relational
databases. Explain in brief various types of NoSQL databases in practice.
NoSQL, originally referring to “non-SQL” or “non-relational”, is a database that provides a
mechanism for the storage and retrieval of data.
In practice, there are four main types of NoSQL databases:
Document-based databases
Key-value stores
Column-oriented databases
Graph-based databases
Document-Based Databases: These store data as JSON-like documents, where each
document is a set of key-value pairs; MongoDB and CouchDB are common examples.
Key-Value Stores: The simplest NoSQL model, where every item is stored as a key
together with an opaque value; Redis and Amazon DynamoDB are common examples.
Column-Oriented Databases: These store data by column families rather than rows,
which suits wide, sparse datasets; Cassandra and HBase are common examples.
Graph-Based Databases: These store entities as nodes and relationships as edges,
which suits highly connected data; Neo4j is a common example.
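To make the four models concrete, here is a purely illustrative Python sketch, not tied to any particular product, showing how the same user record might be shaped in each model.

```python
# Illustrative only: the same user record expressed in the style of each
# NoSQL data model, using plain Python structures.

# Document model: a self-contained, nested document (as in MongoDB or CouchDB).
document = {"_id": 1, "name": "Asha", "orders": [{"item": "book", "qty": 2}]}

# Key-value model: an opaque value looked up by a single key (as in Redis or DynamoDB).
key_value = {"user:1": '{"name": "Asha"}'}

# Column-oriented model: values grouped by column family (as in Cassandra or HBase).
column_family = {"row-1": {"personal": {"name": "Asha"}, "orders": {"book": 2}}}

# Graph model: nodes plus explicit relationships (as in Neo4j).
nodes = {1: {"label": "User", "name": "Asha"}, 2: {"label": "Product", "name": "book"}}
edges = [(1, "PURCHASED", 2)]

print(document, key_value, column_family, nodes, edges, sep="\n")
```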
[Q.4]
Apache Pig provides a high-level scripting language, known as Pig Latin, which is used to
develop data analysis code.
(b) What are the features of MongoDB?
Schema-less Database:
This is a great feature provided by MongoDB.
A schema-less database means one collection can hold different types
of documents. In other words, in a MongoDB database a single
collection can hold multiple documents, and these documents may
consist of different numbers of fields and differ in content and size.
It is not necessary for one document to be similar to another document,
as it is in relational databases.
Due to this feature, MongoDB gives databases great flexibility.
Document Oriented:
In MongoDB, all the data is stored in documents instead of tables as in an
RDBMS.
In these documents, the data is stored in fields (key-value pairs) instead of
rows and columns, which makes the data much more flexible in
comparison to an RDBMS.
Each document contains its own unique object id.
Indexing:
In a MongoDB database, every field in the documents can be indexed with
primary and secondary indices, which makes it easier and faster to
get or search data from the pool of data.
If the data is not indexed, then the database must scan each document against
the specified query, which takes a lot of time and is not efficient.
Scalability:
MongoDB provides horizontal scalability with the help of sharding.
Sharding means distributing data over multiple servers: a large
amount of data is partitioned into data chunks using the shard key, and
these data chunks are evenly distributed across shards that reside
on many physical servers.
New machines can also be added to a running database.
Replication:
MongoDB provides high availability and redundancy with the help of
replication: it creates multiple copies of the data and stores these copies
on different servers, so that if one server fails, the data can be retrieved
from another server.
Aggregation:
It allows you to perform operations on grouped data and obtain a single
or computed result.
It is similar to the SQL GROUP BY clause.
It provides three different ways to aggregate: the aggregation pipeline, the map-
reduce function, and single-purpose aggregation methods.
High Performance:
MongoDB offers very high performance and data persistence compared
to other databases, due to features such as scalability, indexing, and
replication.
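A small PyMongo sketch tying the indexing and aggregation features together; the connection URI, database, and collection names are assumptions for illustration.

```python
from pymongo import MongoClient

# Assumed local MongoDB instance; adjust the URI for your deployment.
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Schema-less inserts: documents in one collection may have different fields.
db.orders.insert_many([
    {"customer": "asha", "amount": 250, "items": ["book", "pen"]},
    {"customer": "ravi", "amount": 900},
])

# Indexing: create a secondary index so lookups by customer are fast.
db.orders.create_index("customer")

# Aggregation pipeline: group orders per customer, similar to SQL GROUP BY.
totals = db.orders.aggregate([
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}}
])
for row in totals:
    print(row)
```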
(c) Explain the concept of regions in HBase and storing Big data with HBase.
X
OR
[Q.4]
HBase tables contain column families and rows with elements defined as Primary
keys.
Set of tables
Each table with column families and rows
Each table must have an element defined as Primary Key.
Row key acts as a Primary key in HBase.
Any access to HBase tables uses this primary key.
Each column present in HBase denotes an attribute of the corresponding object.
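As a small illustration of the row-key and column-family model, here is a hedged sketch using the third-party happybase client, which talks to HBase through its Thrift server; the table and column-family names are made up.

```python
import happybase

# Assumes an HBase Thrift server running locally on the default port.
connection = happybase.Connection("localhost")

# One table with one column family named "personal"; the row key is the primary key.
connection.create_table("employee", {"personal": dict()})
table = connection.table("employee")

# Every cell is addressed by (row key, column family:qualifier).
table.put(b"emp-001", {b"personal:name": b"Asha", b"personal:city": b"Rajkot"})

# All access goes through the row key.
print(table.row(b"emp-001"))
```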
YouTube: https://www.youtube.com/channel/UClk43_DjgTzodjJjumyRlwA
(c) Explain Pig data Model in detail and Discuss how it will help for effective
data flow.
Pig’s data model includes Pig’s data types, how it handles concepts such as missing
data, and how you can describe your data to Pig.
Pig has three complex data types: maps, tuples, and bags. All of these types can
contain data of any type, including other complex types. So it is possible to have a map
where the value field is a bag, which contains a tuple where one of the fields is a map.
Map :
A map in Pig is a chararray to data element mapping, where that element can be any
Pig type, including a complex type.
The chararray is called a key and is used as an index to find the element, referred to
as the value.
Because Pig does not know the type of the value, it will assume it is a bytearray.
Map constants are formed using brackets to delimit the map, a hash between keys
and values, and a comma between key-value pairs. For example, ['name'#'bob', 'age'#55]
describes a map constant with two keys.
Tuple :
Tuples are divided into fields, with each field containing one data element. These
elements can be of any type—they do not all need to be the same type.
Tuple constants use parentheses to indicate the tuple and commas to delimit fields in
the tuple. For example, ('bob', 55) describes a tuple constant with two fields.
Bag :
Like tuples, a bag can, but is not required to, have a schema associated with it. In the
case of a bag, the schema describes all tuples within the bag.
Bag constants are constructed using braces, with tuples in the bag separated by
commas. For example, {('bob', 55), ('sally', 52), ('john', 25)} constructs a bag with three
tuples, each with two fields.
Bag is the one type in Pig that is not required to fit into memory. As you will see later,
because bags are used to store collections when grouping, bags can become quite
large.
Pig has the ability to spill bags to disk when necessary, keeping only partial sections
of the bag in memory.
The size of the bag is limited to the amount of local disk available for spilling the bag.
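As a rough analogy only (this is not Pig syntax), the three complex types can be pictured with plain Python structures:

```python
# A rough Python analogy for Pig's complex types (illustrative only; Pig's own
# constant syntax is shown in the text above).

# Map: chararray keys pointing at values of any type, like ['name'#'bob', 'age'#55].
pig_map = {"name": "bob", "age": 55}

# Tuple: an ordered collection of fields, possibly of mixed types, like ('bob', 55).
pig_tuple = ("bob", 55)

# Bag: an unordered collection of tuples, like {('bob', 55), ('sally', 52), ('john', 25)}.
pig_bag = [("bob", 55), ("sally", 52), ("john", 25)]

# Complex types can nest: a map whose value is a bag of tuples, one of whose
# fields is itself a map.
nested = {"employees": [({"name": "bob"}, 55)]}
```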
[Q.5]
Apache Spark utilizes in-memory caching and optimized query execution for fast queries
against data of any size.
Simply put, Spark is a fast and general engine for large-scale data
processing.
1. Apache Spark Core – Spark Core is the underlying general execution engine for
the Spark platform upon which all other functionality is built. It provides in-memory
computing and the ability to reference datasets in external storage systems.
2. Spark SQL – Spark SQL is Apache Spark’s module for working with structured
data. The interfaces offered by Spark SQL provide Spark with more information
about the structure of both the data and the computation being performed.
3. Spark Streaming – This component allows Spark to process real-time streaming
data.
Data can be ingested from many sources like Kafka, Flume, and HDFS
(Hadoop Distributed File System).
Then the data can be processed using complex algorithms and pushed
out to file systems, databases, and live dashboards.
4. MLlib – Spark also ships with MLlib, a library of common machine learning
algorithms for tasks such as classification, regression, and clustering.
5. GraphX – Spark also comes with a library to manipulate graph databases and
perform computations, called GraphX. GraphX unifies the ETL (Extract, Transform,
and Load) process, exploratory analysis, and iterative graph computation within a
single system.
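A short PySpark sketch of the Spark SQL component described above; the application name and sample data are illustrative.

```python
from pyspark.sql import SparkSession

# Build a local SparkSession (names and data here are purely illustrative).
spark = SparkSession.builder.appName("components-demo").master("local[*]").getOrCreate()

# Spark SQL: structured data as a DataFrame plus SQL queries over it.
df = spark.createDataFrame(
    [("bus", 2), ("car", 2), ("ship", 3), ("train", 2)],
    ["word", "count"],
)
df.createOrReplaceTempView("words")
spark.sql("SELECT word, count FROM words WHERE count > 2").show()

spark.stop()
```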
(b) Write difference between MongoDB and Hadoop.
CRUD operations describe the conventions of a user-interface that let users view, search, and modify
parts of the database.
MongoDB documents are modified by connecting to a server, querying the proper documents, and then
changing the desired properties before sending the data back to the database to be updated. CRUD is
data-oriented, and it is standardized according to HTTP action verbs.
The Create operation is used to insert new documents in the MongoDB database.
The Read operation is used to query a document in the database.
The Update operation is used to modify existing documents in the database.
The Delete operation is used to remove documents in the database.
Create Operations
db.collection.insertOne()
db.collection.insertMany()
Read Operations
The read operations allow you to supply special query filters and criteria
that let you specify which documents you want.
The MongoDB documentation contains more information on the available
query filters.
Query modifiers may also be used to change how many results are
returned.
MongoDB has two methods of reading documents from a collection:
db.collection.find()
db.collection.findOne()
Update Operations
db.collection.updateOne()
db.collection.updateMany()
db.collection.replaceOne()
Delete Operations
db.collection.deleteOne()
db.collection.deleteMany()
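The methods listed above are MongoDB shell methods; the same CRUD operations look almost identical from PyMongo. A minimal sketch follows, where the server URI, database, and collection names are assumptions.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
users = client["testdb"]["users"]

# Create
users.insert_one({"name": "Asha", "city": "Rajkot"})
users.insert_many([{"name": "Ravi"}, {"name": "Meera", "city": "Surat"}])

# Read: query filters select which documents are returned.
print(users.find_one({"name": "Asha"}))
for doc in users.find({"city": "Rajkot"}):
    print(doc)

# Update
users.update_one({"name": "Ravi"}, {"$set": {"city": "Baroda"}})

# Delete
users.delete_one({"name": "Meera"})
```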
OR
[Q.5]
RDD Operations
An RDD supports two types of operations:
1. Transformation
2. Action
Transformation
In Spark, the role of a transformation is to create a new dataset from an existing one.
Transformations are lazy: they are computed only when an action requires a result to
be returned to the driver program.
2. filter(func): It returns a new dataset formed by selecting those elements of the source
on which func returns true.
3. flatMap(func): Here, each input item can be mapped to zero or more output items, so
func should return a sequence rather than a single item.
5. pipe(command, [envVars]): Pipe each partition of the RDD through a shell command,
e.g. a Perl or bash script.
Action
In Spark, the role of action is to return a value to the driver program after
running a computation on the dataset.
4. countByKey() : It is only available on RDDs of type (K, V). Thus, it returns a hashmap
of (K, Int) pairs with the count of each key.
5. foreach(func) : It runs a function func on each element of the dataset for side effects
such as updating an Accumulator or interacting with external storage systems.
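A short PySpark sketch exercising the transformations and actions named above; the sample data is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-ops").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["bus car train", "ship ship train", "bus ship car"])

# Transformations (lazy): flatMap splits lines into words, filter drops some of them.
words = lines.flatMap(lambda line: line.split())
no_trains = words.filter(lambda w: w != "train")

# Actions (eager): countByKey needs (K, V) pairs; foreach runs a function per element.
pairs = no_trains.map(lambda w: (w, 1))
print(dict(pairs.countByKey()))   # e.g. {'bus': 2, 'car': 2, 'ship': 3}
no_trains.foreach(print)          # runs on the executors; visible in local mode

spark.stop()
```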
**********