
B. Tech.

(SEM VI) THEORY EXAMINATION 2021-22


BIG DATA (KCS061)
Time: 3 Hours Total Marks: 100
SECTION A

Q1(a) List any five Big Data platforms.


Answer:
1. Microsoft Azure.
2. Cloudera.
3. Sisense.
4. Collibra.
5. Tableau.
6. Qualtrics.
7. Oracle.
8. MongoDB.
Q1(b) Write any two industry examples for Big Data.
Answer:
1. Banking and Securities: Retail traders, big banks, hedge funds, and other so-called
‘big boys’ in the financial markets use Big Data for trade analytics in high-frequency
trading, pre-trade decision-support analytics, sentiment measurement, predictive
analytics, etc. This industry also relies heavily on Big Data for risk analytics, including
anti-money laundering, enterprise risk management, "Know Your Customer," and
fraud mitigation.
2. Healthcare Providers: The healthcare sector has access to huge amounts of data but
has been plagued by failures to use that data to curb the rising cost of healthcare and
by inefficient systems that stifle faster and better healthcare benefits across the board.
This is mainly because electronic data is unavailable, inadequate, or unusable.
Additionally, the healthcare databases that hold health-related information have made
it difficult to link data that could reveal patterns useful in the medical field.
Q1(c) What is the role of Sort & Shuffle in Map-Reduce?
Answer: Shuffle phase in Hadoop transfers the map output from Mapper to a Reducer in
MapReduce. Sort phase in MapReduce covers the merging and sorting of map outputs. Data
from the mapper are grouped by the key, split among reducers, and sorted by the key. Every
reducer obtains all the values associated with the same key. The shuffle and sort phases in
Hadoop occur simultaneously and are performed by the MapReduce framework.
Q1(d) Give the full form of HDFS.
Answer: Hadoop Distributed File System

Q1(e) What is the block size of a HDFS?


Answer: A typical block size used by HDFS is 128 MB.
Q1(f) Name the two type of nodes in Hadoop.
Answer:
1. NameNode
2. DataNode
Q1(g) Compare and Contrast NoSQL and Relational Databases.
Answer:
Relational Database | NoSQL
It is used to handle data arriving at low velocity. | It is used to handle data arriving at high velocity.
It gives only read scalability. | It gives both read and write scalability.
It manages structured data. | It manages all types of data.
Data arrives from one or a few locations. | Data arrives from many locations.
It supports complex transactions. | It supports simple transactions.
It has a single point of failure. | It has no single point of failure.
It handles data in low volume. | It handles data in high volume.
Transactions are written in one location. | Transactions are written in many locations.
It is deployed in a vertical fashion. | It is deployed in a horizontal fashion.

Q1(h) Does MongoDB support ACID properties? Justify your answer.


Answer: MongoDB does not need a fixed table structure and does not provide full ACID
support. It provides eventual consistency, which means that data will become consistent over
a period of time.
Q1(i) Describe schema.
Answer: A database schema defines how data is organized within a database; this includes
logical constraints such as table names, fields, data types, and the relationships between
these entities.
Q1(j) Discuss the different types of data that can be handled with HIVE.
Answer:
1. Structured data is information that has been formatted and transformed into a well-
defined data model.
2. Semi-structured data is a type of data that has some consistent and definite
characteristics. It does not conform to a rigid structure such as that needed for
relational databases.
3. Unstructured data is data present in absolutely raw form. This data is difficult to
process due to its complex arrangement and formatting. Unstructured data
management may take data from many forms, including social media posts, chats,
satellite imagery, IoT sensor data, emails, and presentations, and organize it in a
logical, predefined manner in a data store.
SECTION B

Q2(a) Detail about the three dimensions of BIG data.


Answer:

VOLUME
Within the Social Media space for example, Volume refers to the amount of data generated
through websites, portals and online applications. Especially for B2C companies, Volume
encompasses the available data that are out there and need to be assessed for relevance.
Consider the following: Facebook has 2 billion users, YouTube 1 billion users, Twitter 350
million users and Instagram 700 million users. Every day, these users contribute billions of
images, posts, videos, tweets, etc. You can now imagine the insanely large amount, or Volume,
of data that is generated every minute and every hour.
VELOCITY
With Velocity we refer to the speed with which data are being generated. Staying with our
social media example, every day 900 million photos are uploaded on Facebook, 500 million
tweets are posted on Twitter, 0.4 million hours of video are uploaded to YouTube and 3.5
billion searches are performed on Google. This is like a nuclear data explosion. Big Data helps
a company contain this explosion, accept the incoming flow of data and at the same time
process it fast so that it does not create bottlenecks.
VARIETY
Variety in Big Data refers to all the structured and unstructured data that has the possibility of
getting generated either by humans or by machines. The most commonly added data are
structured: texts, tweets, pictures and videos. However, unstructured data like emails,
voicemails, hand-written text, ECG readings, audio recordings, etc., are also important elements
under Variety. Variety is all about the ability to classify the incoming data into various
categories.
Q2(b) Explain the role of Map-Reduce architecture also explain its advantage and
disadvantages in Map-Reduce framework.
Answer:
MapReduce and HDFS are the two major components of Hadoop that make it so powerful
and efficient to use. MapReduce is a programming model used for efficient parallel processing
over large data sets in a distributed manner. The data is first split and then combined to
produce the final result. Libraries for MapReduce have been written in many programming
languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each
job and then reduce it to equivalent tasks, giving less overhead over the cluster network and
reducing the processing power required. The MapReduce task is mainly divided into two
phases, the Map phase and the Reduce phase (a short word-count sketch follows the component
list below).
MapReduce Architecture:

Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the job to MapReduce for
processing. There can be multiple clients that continuously send jobs for processing
to the Hadoop MapReduce Manager.
2. Job: The MapReduce job is the actual work the client wants to get done, which is
composed of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of
all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after the processing.
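As a minimal sketch of the two phases against the Hadoop MapReduce Java API (the class and variable names here are illustrative assumptions, not from the question paper):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);        // intermediate key-value pair for the shuffle
        }
    }
}

// Reduce phase: sum the counts that the shuffle and sort grouped under each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final (word, count) output
    }
}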

Advantages of MapReduce programming

The advantages of MapReduce programming are,


Scalability
Hadoop is a platform that is highly scalable. This is largely because of its ability to store as
well as distribute large data sets across plenty of servers. These servers can be inexpensive and
can operate in parallel. And with each addition of servers one adds more processing power.
Cost-effective solution
Hadoop’s highly scalable structure also implies that it comes across as a very cost-effective
solution for businesses that need to store the ever-growing data dictated by today’s requirements.
Flexibility
Business organizations can make use of Hadoop MapReduce programming to have access to
various new sources of data and also operate on different types of data, whether they are
structured or unstructured. This allows them to generate value from all of the data that can be
accessed by them.
Fast
Hadoop uses a storage method known as distributed file system, which basically implements
a mapping system to locate data in a cluster. The tools used for data processing, such as
MapReduce programming, are also generally located in the very same servers, which allows
for faster processing of data.
Security and Authentication
Security is a vital aspect of any application. If any unauthorized person or organization gained
access to multiple petabytes of your organization’s data, it could do massive harm to your
business dealings and operations. In this regard, MapReduce works with HDFS and HBase
security, which allows only approved users to operate on data stored in the system.
Parallel processing
One of the primary aspects of the working of MapReduce programming is that it divides tasks
in a manner that allows their execution in parallel. Parallel processing allows multiple
processors to take on these divided tasks, such that they run entire programs in less time.
Availability and resilient nature
When data is sent to an individual node in the entire network, the very same set of data is also
forwarded to the other numerous nodes that make up the network. Thus, if there is any failure
that affects a particular node, there are always other copies that can still be accessed whenever
the need may arise. This always assures the availability of data.
Simple model of programming
Among the various advantages that Hadoop MapReduce offers, one of the most important
ones is that it is based on a simple programming model. This basically allows programmers to
develop MapReduce programs that can handle tasks with more ease and efficiency.

Q2(c) Examine how a client read and write data in HDFS.


Answer: HDFS follows a write-once, read-many model, so we cannot edit files already stored
in HDFS, but we can append data by reopening a file. In a read or write operation the client
first interacts with the NameNode. The NameNode grants the required permissions, and the
client then reads or writes data blocks from/to the respective DataNodes.
Read Operation in HDFS
1. The client initiates a read request by calling the ‘open()’ method of the FileSystem
object; it is an object of type DistributedFileSystem.
2. This object connects to the namenode using RPC and gets metadata information such
as the locations of the blocks of the file. Please note that these addresses are of the
first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes holding a copy
of that block are returned.
4. Once addresses of DataNodes are received, an object of type FSDataInputStream is
returned to the client. FSDataInputStream contains DFSInputStream which takes
care of interactions with DataNode and NameNode. In step 4 shown in the above
diagram, a client invokes ‘read()’ method which causes DFSInputStream to establish
a connection with the first DataNode with the first block of a file.
5. Data is read in the form of streams, wherein the client invokes the ‘read()’ method
repeatedly. This read() process continues until it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves
on to locate the next DataNode for the next block.
7. Once the client has finished reading, it calls the close() method (a minimal client-side
sketch follows this list).
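A minimal read sketch against the Java FileSystem API (the file path is an assumption made for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // a DistributedFileSystem when fs.defaultFS is hdfs://
        // open() asks the NameNode for block locations and returns an FSDataInputStream
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {   // read() streams the blocks from the DataNodes
                System.out.println(line);
            }
        }   // closing the stream ends the connection to the last DataNode
    }
}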
Write Operation in HDFS
1. A client initiates write operation by calling ‘create()’ method of
DistributedFileSystem object which creates a new file – Step no. 1 in the above
diagram.
2. The DistributedFileSystem object connects to the NameNode using an RPC call and
initiates new file creation. However, this file create operation does not associate any
blocks with the file. It is the responsibility of the NameNode to verify that the file (which is
being created) does not exist already and a client has correct permissions to create a
new file. If a file already exists or client does not have sufficient permission to create
a new file, then IOException is thrown to the client. Otherwise, the operation
succeeds and a new record for the file is created by the NameNode.
3. Once a new record in NameNode is created, an object of type FSDataOutputStream
is returned to the client. A client uses it to write data into the HDFS. Data write
method is invoked (step 3 in the diagram).
4. FSDataOutputStream contains DFSOutputStream object which looks after
communication with DataNodes and NameNode. While the client continues writing
data, DFSOutputStream continues creating packets with this data. These packets are
enqueued into a queue which is called as DataQueue.
5. There is one more component called DataStreamer which consumes this DataQueue.
DataStreamer also asks NameNode for allocation of new blocks thereby picking
desirable DataNodes to be used for replication.
6. Now, the process of replication starts by creating a pipeline using DataNodes. In our
case, we have chosen a replication level of 3 and hence there are 3 DataNodes in the
pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in a pipeline stores packet received by it and forwards the same to
the second DataNode in a pipeline.
9. Another queue, ‘Ack Queue’ is maintained by DFSOutputStream to store packets
which are waiting for acknowledgment from DataNodes.
10. Once acknowledgment for a packet in the queue is received from all DataNodes in
the pipeline, it is removed from the ‘Ack Queue’. In the event of any DataNode
failure, packets from this queue are used to reinitiate the operation.
11. After the client has finished writing data, it calls the close() method (Step 9 in the
diagram). The call to close() results in flushing the remaining data packets to the
pipeline, followed by waiting for acknowledgment.
12. Once the final acknowledgment is received, the NameNode is contacted to tell it that
the file write operation is complete (a minimal client-side sketch follows this list).
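A matching minimal write sketch, again with an assumed path and payload:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode to record the new file and returns an FSDataOutputStream;
        // the written bytes are packetised and pushed through the DataNode replication pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            out.writeUTF("hello HDFS");
        }
        // Closing the stream flushes the remaining packets, waits for the acknowledgments,
        // and then the NameNode is told that the file is complete.
    }
}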

Q2(d) With the help of suitable example, explain how CRUD operations are performed
in MongoDB.
Answer: MongoDB provides a set of some basic but most essential operations that will help
you to easily interact with the MongoDB server and these operations are known as CRUD
operations.
Create Operations
The create or insert operations are used to insert or add new documents in the collection. If a
collection does not exist, then it will create a new collection in the database. You can perform,
create operations using the following methods provided by the MongoDB:
Method Description
db.collection.insertOne() It is used to insert a single document in the collection.
db.collection.insertMany() It is used to insert multiple documents in the collection.

Example: In this example, we are inserting details of a single student in the form of document
in the student collection using db.collection.insertOne() method.
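A minimal shell sketch of such an insert (the student fields shown are assumed for illustration):

db.student.insertOne({ name: "Sumit", age: 20, branch: "CSE" })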

Read Operations
The Read operations are used to retrieve documents from the collection, or in other words,
read operations are used to query a collection for a document. You can perform read operation
using the following method provided by the MongoDB:
Method Description
db.collection.find() It is used to retrieve documents from the collection.

Example: In this example, we are retrieving the details of students from the student
collection using db.collection.find() method.
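For example, on the same assumed collection:

db.student.find()                        // returns every student document
db.student.find({ age: { $gt: 18 } })    // returns only students older than 18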
Update Operations –
The update operations are used to update or modify the existing document in the collection.
You can perform update operations using the following methods provided by the
MongoDB:
Method Description
db.collection.updateOne() It is used to update a single document in the collection that satisfies the given criteria.
db.collection.updateMany() It is used to update multiple documents in the collection that satisfy the given criteria.
db.collection.replaceOne() It is used to replace a single document in the collection that satisfies the given criteria.

Example: In this example, we are updating the age of Sumit in the student collection
using db.collection.updateOne() method.
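A sketch of that update (the filter and the new value are assumed):

db.student.updateOne({ name: "Sumit" }, { $set: { age: 21 } })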

Delete Operations –
The delete operation are used to delete or remove the documents from a collection. You can
perform delete operations using the following methods provided by the MongoDB:
Method Description
db.collection.deleteOne() It is used to delete a single document from the collection that satisfies the given criteria.
db.collection.deleteMany() It is used to delete multiple documents from the collection that satisfy the given criteria.

Example: In this example, we are deleting all the documents from the student collection
using db.collection.deleteMany() method.
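A sketch of that delete; an empty filter document removes every document in the collection:

db.student.deleteMany({})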
Q2(e) Differentiate between Map-Reduce, PIG and HIVE
Answer:
Pig | Hive | Hadoop MapReduce
Pig is a scripting language. | Hive uses an SQL-like query language (HiveQL). | MapReduce jobs are written in a compiled language (Java).
Higher level of abstraction. | Higher level of abstraction. | Hadoop MapReduce uses a lower level of abstraction.
Comparatively fewer lines of code than MapReduce. | Comparatively fewer lines of code than MapReduce and Apache Pig. | More lines of code.
Development effort is less; code efficiency is relatively less. | Development effort is less; code efficiency is relatively less. | More development effort is involved; code efficiency is high compared to Pig and Hive.
Pig is open source. | Hive is open source. | Hadoop MapReduce is open source; Pig and Hive were built on top of it so that Hadoop developers could do the same thing in a less verbose way, with fewer lines of code that are easier to understand.
The disadvantage of Pig is that commands are not executed unless you either dump or store an intermediate or final result, which increases the iterations between debugging and resolving an issue. | The disadvantages of Hive are that there is no real-time access to data, and updating data is complicated. | The only drawback is that developers need to write many lines of basic Java code.
SECTION C
Q3(a) Discuss in detail the different forms of BIG data.
Answer:
Structured Data
Structured data is data which conforms to a data model, has a well-defined structure,
follows a consistent order, and can be easily accessed and used by a person or a computer
program. Structured data is usually stored in well-defined schemas such as databases. It is
generally tabular, with columns and rows that clearly define its attributes.
SQL (Structured Query language) is often used to manage structured data stored in databases.
Characteristics of Structured Data
 Data conforms to a data model and has easily identifiable structure
 Data is stored in the form of rows and columns
 Data is well organised so, Definition, Format and Meaning of data is explicitly known
 Data resides in fixed fields within a record or file
 Similar entities are grouped together to form relations or classes
 Entities in the same group have same attributes
 Easy to access and query, so data can be easily used by other programs
 Data elements are addressable, so efficient to analyse and process
Sources of Structured Data:
 SQL Databases
 Spreadsheets such as Excel
 OLTP Systems
 Online forms
 Sensors such as GPS or RFID tags
 Network and Web server logs
 Medical devices
Advantages of Structured Data
 Structured data have a well-defined structure that helps in easy storage and access of
data.
 Data can be indexed based on text string as well as attributes. This makes search
operation hassle-free.
 Data mining is easy i.e knowledge can be easily extracted from data
 Operations such as Updating and deleting is easy due to well structured form of data
 Business Intelligence operations such as Data warehousing can be easily undertaken
 Easily scalable in case there is an increment of data
 Ensuring security to data is easy
Unstructured Data
Unstructured data is data which does not conform to a data model and has no easily
identifiable structure, such that it cannot be used by a computer program easily. Unstructured
data is not organised in a pre-defined manner and does not have a pre-defined data model; thus
it is not a good fit for a mainstream relational database.
Characteristics of Unstructured Data:
 Data neither conforms to a data model nor has any structure.
 Data cannot be stored in the form of rows and columns as in Databases
 Data does not follow any semantic or rules
 Data lacks any particular format or sequence
 Data has no easily identifiable structure
 Due to lack of identifiable structure, it cannot be used by computer programs easily
Sources of Unstructured Data
 Web pages
 Images (JPEG, GIF, PNG, etc.)
 Videos
 Memos
 Reports
 Word documents and PowerPoint presentations
 Surveys
Advantages of Unstructured Data
 It supports the data which lacks a proper format or sequence
 The data is not constrained by a fixed schema
 Very Flexible due to absence of schema.
 Data is portable
 It is very scalable
 It can deal easily with the heterogeneity of sources.
 These types of data have a variety of business intelligence and analytics applications.
Disadvantages of Unstructured Data
 It is difficult to store and manage unstructured data due to lack of schema and structure
 Indexing the data is difficult and error prone due to unclear structure and not having
pre-defined attributes. Due to which search results are not very accurate.
 Ensuring security to data is difficult task.
Semi-Structured Data
Semi-structured data is data that does not conform to a data model but has some structure. It
lacks a fixed or rigid schema. It is data that does not reside in a relational database but that
has some organizational properties that make it easier to analyze. With some processing, we
can store it in a relational database.
Characteristics of Semi-structured Data
 Data does not conform to a data model but has some structure.
 Data cannot be stored in the form of rows and columns as in Databases
 Semi-structured data contains tags and elements (Metadata) which is used to group data
and describe how the data is stored
 Similar entities are grouped together and organized in a hierarchy
 Entities in the same group may or may not have the same attributes or properties
 Does not contain sufficient metadata which makes automation and management of data
difficult
 Size and type of the same attributes in a group may differ
 Due to lack of a well-defined structure, it cannot be used by computer programs easily
Sources of Semi-structured Data
 E-mails
 XML and other markup languages
 Binary executables
 TCP/IP packets
 Zipped files
 Integration of data from different sources
 Web pages
Advantages of Semi-structured Data
 The data is not constrained by a fixed schema
 Flexible i.e, schema can be easily changed.
 Data is portable
 It is possible to view structured data as semi-structured data
 It supports users who cannot express their needs in SQL
 It can deal easily with the heterogeneity of sources.
Disadvantages of Semi-structured Data
 Lack of fixed, rigid schema makes it difficult in storage of the data
 Interpreting the relationship between data is difficult as there is no separation of the
schema and the data.
 Queries are less efficient as compared to structured data.
Q3(b) Elaborate various components of Big Data architecture.
Answer:
Big data architecture is a comprehensive system of processing a vast amount of data. It is a
framework that lays out the blueprint of providing solutions and infrastructures to handle big
data depending on an organization’s needs. It clearly defines the components, layers to be used,
and the flow of information. The reference point is ingesting, processing, storing, managing,
accessing, and analyzing the data. A typical big data architecture has the following layers:
Big Data Sources Layer
The architecture heavily depends on the type of data and its sources. The data sources are both
open and third-party data providers. Several data sources range from relational database
management systems, data warehouses, cloud-based data warehouses, SaaS applications, real-
time data from the company servers and sensors such as IoT devices, third-party data providers,
and also static files such as Windows logs. The data managed can be both batch processing and
real-time processing (more on this below).
Storage Layer
The storage layer is the second layer in the architecture and receives the big data. It
provides the infrastructure for storing past data, converts it into the required format, and stores
it in that format. For instance, structured data is stored in an RDBMS, while the Hadoop
Distributed File System (HDFS) is used to store the batch-processing data. Typically, the
information is stored in a data lake according to the system’s requirements.
Batch processing and real-time processing Layer
The big data architecture needs to incorporate both types of data: batch (static) data and real-
time or streaming data.
 Batch processing: It is needed to manage such data (in gigabytes) efficiently. In batch
processing, the data is filtered, aggregated, processed, and prepared for analysis. The
batches are long-running jobs. The way the batch process works is to read the data into
the storage layer, process it, and write the outputs into new files. The solution for batch
time processing is Hadoop.
 Real-Time Processing: It is required for capturing, storing, and processing data on
a real-time basis. Real-time processing first ingests the data and then passes it on
through a publish-subscribe kind of tool.
Stream processing
Stream processing differs from real-time message ingestion. Stream processing handles the
windowed streaming data, or even whole streams, and writes the streamed data to the output
area. The tools used here are Apache Spark, Apache Flink, and Storm.
Analytical datastore
The analytical data store is like a one-stop place for the prepared data. This data is either
presented in an interactive database that offers the metadata or in a NoSQL data warehouse.
The data set then prepared can be searched for by querying and used for analysis with tools
such as Hive, Spark SQL, Hbase.
Analytics Layer
The analysis layer interacts with the storage layer to extract valuable insights and business
intelligence. The architecture needs a mix of multiple data tools for handling the unstructured
data and making the analysis.
Consumption or Business Intelligence (BI) Layer
This layer is the output of the big data architecture. It receives the final analysis from the
analytics layer and presents and replicates it to the relevant output system. The results acquired
are used for making decisions and for visualization. It is also referred to as the business
intelligence (BI) layer.
Q4(a) Explain the detailed architecture of Map-Reduce
Answer:
MapReduce and HDFS are the two major components of Hadoop that make it so powerful
and efficient to use. MapReduce is a programming model used for efficient parallel processing
over large data sets in a distributed manner. The data is first split and then combined to
produce the final result. Libraries for MapReduce have been written in many programming
languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each
job and then reduce it to equivalent tasks, giving less overhead over the cluster network and
reducing the processing power required. The MapReduce task is mainly divided into two
phases, the Map phase and the Reduce phase.
MapReduce Architecture:

Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the job to MapReduce for
processing. There can be multiple clients that continuously send jobs for processing
to the Hadoop MapReduce Manager.
2. Job: The MapReduce job is the actual work the client wants to get done, which is
composed of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of
all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after the processing.

The MapReduce architecture is fundamentally partitioned into two phases, the Map phase
and the Reduce phase.
Map: As the name suggests, its main use is to map the input data into key-value pairs. The
input to the map may itself be a key-value pair, where the key can be the id of some kind of
address and the value is the actual value that it holds. The Map() function is executed in
memory on each of these input key-value pairs and generates intermediate key-value pairs,
which serve as input for the Reducer or Reduce() function.
Reduce: The intermediate key-value pairs that serve as input for the Reducer are shuffled and
sorted before being sent to the Reduce() function. The Reducer aggregates or groups the data
based on its key-value pairs according to the reducer algorithm written by the developer.
How the Task Tracker and the Job Tracker manage the MapReduce architecture:
Task Tracker: Task Trackers can be considered the worker nodes that act on the instructions
given by the Job Tracker. A Task Tracker is deployed on every node available in the cluster
and executes the MapReduce tasks as instructed by the Job Tracker.
Job Tracker: Its role is to manage all the jobs and all the resources across the cluster and to
schedule each map task on a Task Tracker running on the same data node, since there can be
many data nodes available in the cluster.
There is also one more significant component of the MapReduce architecture known as the
Job History Server. The Job History Server is a daemon process that saves and stores historical
data about an application or task, such as the logs generated during or after job execution.
Hadoop MapReduce architecture has become a popular solution for today’s requirements. The
design of Hadoop keeps several goals in mind, and the layers listed below help you understand
it better (a minimal job-submission sketch follows the list).
Hadoop MapReduce framework architecture includes three significant layers. They are:
 HDFS- Hadoop Distributed File System: NameNode and DataNode, Block in HDFS,
and Replication Management.
 Yarn: Scheduler, and Application Manager.
 MapReduce: Map Task, and Reduce Task.
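As a minimal sketch of the client side described above, the driver below builds a Job and submits it to the framework, which splits it into map and reduce tasks across the cluster. The class name and the use of Hadoop's default (identity) Mapper and Reducer are assumptions made purely for illustration; the input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThroughJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pass-through");    // the "Job" that the client submits
        job.setJarByClass(PassThroughJob.class);
        // No mapper/reducer is set, so Hadoop's default identity Mapper and Reducer are used.
        job.setOutputKeyClass(LongWritable.class);          // key type emitted by the identity map
        job.setOutputValueClass(Text.class);                // value type emitted by the identity map
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() submits the job; the framework divides it into job-parts
        // (map and reduce tasks) and schedules them on the cluster nodes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}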

Q4(b) Differentiate “Scale up and Scale out” Explain with an example How Hadoop uses
Scale out feature to improve the Performance.
Answer:
Scale up
Scale up means adding resources to a single node in the system. Just like adding a hard drive
to a PC, it has been a common way for companies to upgrade storage over the past few decades.
However, over time, this method has exposed more and more limitations.
For storage systems, vertical expansion only adds hard disks or flash disks to the existing
architecture to increase storage capacity, but does not increase CPU and memory to help the
entire system handle more capacity and deliver it to the host. This means that when storage
capacity increases, storage performance tends to decrease.

For example, suppose the array’s processing performance has already reached its bottleneck.
Expanding the capacity at this point will degrade the overall performance of the LUNs
previously mapped to the host, because more LUNs compete for system resources that have
already reached the bottleneck. In turn, this also affects backup and recovery time and other
mission-critical processes.

Scale Out

Scale out is the process of replacing or adding new hardware to an existing IT system. While
expanding the capacity, the performance increases linearly with the capacity. Because each
node of the expansion has an independent CPU, independent memory, etc., after the space is
expanded, the performance of the entire cluster will not decrease with the increase of
capacity, but will increase.

The scale-out design is suitable for unstructured data, where the data can be distributed
across multiple nodes to improve flexibility and performance. Generally, with this type of
data, the I/O profile does not require the block-level deterministic latency of block-based
I/O. This is why scale-out is the prominent design in scale-out file and object storage
solutions. A scale-out solution also allows the capacity of a single volume to exceed a single
node, so object storage or file systems that need to support very large capacities are well
suited to scale-out designs.

Benefits of Scale Out

In essence, scale out can easily solve the deficiencies of vertical expansion. Its core
advantages include:

– Getting rid of the capacity and performance constraints of old equipment;

– Reducing complex infrastructure costs and quickly benefiting from newer architectures and
disk drive densities without the need for expensive forklift upgrades;

– Better hardware, which simplifies system management, facilitates redundancy and improves
uptime;

– Making it relatively easy for the organization to actually scale further in the future. The
complexity of a traditional vertical scaling architecture may bring the risk of business
interruption during upgrades, while horizontal scaling is comparatively easy.

In Hadoop, we use scale out to improve performance for the reasons below:

1. Hadoop is used for big data, and for this it needs a lot of computation over different
files. It was basically designed with the concept of scaling out the servers when
needed to process large data.
2. By scaling out, Hadoop can handle large data through parallel processing of the data
on different servers.
3. It also stores the data on different servers, and the parallel processing of that data
improves the performance of the job.

Q5(a) Demonstrate the design and concept of HDFS in detail.


Answer: When a dataset outgrows the storage capacity of a single physical machine, it
becomes necessary to partition it across a number of separate machines. Filesystems that
manage the storage across a network of machines are called distributed filesystems. Hadoop
comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed
Filesystem.
The Design of HDFS :
HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware.
Very large files: “Very large” in this context means files that are hundreds of megabytes,
gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of
data.
Streaming data access: HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied
from source, and then various analyses are performed on that dataset over time.
Commodity hardware: Hadoop doesn’t require expensive, highly reliable hardware to run on.
It’s designed to run on clusters of commodity hardware (commonly available hardware
available from multiple vendors) for which the chance of node failure across the cluster is high,
at least for large clusters. HDFS is designed to carry on working without a noticeable
interruption to the user in the face of such failure.
These are areas where HDFS is not a good fit today:
Low-latency data access: Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS.
Lots of small files: Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the namenode.
Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a single
writer. Writes are always made at the end of the file. There is no support for multiple writers,
or for modifications at arbitrary offsets in the file.
HDFS Concepts
Blocks: HDFS has the concept of a block, but it is a much larger unit: 128 MB by default in
current Hadoop versions (64 MB in older releases). Files in HDFS are broken into block-sized
chunks, which are stored as independent units. Having a block abstraction for a distributed
filesystem brings several benefits:
 The first benefit: A file can be larger than any single disk in the network. There’s
nothing that requires the blocks from a file to be stored on the same disk, so they can
take advantage of any of the disks in the cluster.
 Second: Making the unit of abstraction a block rather than a file simplifies the storage
subsystem. The storage subsystem deals with blocks, simplifying storage management
(since blocks are a fixed size, it is easy to calculate how many can be stored on a given
disk) and eliminating metadata concerns.
 Third: Blocks fit well with replication for providing fault tolerance and availability. To
insure against corrupted blocks and disk and machine failure, each block is replicated
to a small number of physically separate machines (typically three).
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
By making a block large enough, the time to transfer the data from the disk can be made to be
significantly larger than the time to seek to the start of the block. Thus the time to transfer a
large file made of multiple blocks operates at the disk transfer rate. A quick calculation shows
that if the seek time is around 10 ms, and the transfer rate is 100 MB/s, then to make the seek
time 1% of the transfer time, we need to make the block size around 100 MB. Older Hadoop
releases defaulted to 64 MB, while current HDFS installations use 128 MB blocks by default.
This figure will continue to be revised upward as transfer speeds grow with new generations
of disk drives.

Q5(b) Write the benefits and challenges of HDFS


Answer: Hadoop is one of the tools to deal with this huge amount of data as it can easily extract
the information from data, Hadoop has its Advantages and Disadvantages while we deal with
Big Data.
Advantages
1. Cost: Hadoop is open-source and uses cost-effective commodity hardware which provides a
cost-efficient model, unlike traditional Relational databases that require expensive hardware
and high-end processors to deal with Big Data. The problem with traditional relational
databases is that storing a massive volume of data is not cost-effective, so companies started
to discard the raw data, which may not reflect the correct scenario of their business. Hadoop
therefore provides two main cost benefits: it is open-source, meaning free to use, and it uses
commodity hardware, which is also inexpensive.
2. Scalability: Hadoop is a highly scalable model. A large amount of data is divided across
multiple inexpensive machines in a cluster and processed in parallel. The number of these
machines or nodes can be increased or decreased as per the enterprise’s requirements. In a
traditional RDBMS (Relational DataBase Management System), the system cannot be scaled
to approach large amounts of data.
3. Flexibility: Hadoop is designed in such a way that it can deal with any kind of dataset, such
as structured (MySQL data), semi-structured (XML, JSON) or unstructured (images and
videos), very efficiently. This means it can easily process any kind of data independently of
its structure, which makes it highly flexible. This is very useful for enterprises, as they can
process large datasets easily, and businesses can use Hadoop to analyze valuable insights from
data sources like social media, email, etc. With this flexibility Hadoop can be used for log
processing, data warehousing, fraud detection, etc.
4. Speed: Hadoop uses a distributed file system to manage its storage i.e. HDFS (Hadoop
Distributed File System). In DFS (Distributed File System) a large size file is broken into small
size file blocks then distributed among the Nodes available in a Hadoop cluster, as this massive
number of file blocks are processed parallelly which makes Hadoop faster, because of which
it provides a High-level performance as compared to the traditional DataBase Management
Systems. When you are dealing with a large amount of unstructured data speed is an important
factor, with Hadoop you can easily access TB’s of data in just a few minutes.
5. Fault Tolerance: Hadoop uses commodity hardware (inexpensive systems) which can crash
at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which
ensures the availability of data if any of your systems crashes. If one machine faces a technical
issue, the data can still be read from other nodes in the Hadoop cluster because the data is
copied or replicated by default. Hadoop makes 3 copies of each file block and stores them on
different nodes.
6. High Throughput: Hadoop works on a distributed file system where various jobs are assigned
to various DataNodes in the cluster, and the data is processed in parallel across the Hadoop
cluster, which produces high throughput. Throughput is simply the amount of work done per
unit time.
7. Minimum Network Traffic: In Hadoop, each task is divided into various small sub-task
which is then assigned to each data node available in the Hadoop cluster. Each data node
processes a small amount of data which leads to low traffic in a Hadoop cluster.

Disadvantages
1. Problem with Small Files: Hadoop performs efficiently over a small number of files of
large size. Hadoop stores files in the form of file blocks which range from 128 MB (by
default) to 256 MB in size. Hadoop struggles when it needs to access a large number of small
files, because so many small files overload the NameNode and make it difficult to work with.
2. Vulnerability: Hadoop is a framework that is written in java, and java is one of the most
commonly used programming languages which makes it more insecure as it can be easily
exploited by any of the cyber-criminal.
3. Low Performance in Small Data Environments: Hadoop is mainly designed for dealing with
large datasets, so it can be efficiently utilized by organizations that generate a massive
volume of data. Its efficiency decreases when working in small data environments.
4. Lack of Security: Data is everything for an organization, yet by default the security features
in Hadoop are turned off. So whoever manages the data needs to be aware of this and take
appropriate action. Hadoop uses Kerberos for security, which is not easy to manage. Storage
and network encryption are missing in Kerberos, which makes this an even greater concern.
5. High Processing Overhead: Read/write operations in Hadoop are expensive since we are
dealing with large data sizes, in terabytes or petabytes. In Hadoop, data is read from and
written to disk, which makes it difficult to perform in-memory calculations and leads to
processing overhead.
6. Supports Only Batch Processing: Batch processes are processes that run in the background
and do not have any kind of interaction with the user. The engines used for these processes
inside the Hadoop core are not very efficient, and producing output with low latency is not
possible with them.

Q6(a) Classify and detail the different types of NoSQL


Answer: NoSQL Database is a non-relational Data Management System, that does not require
a fixed schema. It avoids joins, and is easy to scale. The major purpose of using a NoSQL
database is for distributed data stores with humongous data storage needs. NoSQL is used for
Big data and real-time web apps. For example, companies like Twitter, Facebook and Google
collect terabytes of user data every single day.
NoSQL stands for “Not Only SQL” or “Not SQL.” Though a better term would be
“NoREL,” NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.

Traditional RDBMS uses SQL syntax to store and retrieve data for further insights. Instead, a
NoSQL database system encompasses a wide range of database technologies that can store
structured, semi-structured, unstructured and polymorphic data. Let’s understand about
NoSQL with a diagram in this NoSQL database tutorial:

Types of NoSQL Databases


NoSQL Databases are mainly categorized into four types: Key-value pair, Column-oriented,
Graph-based and Document-oriented. Every category has its own unique attributes and
limitations. None of the above-specified databases is better at solving all problems. Users
should select the database based on their product needs.
Types of NoSQL Databases:
 Key-value Pair Based
 Column-oriented Graph
 Graphs based
 Document-oriented

Key Value Pair Based


Data is stored in key/value pairs. It is designed in such a way to handle lots of data and heavy
load.
Key-value pair storage databases store data as a hash table where each key is unique, and the
value can be a JSON, BLOB(Binary Large Objects), string, etc.
For example, a key-value pair may contain a key like “Website” associated with a value like
“Guru99”.

It is one of the most basic NoSQL database examples. This kind of NoSQL database is used for
collections, dictionaries, associative arrays, etc. Key-value stores help the developer to store
schema-less data. They work best for shopping cart contents.
Redis, Dynamo, Riak are some NoSQL examples of key-value store DataBases. They are all
based on Amazon’s Dynamo paper.
Column-based
Column-oriented databases work on columns and are based on BigTable paper by Google.
Every column is treated separately. Values of single column databases are stored contiguously.

Column based NoSQL database


They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN etc. as
the data is readily available in a column.
Column-based NoSQL databases are widely used to manage data warehouses, business
intelligence, CRM, library card catalogues, etc.
HBase, Cassandra and Hypertable are examples of column-based databases.
Document-Oriented:
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the value part
is stored as a document. The document is stored in JSON or XML formats. The value is
understood by the DB and can be queried.

Relational Vs. Document


In a relational database we have rows and columns, whereas a document database has a
structure similar to JSON. For the relational database you have to know in advance what
columns you have, and so on. For a document database, however, you store data as JSON-like
objects; you are not required to define a schema beforehand, which makes it flexible.
The document type is mostly used for CMS systems, blogging platforms, real-time analytics
and e-commerce applications. It should not be used for complex transactions which require
multiple operations or queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular
document-oriented DBMS systems.
Graph-Based
A graph database stores entities as well as the relations amongst those entities. An entity is
stored as a node, with the relationships as edges. An edge gives a relationship between nodes.
Every node and edge has a unique identifier.

Compared to a relational database, where tables are loosely connected, a graph database is
multi-relational in nature. Traversing relationships is fast, as they are already captured in the
DB and there is no need to calculate them.
Graph-based databases are mostly used for social networks, logistics and spatial data.
Neo4J, Infinite Graph, OrientDB and FlockDB are some popular graph-based databases.

Q6(b) Summarize the role of indexing in MongoDB using an example.


Answer:
MongoDB is a leading NoSQL database written in C++. It is highly scalable and provides high
performance and availability. It works on the concept of collections and documents.
A collection in MongoDB is a group of related documents that are bound together. A collection
does not follow any schema, which is one of the remarkable features of MongoDB.
Indexing in MongoDB :
MongoDB uses indexing in order to make query processing more efficient. If there is no
index, then MongoDB must scan every document in the collection and retrieve only
those documents that match the query. Indexes are special data structures that store some
information related to the documents such that it becomes easy for MongoDB to find the
right data. The indexes are ordered by the value of the field specified in the index.

Creating an Index :
MongoDB provides a method called createIndex() that allows user to create an index.

Syntax –
db.COLLECTION_NAME.createIndex({KEY:1})
The key determines the field on which you want to create the index, and 1 (or -1)
determines the order in which these indexes will be arranged (ascending or descending).
Example –
db.mycol.createIndex({“age”:1})
{
“createdCollectionAutomatically” : false,
“numIndexesBefore” : 1,
“numIndexesAfter” : 2,
“ok” : 1
}
The createIndex() method also has a number of optional parameters.
These include:

 background (Boolean)
 unique (Boolean)
 name (string)
 sparse (Boolean)

Drop an Index :
In order to drop an index, MongoDB provides the dropIndex() method.


Syntax –

db.NAME_OF_COLLECTION.dropIndex({KEY:1})
The dropIndex() method can only delete one index at a time. In order to delete (or drop)
multiple indexes from the collection, MongoDB provides the dropIndexes() method, which
takes multiple indexes as its parameters.
Syntax –

db.NAME_OF_COLLECTION.dropIndexes({KEY1:1, KEY2:1})


Get description of all indexes :
The getIndexes() method in MongoDB gives a description of all the indexes that exists in
the given collection.
Syntax –
db.NAME_OF_COLLECTION.getIndexes()
It will retrieve all the description of the indexes created within the collection.
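Continuing the earlier example, a query that filters or sorts on the indexed field can then be answered from the index instead of a full collection scan:

db.mycol.createIndex({ "age": 1 })                      // ascending index on age
db.mycol.find({ age: { $gt: 18 } }).sort({ age: 1 })    // can be served using the age index
db.mycol.getIndexes()                                   // lists the default _id_ index plus age_1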
Q7(a) Explore various execution models of PIG.
Answer: Apache Pig Execution Modes
You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
 Local Mode: In this mode, all the files are installed and run from your local host and
local file system. There is no need of Hadoop or HDFS. This mode is generally used
for testing purpose.
 MapReduce Mode: MapReduce mode is where we load or process the data that
exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever
we execute the Pig Latin statements to process the data, a MapReduce job is invoked
in the back-end to perform a particular operation on the data that exists in the HDFS.

Apache Pig Execution Mechanisms

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.
 Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using
the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output
(using Dump operator).
 Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin
script in a single file with .pig extension.
 Embedded Mode (UDF) − Apache Pig provides the provision of defining our own
functions (User Defined Functions) in programming languages such as Java, and using
them in our script.

Invoking the Grunt Shell

You can invoke the Grunt shell in the desired mode (local/MapReduce) using the -x option as
shown below.

Local mode:
$ ./pig -x local

MapReduce mode:
$ ./pig -x mapreduce

Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin
statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode

You can write an entire Pig Latin script in a file and execute it using the -x option. Let us
suppose we have a Pig script in a file named sample_script.pig as shown below.
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);

Dump student;
Now, you can execute the script in the above file as shown below.
Local mode:
$ pig -x local Sample_script.pig

MapReduce mode:
$ pig -x mapreduce Sample_script.pig

Q7(b) Design and explain the detailed architecture of HIVE.


Answer:
The architecture of Hive contains different units, each described below:
User Interface: Hive is data warehouse infrastructure software that creates interaction between
the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command
line, and Hive HD Insight (on Windows Server).
Meta Store: Hive chooses respective database servers to store the schema or metadata of
tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying on schema information in the
Metastore. It is one of the replacements of the traditional approach for MapReduce programs.
Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job
and process it (see the sketch after this list).
Execution Engine: The conjunction part of the HiveQL process engine and MapReduce is the
Hive execution engine. The execution engine processes the query and generates results the
same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBASE: The Hadoop Distributed File System or HBase are the data storage
techniques used to store data in the file system.
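As a hedged illustration of the HiveQL process engine (the table and column names are assumed), a single HiveQL query such as the one below is compiled into a MapReduce job over files stored in HDFS, instead of a hand-written Java program:

CREATE TABLE student (id INT, name STRING, city STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- counting students per city becomes a map step (emit city) and a reduce step (count per city)
SELECT city, COUNT(*) FROM student GROUP BY city;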

Working of Hive
The following steps describe the workflow between Hive and Hadoop, i.e. how Hive interacts
with the Hadoop framework:

Step No. Operation


1 Execute Query
The Hive interface such as Command Line or Web UI sends query to Driver
(any database driver such as JDBC, ODBC, etc.) to execute.
2 Get Plan
The driver takes the help of the query compiler, which parses the query to check the
syntax and build the query plan, i.e. the requirements of the query.
3 Get Metadata
The compiler sends metadata request to Metastore (any database).
4 Send Metadata
Metastore sends metadata as a response to the compiler.
5 Send Plan
The compiler checks the requirement and resends the plan to the driver. Up
to here, the parsing and compiling of a query is complete.
6 Execute Plan
The driver sends the execute plan to the execution engine.
7 Execute Job
Internally, the execution plan is a MapReduce job. The execution engine sends
the job to the JobTracker, which runs on the Name node, and it assigns this job to
the TaskTracker, which runs on a Data node. Here, the query executes the
MapReduce job.
7.1 Metadata Ops
Meanwhile in execution, the execution engine can execute metadata
operations with Metastore.
8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.
