KCS061 Solution
VOLUME
Within the social media space, for example, Volume refers to the amount of data generated
through websites, portals and online applications. Especially for B2C companies, Volume
encompasses all the available data that are out there and need to be assessed for relevance.
Consider the following: Facebook has 2 billion users, YouTube 1 billion users, Twitter 350
million users and Instagram 700 million users. Every day, these users contribute billions of
images, posts, videos, tweets and so on. You can now imagine the insanely large amount, or
Volume, of data that is generated every minute and every hour.
VELOCITY
Velocity refers to the speed at which data are being generated. Staying with our social media
example, every day 900 million photos are uploaded to Facebook, 500 million tweets are
posted on Twitter, 0.4 million hours of video are uploaded to YouTube and 3.5 billion searches
are performed on Google. This is like a nuclear data explosion. Big Data helps the company
to contain this explosion, accept the incoming flow of data and at the same time process it
fast enough that it does not create bottlenecks.
VARIETY
Variety in Big Data refers to all the structured and unstructured data that can be generated
either by humans or by machines. The most commonly generated data include texts, tweets,
pictures and videos. However, data such as emails, voicemails, hand-written text, ECG
readings and audio recordings are also important elements under Variety. Variety is all about
the ability to classify the incoming data into various categories.
Q2(b) Explain the role of the Map-Reduce architecture. Also explain the advantages and
disadvantages of the Map-Reduce framework.
Answer:
MapReduce and HDFS are the two major components of Hadoop which make it so powerful
and efficient to use. MapReduce is a programming model used for efficient parallel processing
over large data-sets in a distributed manner. The data is first split and then combined to
produce the final result. Libraries for MapReduce have been written in many programming
languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each
job into smaller tasks and then reduce them to an equivalent result, which lowers the overhead
on the cluster network and reduces the processing power required. The MapReduce task is
mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing. There can be multiple clients available that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wants to perform, which is
composed of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results
of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after the processing.
Q2(d) With the help of suitable example, explain how CRUD operations are performed
in MongoDB.
Answer: MongoDB provides a set of basic but essential operations that help you easily
interact with the MongoDB server; these operations are known as CRUD operations.
Create Operations
The create or insert operations are used to insert or add new documents to a collection. If a
collection does not exist, MongoDB creates a new collection in the database. You can perform
create operations using the following methods provided by MongoDB:
Method Description
db.collection.insertOne() It is used to insert a single document in the collection.
db.collection.insertMany() It is used to insert multiple documents in the collection.
Example: In this example, we are inserting the details of a single student, in the form of a
document, into the student collection using the db.collection.insertOne() method.
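A minimal sketch of such an insert in the mongo shell (the field names and values here are illustrative, not prescribed by the question):
db.student.insertOne({name: "Sumit", age: 20, course: "B.Tech"})
On success, insertOne() acknowledges the write and reports the _id assigned to the new document.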
Read Operations
The read operations are used to retrieve documents from the collection; in other words, read
operations are used to query a collection for documents. You can perform read operations
using the following method provided by MongoDB:
Method Description
db.collection.find() It is used to retrieve documents from the collection.
Example: In this example, we are retrieving the details of students from the student
collection using db.collection.find() method.
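A minimal sketch in the mongo shell (the filter shown is illustrative):
db.student.find()              // returns all documents in the student collection
db.student.find({age: 20})     // returns only the documents matching the filter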
Update Operations –
The update operations are used to update or modify existing documents in the collection.
You can perform update operations using the following methods provided by MongoDB:
Method Description
db.collection.updateOne() It is used to update a single document in the collection that
satisfies the given criteria.
db.collection.updateMany() It is used to update multiple documents in the collection that
satisfy the given criteria.
db.collection.replaceOne() It is used to replace a single document in the collection that
satisfies the given criteria.
Example: In this example, we are updating the age of Sumit in the student collection
using db.collection.updateOne() method.
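A minimal sketch in the mongo shell (the new age value is illustrative); $set modifies only the listed field of the first matching document:
db.student.updateOne({name: "Sumit"}, {$set: {age: 21}})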
Delete Operations –
The delete operations are used to delete or remove documents from a collection. You can
perform delete operations using the following methods provided by MongoDB:
Method Description
db.collection.deleteOne() It is used to delete a single document from the collection that
satisfies the given criteria.
db.collection.deleteMany() It is used to delete multiple documents from the collection
that satisfy the given criteria.
Example: In this example, we are deleting all the documents from the student collection
using db.collection.deleteMany() method.
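A minimal sketch in the mongo shell; the empty filter {} matches, and therefore removes, every document in the collection:
db.student.deleteMany({})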
Q2(e) Differentiate between Map-Reduce, PIG and HIVE
Answer:
Pig vs Hive vs Hadoop MapReduce
1. Language: Pig is a scripting language. Hive uses an SQL-like query language. Hadoop
MapReduce uses a compiled language (Java).
2. Level of abstraction: Pig and Hive both offer a higher level of abstraction, while Hadoop
MapReduce works at a lower level of abstraction.
3. Lines of code: Pig needs comparatively fewer lines of code than MapReduce. Hive needs
comparatively fewer lines of code than both MapReduce and Apache Pig. Hadoop MapReduce
needs more lines of code.
4. Development effort and efficiency: In Pig and Hive, development effort is less but code
efficiency is relatively less. In Hadoop MapReduce, more development effort is involved, but
code efficiency is high compared to Pig and Hive.
5. Open source: Pig is open source. Hive is open source. Hadoop MapReduce is also open
source; Pig and Hive were built on top of it so that Hadoop developers could do the same thing
they would do in Java in a less verbose way, by writing fewer lines of code that are easy to
understand.
6. Drawbacks: The disadvantage of Pig is that commands are not executed unless you either
dump or store an intermediate or final result, which increases the iterations between debugging
and resolving an issue. The disadvantages of Hive are that there is no real-time access to data
and updating data is complicated. The only drawback of Hadoop MapReduce is that developers
need to write many lines of basic Java code.
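To make the difference in verbosity concrete, here is a minimal word-count sketch in Pig Latin (the input path is illustrative). The equivalent Hadoop MapReduce program in Java typically needs a full Mapper class, a Reducer class and a driver, while in Hive the same count is a single GROUP BY query over a table.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;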
SECTION C
Q3(a) Discuss in detail the different forms of BIG data.
Answer:
Structured Data
Structured data is data which conforms to a data model, has a well-defined structure,
follows a consistent order, and can be easily accessed and used by a person or a computer
program. Structured data is usually stored in well-defined schemas such as databases. It is
generally tabular, with columns and rows that clearly define its attributes.
SQL (Structured Query language) is often used to manage structured data stored in databases.
Characteristics of Structured Data
Data conforms to a data model and has easily identifiable structure
Data is stored in the form of rows and columns
Data is well organised, so the definition, format and meaning of the data are explicitly known
Data resides in fixed fields within a record or file
Similar entities are grouped together to form relations or classes
Entities in the same group have the same attributes
Easy to access and query, so data can be easily used by other programs
Data elements are addressable, so efficient to analyse and process
Sources of Structured Data:
SQL Databases
Spreadsheets such as Excel
OLTP Systems
Online forms
Sensors such as GPS or RFID tags
Network and Web server logs
Medical devices
Advantages of Structured Data
Structured data have a well-defined structure that helps in easy storage and access of
data.
Data can be indexed based on text strings as well as attributes. This makes search
operations hassle-free.
Data mining is easy, i.e. knowledge can be easily extracted from the data
Operations such as updating and deleting are easy due to the well-structured form of the data
Business Intelligence operations such as data warehousing can be easily undertaken
Easily scalable when the volume of data increases
Ensuring data security is easy
Unstructured Data
Unstructured data is data which does not conform to a data model and has no easily
identifiable structure, such that it cannot be used by a computer program easily. Unstructured
data is not organised in a pre-defined manner and does not have a pre-defined data model;
thus it is not a good fit for a mainstream relational database.
Characteristics of Unstructured Data:
Data neither conforms to a data model nor has any structure.
Data cannot be stored in the form of rows and columns as in Databases
Data does not follow any semantics or rules
Data lacks any particular format or sequence
Data has no easily identifiable structure
Due to the lack of identifiable structure, it cannot be used by computer programs easily
Sources of Unstructured Data
Web pages
Images (JPEG, GIF, PNG, etc.)
Videos
Memos
Reports
Word documents and PowerPoint presentations
Surveys
Advantages of Unstructured Data
It supports data which lacks a proper format or sequence
The data is not constrained by a fixed schema
Very flexible due to the absence of a schema.
Data is portable
It is very scalable
It can deal easily with the heterogeneity of sources.
These types of data have a variety of business intelligence and analytics applications.
Disadvantages of Unstructured Data
It is difficult to store and manage unstructured data due to the lack of schema and structure
Indexing the data is difficult and error-prone due to the unclear structure and the absence of
pre-defined attributes, because of which search results are not very accurate.
Ensuring data security is a difficult task.
Semi-Structured Data
Semi-structured data is data that does not conform to a data model but has some structure. It
lacks a fixed or rigid schema. It is data that does not reside in a relational database but has
some organizational properties that make it easier to analyze. With some processing, it can
be stored in a relational database.
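For example, the following two JSON records (with purely illustrative values) describe the same kind of entity, yet they do not share the same attributes, types or sizes, which is typical of semi-structured data:
{ "name": "Sumit", "age": 20, "email": "sumit@example.com" }
{ "name": "Anita", "age": "twenty-one", "phone": "98xxxxxx10", "hobbies": ["reading", "music"] }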
Characteristics of Semi-structured Data
Data does not conform to a data model but has some structure.
Data cannot be stored in the form of rows and columns as in Databases
Semi-structured data contains tags and elements (metadata) which are used to group data
and describe how the data is stored
Similar entities are grouped together and organized in a hierarchy
Entities in the same group may or may not have the same attributes or properties
Does not contain sufficient metadata which makes automation and management of data
difficult
Size and type of the same attributes in a group may differ
Due to lack of a well-defined structure, it cannot be used by computer programs easily
Sources of Semi-structured Data
E-mails
XML and other markup languages
Binary executables
TCP/IP packets
Zipped files
Integration of data from different sources
Web pages
Advantages of Semi-structured Data
The data is not constrained by a fixed schema
Flexible, i.e. the schema can be easily changed.
Data is portable
It is possible to view structured data as semi-structured data
It supports users who cannot express their needs in SQL
It can deal easily with the heterogeneity of sources.
Disadvantages of Semi-structured Data
The lack of a fixed, rigid schema makes storing the data difficult
Interpreting the relationship between data is difficult as there is no separation of the
schema and the data.
Queries are less efficient as compared to structured data.
Q3(b) Elaborate various components of Big Data architecture.
Answer:
Big data architecture is a comprehensive system for processing a vast amount of data. It is a
framework that lays out the blueprint for providing solutions and infrastructure to handle big
data depending on an organization's needs. It clearly defines the components, the layers to be
used, and the flow of information. The reference points are ingesting, processing, storing,
managing, accessing, and analyzing the data. A typical big data architecture has the following
layers:
Big Data Sources Layer
The architecture heavily depends on the type of data and its sources. The data sources include
both open and third-party data providers. The sources range from relational database
management systems, data warehouses, cloud-based data warehouses and SaaS applications,
to real-time data from the company's servers and sensors such as IoT devices, third-party data
providers, and static files such as Windows log files. The data managed can require both batch
processing and real-time processing (more on this below).
Storage Layer
The storage layer is the second layer in the architecture, receiving the incoming big data. It
provides the infrastructure for storing past data, converts it into the required format, and stores
it in that format. For instance, structured data is stored only in an RDBMS, while the Hadoop
Distributed File System (HDFS) is used to store batch-processing data. Typically, the
information is stored in a data lake according to the system's requirements.
Batch processing and real-time processing Layer
The big data architecture needs to incorporate both types of data: batch (static) data and real-
time or streaming data.
Batch processing: Batch processing is needed to manage such data (in gigabytes) efficiently.
In batch processing, the data is filtered, aggregated, processed, and prepared for analysis. The
batches are long-running jobs. The way the batch process works is to read the data from the
storage layer, process it, and write the outputs into new files. A common solution for batch
processing is Hadoop.
Real-Time Processing: Real-time processing is required for capturing, storing, and processing
data on a real-time basis. Real-time processing first ingests the data, typically through a
publish-subscribe kind of tool (a message broker).
Stream processing
Stream processing differs from real-time message ingestion. Stream processing handles the
streamed data in windows over the stream and writes the processed data to an output sink.
Common tools here are Apache Spark, Apache Flink, and Apache Storm.
Analytical datastore
The analytical data store is like a one-stop place for the prepared data. This data is presented
either in an interactive database that offers the metadata or in a NoSQL data warehouse. The
prepared data set can then be queried and used for analysis with tools such as Hive, Spark SQL,
and HBase.
Analytics Layer
The analysis layer interacts with the storage layer to extract valuable insights and business
intelligence. The architecture needs a mix of multiple data tools for handling the unstructured
data and making the analysis.
Consumption or Business Intelligence (BI) Layer
This layer is the output of the big data architecture. It receives the final analysis from the
analytics layer and presents and replicates it to the relevant output system. The results acquired
are used for making decisions and for visualization. It is also referred to as the business
intelligence (BI) layer.
Q4(a) Explain the detailed architecture of Map-Reduce
Answer:
MapReduce and HDFS are the two major components of Hadoop which make it so powerful
and efficient to use. MapReduce is a programming model used for efficient parallel processing
over large data-sets in a distributed manner. The data is first split and then combined to
produce the final result. Libraries for MapReduce have been written in many programming
languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each
job into smaller tasks and then reduce them to an equivalent result, which lowers the overhead
on the cluster network and reduces the processing power required. The MapReduce task is
mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing. There can be multiple clients available that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wants to perform, which is
composed of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results
of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after the processing.
The MapReduce architecture is fundamentally divided into two phases: the Map phase and the
Reduce phase.
Map: As the name suggests, its main use is to map the input data into key-value pairs. The
input to the map may be a key-value pair where the key can be the id of some kind of address
and the value is the actual value that it holds. The Map() function is executed in its memory
repository on each of these input key-value pairs and generates intermediate key-value pairs,
which serve as input for the Reducer or Reduce() function.
Reduce: The intermediate key-value pairs that serve as input for the Reducer are shuffled,
sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on
its key-value pairs according to the reducer algorithm written by the developer.
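As a concrete illustration of the two phases, below is a minimal word-count sketch using the standard Hadoop Java MapReduce API (the class names, input and output paths are illustrative, not mandated by the question). The map() function emits an intermediate (word, 1) pair for every token in its input split, and the reduce() function sums the counts for each key after the shuffle and sort.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for every line of input, emit an intermediate (word, 1) pair.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: the shuffled and sorted (word, [1, 1, ...]) groups are summed per key.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: the client that configures the Job and submits it to the Hadoop MapReduce master.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}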
How the Task Tracker and the Job Tracker manage the MapReduce architecture:
Task Tracker: Task Trackers can be considered the actual slaves that work on the instructions
given by the Job Tracker. A Task Tracker is deployed on every node available in the cluster
and executes the MapReduce tasks as instructed by the Job Tracker.
Job Tracker: Its role is to manage all the jobs and all the resources across the cluster and to
schedule each map task on a Task Tracker running on the same data node, since there can be
many data nodes available in the cluster.
There is also one more significant component of the MapReduce architecture known as the
Job History Server. The Job History Server is a daemon process that saves and stores historical
information about the application or task; for example, the logs generated during or after job
execution are stored on the Job History Server.
The Hadoop MapReduce architecture has now become a popular solution for present-day
requirements, and the design of Hadoop keeps several objectives in mind. The Hadoop
MapReduce framework architecture includes three significant layers:
HDFS (Hadoop Distributed File System): NameNode and DataNode, blocks in HDFS,
and replication management.
YARN: Scheduler and Application Manager.
MapReduce: Map Task and Reduce Task.
Q4(b) Differentiate between "Scale up" and "Scale out". Explain with an example how Hadoop
uses the scale-out feature to improve performance.
Answer:
Scale up
Scale up means adding resources to a single node in the system. Just like adding a hard drive
to a PC, it has been a common way for companies to upgrade storage over the past few decades.
However, over time, this method has exposed more and more limitations.
For storage systems, vertical expansion only adds hard disks or flash disks to the existing
architecture to increase storage capacity, but does not increase CPU and memory to help the
entire system handle more capacity and deliver it to the host. This means that when storage
capacity increases, storage performance tends to decrease.
For example, suppose the processing performance of the current array has already reached its
bottleneck. Expanding the capacity at this point will affect the overall performance of the
LUNs previously mapped to the host, because more LUNs compete for system resources that
have already reached the bottleneck. In turn, it also affects backup and recovery times and
other mission-critical processes.
Scale Out
Scale out is the process of adding new nodes or hardware to an existing IT system. While
expanding the capacity, the performance increases linearly with the capacity. Because each
added node has an independent CPU, independent memory, and so on, after the space is
expanded the performance of the entire cluster does not decrease as capacity grows, but
increases.
The horizontal-expansion design is suitable for unstructured data, where the data can be
distributed over multiple nodes to improve flexibility and performance. Generally, with this
type of data, the I/O profile does not require block-level deterministic latency (as block-based
I/O does). This is why scale-out is the prominent design in object storage solutions. The
scale-out approach also allows the capacity of a single volume to exceed a single node, so
object storage or file systems that need to support large capacities are very well suited to
scale-out designs.
In essence, scale out can easily address the deficiencies of vertical expansion. Its core
advantages include:
– It makes it relatively easy for the organization to expand further in the future. The
complexity of a traditional vertical-scaling architecture can bring the risk of business
interruption during upgrades, while horizontal scaling is comparatively easy.
In Hadoop, scale out is used to improve performance for the following reasons:
1. Hadoop is used for big data, and this requires a lot of computation over many different
files. It was designed with the concept of scaling out the servers whenever large data
needs to be processed.
2. By scaling out, Hadoop can handle large data by processing it in parallel on different
servers.
3. It also stores the data on different servers, and the parallel processing of that data
improves the performance of the job.
Disadvantages
1. Problem with small files: Hadoop performs efficiently over a small number of files of large
size. Hadoop stores files in the form of blocks, which range from 128 MB (by default) to
256 MB in size. Hadoop struggles when it needs to access a large number of small files,
because so many small files overload the NameNode and make it difficult to work.
2. Vulnerability: Hadoop is a framework written in Java, and Java is one of the most
commonly used programming languages, which makes it easier for cyber-criminals to find
and exploit weaknesses.
3. Low performance in small-data surroundings: Hadoop is mainly designed for dealing with
large datasets, so it can be efficiently utilized by organizations that are generating massive
volumes of data. Its efficiency decreases when it works on small amounts of data.
4. Lack of security: Data is everything for an organization, yet the security features in Hadoop
are disabled by default. The data administrator therefore needs to be careful with this security
facet and take appropriate action. Hadoop uses Kerberos for its security features, which is not
easy to manage; storage and network encryption are missing in Kerberos, which adds to the
concern.
5. High processing overhead: Read/write operations in Hadoop are costly, since we are dealing
with data of large size (terabytes or petabytes). In Hadoop, data is read from and written to
disk, which makes it difficult to perform in-memory computation and leads to processing
overhead.
6. Supports only batch processing: A batch process is nothing but a process that runs in the
background and does not have any kind of interaction with the user. The engines used for these
processes inside the Hadoop core are not very efficient, and producing output with low latency
is not possible with them.
A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. In
contrast, a NoSQL database system encompasses a wide range of database technologies that
can store structured, semi-structured, unstructured and polymorphic data. The main categories
of NoSQL databases are described below.
Key-value based
This is one of the most basic kinds of NoSQL database. This kind of NoSQL database is used
as a collection, dictionary, associative array, etc. Key-value stores help the developer to store
schema-less data. They work best for shopping cart contents.
Redis, Dynamo and Riak are some examples of key-value store databases. They are all based
on Amazon's Dynamo paper.
Column-based
Column-oriented databases work on columns and are based on Google's BigTable paper.
Every column is treated separately. Values of single-column databases are stored contiguously.
Graph-based
Compared to a relational database, where tables are loosely connected, a graph database is
multi-relational in nature. Traversing relationships is fast because they are already captured in
the database, and there is no need to calculate them.
Graph-based databases are mostly used for social networks, logistics and spatial data.
Neo4J, Infinite Graph, OrientDB and FlockDB are some popular graph-based databases.
Creating an Index:
MongoDB provides a method called createIndex() that allows a user to create an index.
Syntax –
db.COLLECTION_NAME.createIndex({KEY:1})
The key determines the field on the basis of which you want to create the index, and 1 (or -1)
determines the order in which the index entries will be arranged (ascending or descending).
Example –
db.mycol.createIndex({"age": 1})
{
  "createdCollectionAutomatically" : false,
  "numIndexesBefore" : 1,
  "numIndexesAfter" : 2,
  "ok" : 1
}
The createIndex() method also has a number of optional parameters.
These include:
background (Boolean)
unique (Boolean)
name (string)
sparse (Boolean)
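For instance, a sketch of createIndex() with some of these options (the index keys, the index name and the option values are illustrative):
db.mycol.createIndex({"age": 1, "name": 1}, {name: "age_name_idx", unique: false, sparse: true, background: true})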
Drop an Index
db.NAME_OF_COLLECTION.dropIndex({KEY:1})
The dropIndex() method can only delete one index at a time. In order to delete (or drop)
multiple indexes from the collection, MongoDB provides the dropIndexes() method, which
can take multiple indexes as its parameters.
Syntax –
db.NAME_OF_COLLECTION.dropIndexes()
Called with no arguments, dropIndexes() drops all indexes on the collection except the
default index on the _id field.
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using
the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output
(using Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin
script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the facility to define our own functions
(User Defined Functions) in programming languages such as Java and to use them in
our script.
You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as
shown below.
Local mode:
$ ./pig -x local
MapReduce mode:
$ ./pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin
statements in it.
grunt> customers = LOAD 'customers.txt' USING PigStorage(',');
You can write an entire Pig Latin script in a file and execute it using the pig command with
the -x option, as shown below. Let us suppose we have a Pig script in a file named
sample_script.pig as shown below.
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int,name:chararray,city:chararray);
Dump student;
Now, you can execute the script in the above file as shown below.
Local mode:
$ pig -x local Sample_script.pig
MapReduce mode:
$ pig -x mapreduce Sample_script.pig
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with the Hadoop framework: