20IT503 - Big Data Analytics - Unit4
20IT503
Big Data Analytics
Department: IT
Batch/Year: 2020-2024/ III
Date: 30.07.2022
Table of Contents

S.No  Contents
1     Contents
2     Course Objectives
3     Pre Requisites (Course Names with Code)
4     Syllabus (With Subject Code, Name, LTPC details)
5     Course Outcomes
6     CO-PO/PSO Mapping
7     Lecture Plan
8     Activity Based Learning
9     4 Introducing Hadoop
      4.1 Hadoop Overview
10    Assignments
11    Part A (Questions & Answers)
12    Part B Questions
Course Objectives

CO-PO/PSO Mapping

CO  | PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 | PSO1 PSO2 PSO3
CO1 |  2   3   3   3   3   1   1   -   1   2    1    1   |  2    2    2
CO2 |  2   3   2   3   3   1   1   -   1   2    1    1   |  2    2    2
CO3 |  2   3   2   3   3   1   1   -   1   2    1    1   |  2    2    2
CO4 |  2   3   2   3   3   1   1   -   1   2    1    1   |  2    2    2
CO5 |  2   3   2   3   3   1   1   -   1   2    1    1   |  1    1    1
Lecture Plan
UNIT – IV

S.No | Topics | No. of Periods | Proposed Date | Actual Date | Pertaining CO | Taxonomy Level | Mode of Delivery
1 | Introducing Hadoop | 1 | | | CO2 | K4 | Chalk & Board
2 | Hadoop Overview, RDBMS versus Hadoop | 1 | | | CO2 | K4 | Chalk & Board
3 | HDFS (Hadoop Distributed File System) | 1 | | | CO2 | K4 | Chalk & Board
4 | Components and Block Replication | 1 | | | CO4 | K4 | Chalk & Board
4. INTRODUCING HADOOP:
Hadoop Architecture:
At its core, Hadoop has two major layers, namely:
1. Processing/Computation layer (MapReduce), and
2. Storage layer (Hadoop Distributed File System).
Figure: Hadoop Architecture
RDBMS versus Hadoop:

1. RDBMS: Traditional row-column based databases, basically used for data storage, manipulation and retrieval.
   Hadoop: An open-source software framework used for storing data and running applications or processes concurrently.
2. RDBMS: Mostly structured data is processed.
   Hadoop: Both structured and unstructured data are processed.
3. RDBMS: It is best suited for the OLTP environment.
   Hadoop: It is best suited for big data.
4. RDBMS: It is less scalable than Hadoop.
   Hadoop: It is highly scalable.
5. RDBMS: Data normalization is required.
   Hadoop: Data normalization is not required.
6. RDBMS: It stores transformed and aggregated data.
   Hadoop: It stores huge volumes of data.
7. RDBMS: It has no latency in response.
   Hadoop: It has some latency in response.
8. RDBMS: The data schema is of static type.
   Hadoop: The data schema is of dynamic type.
9. RDBMS: High data integrity is available.
   Hadoop: Data integrity is lower than in RDBMS.
10. RDBMS: Cost is applicable for licensed software.
    Hadoop: Free of cost, as it is open-source software.
4.4 HDFS: COMPONENTS AND BLOCK REPLICATION:
It is difficult to maintain huge volumes of data in a single
machine. Therefore, it becomes necessary to break down the data into
smaller chunks and store it on multiple machines.
File systems that manage the storage across a network of
machines are called distributed file systems.
Hadoop Distributed File System (HDFS) is the storage component
of Hadoop. All data stored on Hadoop is stored in a distributed manner
across a cluster of machines. But it has a few properties that define its
existence.
Huge volumes – Being a distributed file system, it is highly capable of
storing petabytes of data without any glitches.
Data access – It is based on the philosophy that “the most effective data
processing pattern is write-once, the read-many-times pattern”.
Cost-effective – HDFS runs on a cluster of commodity hardware. These are
inexpensive machines that can be bought from any vendor.
What are the components of the Hadoop Distributed File System (HDFS)?
Broadly speaking, HDFS has two main components: data blocks and the nodes
that store those data blocks. But there is more to it than meets the eye. So,
let's look at this one by one to get a better understanding.
HDFS Blocks:
HDFS breaks down a file into smaller units. Each of these units is
stored on different machines in the cluster. This, however, is transparent to
the user working on HDFS. To them, it seems like storing all the data onto a
single machine.
These smaller units are the blocks in HDFS. The size of each of
these blocks is 128 MB by default, and you can easily change it according to
your requirement. So, if you had a file of size 512 MB, it would be divided into
4 blocks storing 128 MB each.
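To make that arithmetic concrete, here is a minimal Python sketch (an illustration only, not HDFS code) that computes how a file of a given size is divided into 128 MB blocks; the 512 MB file size is the same example figure used above.

# Illustration only: how a file is logically divided into HDFS-style blocks.
# The block size mirrors the HDFS default of 128 MB; the file size is a made-up example.
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of the given size."""
    num_blocks = math.ceil(file_size_bytes / block_size)
    blocks = []
    for i in range(num_blocks):
        start = i * block_size
        length = min(block_size, file_size_bytes - start)
        blocks.append((i, length))
    return blocks

# A 512 MB file splits into 4 blocks of 128 MB each, as described above.
for index, length in split_into_blocks(512 * 1024 * 1024):
    print(f"block {index}: {length / (1024 * 1024):.0f} MB")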
Datanodes in HDFS:
Datanodes are the worker nodes. They are inexpensive
commodity hardware that can be easily added to the cluster.
Datanodes are responsible for storing, retrieving, replicating and
deleting blocks when asked by the Namenode (the master node that keeps the
file system namespace and the mapping of blocks to Datanodes).
They periodically send heartbeats to the Namenode so that it is
aware of their health. Along with the heartbeat, a Datanode also sends a list of
the blocks stored on it so that the Namenode can maintain the mapping of blocks
to Datanodes in its memory.
But in addition to these two types of nodes in the cluster, there is
also another node called the Secondary Namenode. Let’s look at what that
is.
Secondary Namenode in HDFS:
The Secondary Namenode is a helper node, not a backup for the
Namenode. It periodically merges the Namenode's edit log into the file system
image (fsimage) so that the image stays up to date and the Namenode can
restart quickly.
Replication of blocks:
HDFS is a reliable storage component of Hadoop. This is because
every block stored in the file system is replicated on different Data Nodes in
the cluster. This makes HDFS fault-tolerant.
The default replication factor in HDFS is 3. This means that every
block will have two more copies of it, each stored on separate Datanodes in
the cluster. However, this number is configurable.
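The sketch below is a rough illustration of the replication idea only: it assigns each block to three distinct Datanodes in a simple round-robin fashion, which is not HDFS's real rack-aware placement policy; the node names and block IDs are hypothetical.

# Sketch: place each block on `replication` distinct Datanodes.
# This ignores HDFS's real rack-aware placement policy; it only illustrates the idea.
import itertools

def place_replicas(block_ids, datanodes, replication=3):
    """Map each block id to `replication` distinct Datanodes, round-robin style."""
    ring = itertools.cycle(datanodes)
    placement = {}
    for block_id in block_ids:
        placement[block_id] = [next(ring) for _ in range(replication)]
    return placement

# Hypothetical cluster of five Datanodes and the four blocks of the 512 MB file above.
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
for block, replicas in place_replicas(range(4), nodes).items():
    print(f"block {block} -> {replicas}")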
MapReduce:
In a MapReduce job, the Map operation emits intermediate key/value pairs, and
in the Reduce operation the key/value pairs output by the Map operation are
combined to provide the results.
A MapReduce application is envisioned as a series of basic operations
applied in a sequence to small sets of many (millions, billions, or even
more) data items. These data items are logically organized in a way that
enables the MapReduce execution model to allocate tasks that can be
executed in parallel.
The data items are indexed using a defined key into key/value
pairs, in which the key represents some grouping criterion associated with
a computed value. With some applications applied to massive datasets, the
theory is that the computations applied during the Map phase to each
input key/value pair are independent from one another. Figure 1.7 shows
how Map and Reduce work.
A SIMPLE EXAMPLE
In the canonical MapReduce example of counting the number of
occurrences of a word across a corpus of many documents, the key is the
word and the value is the number of times the word is counted at each
process node.
The process can be subdivided into much smaller sets of tasks. For
example: The total number of occurrences of each word in the entire collection
of documents is equal to the sum of the occurrences of each word in each
document. The total number of occurrences of each word in each document
can be computed as the sum of the occurrences of each word in each
paragraph.
The total number of occurrences of each word in each paragraph
can be computed as the sum of the occurrences of each word in each
sentence. In this example, the determination of the right level of
parallelism can be scaled in relation to the size of the "chunk" to be processed
and the number of computing resources available in the pool.
A single task might consist of counting the number of occurrences of a single
word in one paragraph. These per-task results then feed the later steps of the
sequence:
• Reduce, in which the interim results are accumulated into a final result.
• Output result, where the final output is sorted.
These steps are presumed to be run in sequence, and
applications developed using MapReduce often execute a series of iterations
of the sequence, in which the output results from iteration n become the
input to iteration n+1.
Illustration of MapReduce
The simplest illustration of MapReduce is a word count example
in which the task is to simply count the number of times each word
appears in a collection of documents.
Each map task processes a fragment of the text, line by line,
parses a line into words, and emits <word, 1> for each word,
regardless of how many times word appears in the line of text.
In this example, the map step parses the provided text string
into individual words and emits a set of key/value pairs of the form
<word, 1>. For each unique key—in this example, word—the reduce step
sums the 1 values and outputs the <word, count> key/value pairs. Because
the word each appeared twice in the given line of text, the reduce step
provides a corresponding key/value pair of <each, 2>.
Figure: Example of how Map and Reduce work
It should be noted that, in this example, the original key, 1234, is
ignored in the processing. In a typical word count application, the map step
may be applied to millions of lines of text, and the reduce step will summarize
the key/value pairs generated by all the map steps.
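The same word-count logic can be sketched in plain Python. This is not the Hadoop API; it only imitates the map, shuffle/sort and reduce stages on two made-up lines of text.

# Pure-Python sketch of the word-count MapReduce logic described above.
# Not the Hadoop API; it only mimics the map -> shuffle/sort -> reduce flow.
from collections import defaultdict

def map_phase(line):
    """Emit a <word, 1> pair for every word in the line, as in the example."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Sum the 1s for a word and emit the <word, count> pair."""
    return (word, sum(counts))

lines = ["each line is mapped on its own", "each word counts as one"]
intermediate = [pair for line in lines for pair in map_phase(line)]
for word, counts in sorted(shuffle(intermediate).items()):
    print(reduce_phase(word, counts))  # e.g. ('each', 2) because "each" appears in both lines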
7. Parallel Programming:
One of the major aspects of the working of MapReduce
programming is its parallel processing. It divides the tasks in a manner that
allows their execution in parallel.
The parallel processing allows multiple processors to execute
these divided tasks. So the entire program is run in less time.
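As a rough illustration of this idea, the sketch below uses Python's multiprocessing pool to run independent word-count map tasks in separate processes and then merges their partial results; the input chunks are invented for the example.

# Sketch: running independent map tasks in parallel with a process pool.
# The chunks below stand in for the fragments of input that MapReduce distributes.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """One map task: count words in its own fragment, independently of the others."""
    return Counter(chunk.split())

if __name__ == "__main__":
    chunks = ["hadoop splits the work", "the work runs in parallel", "results are merged later"]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, chunks)   # map tasks execute in parallel
    total = sum(partial_counts, Counter())               # a simple "reduce" that merges them
    print(total.most_common(3))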
8. Availability and resilient nature:
Whenever the data is sent to an individual node, the same set of
data is forwarded to some other nodes in a cluster. So, if any particular node
suffers from a failure, then there are always other copies present on other
nodes that can still be accessed whenever needed. This assures high
availability of data.
One of the major features offered by Apache Hadoop is its fault
tolerance. The Hadoop MapReduce framework has the ability to quickly
recognize faults that occur.
It then applies a quick and automatic recovery solution. This
feature makes it a game-changer in the world of big data processing.
4.6 INTRODUCTION TO NoSQL:
NoSQL (Not only Structured Query Language) is a term used to
describe those data stores that are applied to unstructured data. As described
earlier, HBase is such a tool that is ideal for storing key/values in column
families. In general, the power of NoSQL data stores is that as the size of the
data grows, the implemented solution can scale by simply adding additional
machines to the distributed system. Four major categories of NoSQL tools and
a few examples are provided next.
Key/value stores contain data (the value) that can be simply
accessed by a given identifier (the key). As described in the MapReduce
discussion, the values can be complex. In a key/value store, there is no stored
structure of how to use the data; the client that reads and writes to a
key/value store needs to maintain and utilize the logic of how to meaningfully
extract the useful elements from the key and the value. Here are some uses
for key/value stores:
• Using a customer’s login ID as the key, the value contains the customer’s
preferences
• Using a web session ID as the key, the value contains everything that was
captured during the session
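A minimal sketch of the key/value pattern follows, using an ordinary Python dictionary in place of a real key/value store; the login ID and preference fields are hypothetical, and the point is that the store itself knows nothing about the structure of the value.

# Sketch: a key/value store where the key is a customer's login ID and the
# value is an opaque blob of preferences. Field names here are hypothetical.
import json

kv_store = {}  # stands in for a real key/value store

def put(key, value):
    kv_store[key] = json.dumps(value)   # the store only sees an opaque string

def get(key):
    return json.loads(kv_store[key])    # the client supplies the interpretation

put("login:radhika01", {"language": "en", "theme": "dark", "items_per_page": 25})
print(get("login:radhika01")["theme"])  # the client knows how to use the value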
Column family stores are useful for sparse datasets: records may have
thousands of columns, but only a few columns have entries. The key/value
concept still applies, but in this case a key is associated with a collection of
columns. In this collection, related columns are grouped into column families.
For example, columns for age, gender, income, and education may be
grouped into a demographic family. Column family data stores are useful in
the following instances:
• To store and render blog entries, tags, and viewers’ feedback
• To store and update various web page metrics and counters
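A rough sketch of the column-family idea using nested dictionaries: each row key maps to named families, and a family stores only the columns that actually have entries. The row keys, family names and columns are invented for illustration.

# Sketch of the column-family idea: a row key maps to column families, and each
# family stores only the columns that actually have entries (sparse rows are cheap).
rows = {
    "customer:1001": {
        "demographic": {"age": 34, "gender": "F", "income": 85000, "education": "MSc"},
        "web_metrics": {"page_views": 120},          # other families can be sparse or absent
    },
    "customer:1002": {
        "demographic": {"age": 27},                  # only one column present in this family
    },
}

def get_cell(row_key, family, column, default=None):
    """Read one cell; missing families or columns simply return the default."""
    return rows.get(row_key, {}).get(family, {}).get(column, default)

print(get_cell("customer:1001", "demographic", "income"))   # 85000
print(get_cell("customer:1002", "demographic", "income"))   # None (column never stored)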
Graph databases are intended for use cases such as networks, where there
are items (people or web page links) and relationships between these items.
While it is possible to store graphs such as trees in a relational database, it
often becomes cumbersome to navigate, scale, and add new relationships.
Graph databases help to overcome these possible obstacles and can be
optimized to quickly traverse a graph (move from one item in the network to
another item in the network). Following are examples of graph database
implementations:
• Social networks such as Facebook and LinkedIn.
• Geospatial applications such as delivery and traffic systems to optimize
the time to reach one or more destinations.
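A minimal sketch of the graph idea: a small social network held as adjacency lists and traversed breadth-first to find people within a given number of relationship hops. The names are invented; a real graph database optimises exactly this kind of traversal.

# Sketch: a tiny social graph as adjacency lists, traversed breadth-first.
# Real graph databases optimise this kind of traversal; the names are invented.
from collections import deque

graph = {
    "asha":   ["bala", "chitra"],
    "bala":   ["asha", "deepak"],
    "chitra": ["asha"],
    "deepak": ["bala"],
}

def reachable_within(start, hops):
    """Return every person reachable from `start` in at most `hops` relationship steps."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == hops:
            continue
        for friend in graph.get(person, []):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return seen - {start}

print(reachable_within("asha", 2))   # friends and friends-of-friends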
Table provides a few examples of NoSQL data stores. As is often the case,
the choice of a specific data store should be made based on the functional
and performance requirements. A particular data store may provide
exceptional functionality in one aspect, but that functionality may come at a
loss of other functionality or performance.
MongoDB:
The following table relates RDBMS terminology to the equivalent MongoDB terminology:

RDBMS           MongoDB
Database        Database
Table           Collection
Tuple/Row       Document
Column          Field
mysqld/Oracle   mongod
mysql/sqlplus   mongo
Normalized Data Model
In this model, the sub-documents are stored in separate collections and the
original document refers to them using references. For example, an employee
record can be written in the normalized model as the following set of documents:
Employee:
{
   _id: <ObjectId101>,
   Emp_ID: "10025AE336"
}

Personal_details:
{
   _id: <ObjectId102>,
   empDocID: "ObjectId101",
   First_Name: "Radhika",
   Last_Name: "Sharma",
   Date_Of_Birth: "1995-09-26"
}

Contact:
{
   _id: <ObjectId103>,
   empDocID: "ObjectId101",
   e-mail: "[email protected]",
   phone: "9848022338"
}

Address:
{
   _id: <ObjectId104>,
   empDocID: "ObjectId101",
   city: "Hyderabad",
   Area: "Madapur",
   State: "Telangana"
}
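A sketch of how the normalized model above could be stored and queried with the pymongo driver. The connection URI and the database name "hr" are assumptions for the example, and the reference is kept as the real ObjectId of the employee document rather than the placeholder string shown above.

# Sketch: storing the normalized documents above with pymongo and resolving the
# reference. The connection URI and database name "hr" are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["hr"]

emp_id = db.Employee.insert_one({"Emp_ID": "10025AE336"}).inserted_id

# Each sub-document keeps a reference (empDocID) back to the employee document.
db.Personal_details.insert_one({
    "empDocID": emp_id,
    "First_Name": "Radhika",
    "Last_Name": "Sharma",
    "Date_Of_Birth": "1995-09-26",
})
db.Address.insert_one({
    "empDocID": emp_id,
    "city": "Hyderabad",
    "Area": "Madapur",
    "State": "Telangana",
})

# Resolving the reference: first fetch the employee, then the related documents.
employee = db.Employee.find_one({"Emp_ID": "10025AE336"})
address = db.Address.find_one({"empDocID": employee["_id"]})
print(address["city"])   # Hyderabad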
The following table defines how Hive interacts with the Hadoop framework:

Step No. | Operation
1 | Execute Query - The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2 | Get Plan - The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
3 | Get Metadata - The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6 | Execute Plan - The driver sends the execute plan to the execution engine.
7 | Execute Job - Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
7(i) | Metadata Ops - Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8 | Fetch Result - The execution engine receives the results from the Data nodes.
9 | Send Results - The execution engine sends those resultant values to the driver.
10 | Send Results - The driver sends the results to the Hive interfaces.
LOAD DATA
hive> LOAD DATA LOCAL INPATH './usr/Desktop/kv1.txt' OVERWRITE INTO
TABLE Employee;
SELECTS and FILTERS
hive> SELECT E.EMP_ID FROM Employee E WHERE E.Address='US';
GROUP BY
hive> SELECT E.Address, count(*) FROM Employee E GROUP BY E.Address;
Adding a Partition
We can add partitions to a table by altering the table. Let us
assume we have a table called employee with fields such as Id, Name,
Salary, Designation, Dept, and yoj.
Syntax:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)
The following query is used to add a partition to the employee table.
hive> ALTER TABLE employee
ADD PARTITION (year='2012')
location '/2012/part2012';
Renaming a Partition
The syntax of this command is as follows
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION
partition_spec;
The following query is used to rename a partition:
hive> ALTER TABLE employee PARTITION (year='1203')
RENAME TO PARTITION (Yoj='1203');
Dropping a Partition
The following syntax is used to drop a partition
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec,
PARTITION partition_spec,...;
The following query is used to drop a partition:
hive> ALTER TABLE employee DROP [IF EXISTS]
PARTITION (year='1203');
SELECT statement with WHERE clause.
SELECT statement is used to retrieve the data from a table. WHERE
clause works similar to a condition. It filters the data using the condition and
gives you a finite result. The built-in operators and functions generate an
expression, which fulfils the condition.
Given below is the syntax of the SELECT query:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference [WHERE where_condition] [GROUP BY col_list] [HAVING
having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] [LIMIT
number];
Example
Let us take an example for SELECT…WHERE clause. Assume we have
the employee table as given below, with fields named Id, Name, Salary,
Designation, and Dept. Generate a query to retrieve the details of the employees
who earn a salary of more than Rs 30000.
+------+-------------+--------+--------------------+-------+
| ID   | Name        | Salary | Designation        | Dept  |
+------+-------------+--------+--------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager  | TP    |
| 1202 | Manisha     | 45000  | Proofreader        | PR    |
| 1203 | Masthanvali | 40000  | Technical writer   | TP    |
| 1204 | Krian       | 40000  | Hr Admin           | HR    |
| 1205 | Kranthi     | 30000  | Op Admin           | Admin |
+------+-------------+--------+--------------------+-------+
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following response:
+------+-------------+--------+--------------------+-------+
| ID   | Name        | Salary | Designation        | Dept  |
+------+-------------+--------+--------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager  | TP    |
| 1202 | Manisha     | 45000  | Proofreader        | PR    |
| 1203 | Masthanvali | 40000  | Technical writer   | TP    |
| 1204 | Krian       | 40000  | Hr Admin           | HR    |
+------+-------------+--------+--------------------+-------+
Part-A Questions and Answers

Q.No. | Question | CO | K Level
8 | Explain the process of sharding in MongoDB. | CO4 | K2
9 | Differentiate between traditional distributed management processing and MongoDB. | CO4 | K2
2. Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Datasets", Cambridge University Press, 2012. (Text Book) http://infolab.stanford.edu/~ullman/mmds/bookL.pdf
6. Jimmy Lin and Chris Dyer, "Data-Intensive Text Processing with MapReduce", Synthesis Lectures on Human Language Technologies, Vol. 3, No. 1, Pages 1-177, Morgan & Claypool Publishers, 2010. (Reference Book)
Mini Project Suggestions
Disclaimer:
This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not
the intended recipient you are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.