20AI402 Data Analytics Unit-2
Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document contains
proprietary information and is intended only to the respective group / learning
community as intended. If you are not the addressee you should not disseminate,
distribute or copy through e-mail. Please notify the sender immediately by e-mail
if you have received this document by mistake and delete this document from
your system. If you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of this
information is strictly prohibited.
20AI402
DATA ANALYTICS
Department: CSE
Batch/Year: 2020-2024 /IV YEAR
Created by:
Ms. Sajithra S / Asst. Professor
Ms. Gayathri S / Asst. Professor
Date: 03-08-2023
1.Table of Contents
1. Contents
2. Course Objectives
3. Pre-Requisites
4. Syllabus
5. Course Outcomes
7. Lecture Plan
9. Lecture Notes
10. Assignments
12. Part B Questions
16. Assessment Schedule
3. PRE-REQUISITES
• Semester VI: Data Science Fundamentals
• Semester II: Python Programming
• Semester I: C Programming
4.SYLLABUS
20AI402 DATA ANALYTICS L T P C
3 0 0 3
UNIT I INTRODUCTION 9
UNIT V APPLICATIONS 9
Application: Sales and Marketing – Industry Specific Data Mining – microRNA Data
Analysis Case Study – Credit Scoring Case Study – Data Mining Nontabular Data.
TOTAL: 45 PERIODS
5.COURSE OUTCOMES
(CO–PO mapping matrix: course outcomes CO2–CO5 mapped against the programme outcomes.)
Lecture Plan
Unit - II
LECTURE PLAN – Unit 2 – HADOOP FRAMEWORK

S.No  Topic                                                   Periods  Proposed Date  CO   Taxonomy  Mode of Delivery
2     RDBMS versus Hadoop                                     1        25/08/2023     CO2  K6        PPT / Chalk & Talk
3     Hadoop Overview                                         1        28/08/2023     CO2  K6        PPT / Chalk & Talk
5     Processing Data with Hadoop                             1        02/09/2023     CO2  K6        PPT / Chalk & Talk
6     Managing Resources and Applications with Hadoop YARN    1        04/09/2023     CO2  K6        PPT / Chalk & Talk
7     Interacting with Hadoop Ecosystem                       2        08/09/2023     CO2  K6        PPT / Chalk & Talk
8. ACTIVITY BASED LEARNING
Guidelines to do an activity:
3) Conduct peer review (each team will be reviewed by all other teams and
mentors).
Useful link:
https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
UNIT-II
HADOOP FRAMEWORK
9.LECTURE NOTES
1. INTRODUCING HADOOP
Today, Big Data seems to be the buzzword! Enterprises the world over are
beginning to realize that there is a huge volume of untapped information before
them in the form of structured, semi-structured and unstructured data. This wide
variety of data is spread across the networks.
Let us look at a few statistics to get an idea of the amount of data which gets
generated every day, every minute and every second.
1. Every day:
(a) NYSE (New York Stock Exchange) generates 1.5 billion shares and trade data.
(b) Facebook stores 2.7 billion comments and Likes.
(c) Google processes about 24 petabytes of data.
2. Every minute:
(a) Facebook users share nearly 2.5 million pieces of content.
(b) Twitter users tweet nearly 300,000 times.
(c) Instagram users post nearly 220,000 new photos.
(d) YouTube users upload 72 hours of new video content.
(e) Apple users download nearly 50,000 apps.
(f) Email users send over 200 million messages.
(g) Amazon generates over $80,000 in online sales.
(h) Google receives over 4 million search queries.
3. Every second:
(a) Banking applications process more than 10,000 credit card transactions.
1.1 Data: The Treasure Trove
1. Provides business advantages such as generating product recommendations,
inventing new products, analyzing the market, and many more.
2. Provides a few early key indicators that can turn the fortunes of a business.
3. Provides room for precise analysis: if we have more data for analysis, we
have greater precision of analysis.
To process, analyze, and make sense of these different kinds of data, we need a
system that scales and addresses the challenges shown in the below Figure.
2. RDBMS VERSUS HADOOP

Parameter   RDBMS                                               Hadoop
Processing  Needs consistent relationships between data         Can process data without any consistent
                                                                relationships between the data
Processor   Needs expensive, high-end drives                    A Hadoop cluster runs on commodity hardware
Cost        Around $10,000 to $14,000 per terabyte of storage   Around $4,000 per terabyte of storage
3. HADOOP OVERVIEW
Hadoop is an open-source software framework used to store and process massive
amounts of data in a distributed fashion on large clusters of commodity hardware.
Basically, Hadoop has two core components:
1. HDFS (Hadoop Distributed File System):
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.
2. MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
Hadoop Ecosystem:
The Hadoop ecosystem consists of support projects that enhance the functionality
of Hadoop's core components.
The Eco Projects are as follows:
1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE
7. MAHOUT
Hadoop is conceptually divided into a Data Storage Layer, which stores huge volumes
of data, and a Data Processing Layer, which processes the data in parallel to extract
meaningful insights.
Figure 5.14 describes important key points of HDFS. Figure 5.15 describes
Hadoop Distributed File system architecture. Client Application interacts with
NameNode for metadata related activities and communicates with DataNodes to
read and write files. DataNodes converse with each other for pipeline reads and
writes.
Let us assume that the file “Sample.txt” is of size 192 MB. As per the default data
block size (64 MB), it will be split into three blocks and replicated across the
nodes on the cluster based on the default replication factor.
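To make the arithmetic in this example concrete, the short sketch below (plain illustrative Java, not Hadoop code) reproduces the block count and raw storage for Sample.txt under the stated assumptions of a 64 MB block size and a replication factor of 3.

// A minimal sketch that reproduces the block arithmetic in the example above:
// a 192 MB file, 64 MB blocks, and the default replication factor of 3.
public class BlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 192;      // size of Sample.txt
        long blockSizeMb = 64;      // assumed HDFS block size
        int replicationFactor = 3;  // assumed default replication factor

        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;  // ceiling division -> 3 blocks
        long replicas = blocks * replicationFactor;                  // 9 block replicas in the cluster
        long rawStorageMb = fileSizeMb * replicationFactor;          // 576 MB of raw storage consumed

        System.out.println("Blocks: " + blocks);
        System.out.println("Block replicas stored: " + replicas);
        System.out.println("Raw storage used (MB): " + rawStorageMb);
    }
}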
4.1 HDFS Daemons
4.1.1 NameNode
➢ HDFS breaks a large file into smaller pieces called blocks. NameNode uses
a rack ID to identify DataNodes in the rack.
➢ A rack is a collection of DataNodes within the cluster. NameNode keeps
track of the blocks of a file as they are placed on various DataNodes.
➢ NameNode manages file-related operations such as read, write, create and
delete. Its main job is managing the File System Namespace.
➢ A file system namespace is the collection of files in the cluster. NameNode
stores HDFS namespace.
PICTURE THIS …
You work for a renowned IT organization. Every day, when you come to the office, you
are required to swipe in to record your attendance. This record of attendance is
then shared with your manager to keep him posted on who from his team has
reported for work. Your manager is able to allocate tasks to the team
members who are present in the office. The tasks for the day cannot be allocated to
team members who have not turned up. Likewise, the heartbeat report is the way by
which DataNodes inform the NameNode that they are up and functional and can
be assigned tasks. Figure 5.17 depicts the above scenario.
4.1.3 Secondary NameNode
➢ The Secondary NameNode takes a snapshot of HDFS metadata at intervals
specified in the Hadoop configuration. Since the memory requirements of
Secondary NameNode are the same as NameNode, it is better to run
NameNode and Secondary NameNode on different machines.
➢ In case of failure of the NameNode, the Secondary NameNode can be
configured manually to bring up the cluster.
➢ However, the Secondary NameNode does not record any real-time changes
that happen to the HDFS metadata.
4. The client calls read() repeatedly to stream the data from the DataNode.
5. When the end of the block is reached, DFSInputStream closes the connection
with the DataNode. It then repeats these steps to find the best DataNode for the
next block.
6. When the client finishes writing the file, it calls close() on the stream.
7. This flushes all the remaining packets to the DataNode pipeline and waits for the
relevant acknowledgments before communicating with the NameNode to inform
the client that the creation of the file is complete.
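The same read/write flow can also be driven programmatically through Hadoop's FileSystem Java API. The sketch below is a minimal illustration of that API; the HDFS path /sample/hello.txt is a made-up example, and close() on the output stream corresponds to steps 6–7 above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the configured (distributed) file system

        // Write: create() opens an output stream; the DataNode pipeline is handled behind the scenes.
        Path out = new Path("/sample/hello.txt");   // hypothetical HDFS path
        try (FSDataOutputStream os = fs.create(out, true)) {
            os.writeUTF("Hello HDFS");
        }                                           // close() flushes the remaining packets

        // Read: open() returns a stream; read() pulls the data from a DataNode.
        try (FSDataInputStream is = fs.open(out)) {
            System.out.println(is.readUTF());
        }
        fs.close();
    }
}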
➢ Objective: To copy a file from the local file system to the Hadoop file system via
the copyFromLocal command.
Act:
hadoop fs -copyFromLocal /root/sample/test.txt /sample/testsample.txt
➢ Objective: To copy a file from Hadoop file system to local file system via
copyToLocal command.
Act:
hadoop fs -copyToLocal /sample/test.txt /root/sample/testsample1.txt
1. Data Replication:
There is absolutely no need for a client application to track all the blocks. The
NameNode directs the client to the nearest replica to ensure high performance.
2. Data Pipeline:
A client application writes a block to the first DataNode in the pipeline. Then this
DataNode takes over and forwards the data to the next node in the pipeline. This
process continues for all the data blocks, and subsequently all the replicas are
written to the disk.
5. Processing Data with Hadoop
• The output produced by the map tasks serves as intermediate data and is
stored on the local disk of that server. The output of the mappers is
automatically shuffled and sorted by the framework.
• Job inputs and outputs are stored in a file system. MapReduce framework
also takes care of the other tasks such as scheduling, monitoring,
re-executing failed tasks, etc.
• The application and the job parameters together are known as job
configuration. Hadoop job client submits job (jar/executable, etc.) to
the JobTracker. Then it is the responsibility of JobTracker to schedule tasks
to the slaves. In addition to scheduling, it also monitors the task and
provides status information to the job-client.
• Once the client submits a job to the JobTracker, the JobTracker partitions the work
and assigns diverse MapReduce tasks to each TaskTracker in the cluster. Figure 5.22
depicts this.
• In this example, there are two mappers and one reducer. Each mapper
works on the partial dataset that is stored on that node and the reducer
combines the output from the mappers to produce the reduced result set.
1. First, the input dataset is split into multiple pieces of data (several small
subsets).
2. Next, the framework creates a master and several worker processes
and executes the worker processes remotely.
3. Several map tasks work simultaneously and read the pieces of data that were
assigned to each map task. The map worker uses the map function to
extract only those data that are present on their server and generates a
key/value pair for the extracted data.
4. The map worker uses a partitioner function to divide the data into regions.
The partitioner decides which reducer should get the output of the specified
mapper.
5. When the map workers complete their work, the master instructs the reduce
workers to begin their work. The reduce workers in turn contact the map workers
to get the key/value data for their partition. The data thus received is shuffled
and sorted as per the keys.
6. Then the framework calls the reduce function for every unique key. This function
writes the output to the output file.
7. When all the reduce workers complete their work, the master transfers the
control to the user program. The word count sketch below walks through these
steps in code.
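As a concrete illustration of these steps, the sketch below follows the standard word count example from the Apache Hadoop documentation (org.apache.hadoop.mapreduce API). Treat it as a reference skeleton: the mapper emits (word, 1) pairs, the framework shuffles and sorts them by key, and the reducer sums the counts for each unique word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts received for each unique word (key).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: specifies the job configuration details (the "Driver Class" mentioned in Part A).
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It is typically packaged into a jar and run as, for example, hadoop jar wordcount.jar WordCount /input /output (the input and output paths here are placeholders).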
4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and other
memory intensive algorithms.
• In this Architecture, map slots might be "full", while the reduce slots
are empty and vice versa.
• This causes resource utilization issues. This needs to be improved for
proper resource utilization.
6.2 HDFS Limitation
• NameNode saves all its file metadata in main memory. Although the
main memory today is not as small and as expensive as it used to be
two decades ago, still there is a limit on the number of objects that one
can have in the memory on a single NameNode.
• The NameNode can quickly become overwhelmed as the load on the
system increases.
6.3 Hadoop 2: HDFS
• HDFS 2 consists of two major components:
(a) namespace
(b) blocks storage service.
• The namespace service takes care of file-related operations, such as creating
files, modifying files, and creating directories. The block storage service handles
DataNode cluster management and replication.
HDFS 2 Features
1. Horizontal scalability
2. High availability
2. NodeManager:
• This is a per-machine slave daemon.
• The NodeManager's responsibility is to launch the application containers for
application execution.
• NodeManager monitors the resource usage such as memory, CPU, disk,
network, etc. It then reports the usage of resources to the global
ResourceManager.
3. Per-application ApplicationMaster:
• This is an application-specific entity.
• Its responsibility is to negotiate required resources for execution from the
ResourceManager.
• It works along with the NodeManager for executing and monitoring
component tasks.
Container:
1. Basic unit of allocation.
2. Fine-grained resource allocation across multiple resource types (memory, CPU,
disk, network, etc.).
YARN Architecture:
Figure 5.29 depicts YARN architecture.
7. During the application execution, the client that submitted the job directly
communicates with the ApplicationMaster to get status, progress updates, etc. via
an application-specific protocol.
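As a small, hedged illustration of a client querying YARN, the sketch below uses the YarnClient API to ask the ResourceManager for the applications it knows about and print their state and progress. This is a generic monitoring snippet, not the application-specific protocol between the client and the ApplicationMaster mentioned above.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();   // reads yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for reports of the applications it is tracking.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  "
                    + app.getName() + "  "
                    + app.getYarnApplicationState() + "  "
                    + (app.getProgress() * 100) + "%");
        }
        yarnClient.stop();
    }
}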
7. INTERACTING WITH HADOOP ECOSYSTEM
7.1 Pig
• Pig is a data flow system for Hadoop. It uses Pig Latin to specify data flow.
• Pig is an alternative to MapReduce Programming. It abstracts some details
and allows us to focus on data processing.
• It consists of two components.
1. Pig Latin: The data processing language.
2. Compiler: To translate Pig Latin to MapReduce Programming.
• Figure 5.30 depicts the Pig in the Hadoop ecosystem.
7.2 Hive
• Hive is a Data Warehousing Layer on top of Hadoop.
• Analysis and queries can be done using an SQL-like language.
• Hive can be used to do ad-hoc queries, summarization, and data analysis.
Figure 5.31 depicts Hive in the Hadoop ecosystem.
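As a hedged illustration of how such SQL-like (HiveQL) queries can be issued from an application, the sketch below connects to HiveServer2 over JDBC and runs an aggregation query. The host, port, database, and the employees table are all made-up examples.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, database and table names are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept")) {
            while (rs.next()) {
                // Print one aggregated row per department.
                System.out.println(rs.getString("dept") + " : " + rs.getLong("cnt"));
            }
        }
    }
}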
7.3 Sqoop
• Sqoop is a tool which helps to transfer data between Hadoop and
Relational Databases.
• With the help of Sqoop, we can import data from RDBMS to HDFS and
vice-versa. Figure 5.32 depicts the Sqoop in Hadoop ecosystem.
7.4 HBase
• HBase is a NoSQL database for Hadoop.
• HBase is a column-oriented NoSQL database.
• HBase is used to store billions of rows and millions of columns.
• HBase provides random read/write operations. It also supports record-level
updates, which are not possible using HDFS alone. HBase sits on top of HDFS.
Figure 5.33 depicts HBase in the Hadoop ecosystem.
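A minimal sketch of the record-level, random read/write access described above, using the HBase Java client API. The users table and its info column family are made-up examples and are assumed to already exist in the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // assumed table

            // Record-level write: a Put addressed by row key, column family and qualifier.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}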
VIDEO LINKS
Unit – II

Sl. No.  Topic               Video Link
1        Introducing Hadoop  https://www.youtube.com/watch?v=aReuLtY0YMI
1. Since the data is replicated thrice in HDFS, does it mean that any
calculation done on one node will also be replicated on the other two ?
2. Why do we use HDFS for applications having large datasets and not
when we have small files?
3. Suppose Hadoop spawned 100 tasks for a job and one of the tasks
failed. What will Hadoop do?
2. MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
11. PART A : Q & A : UNIT – II
4.Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and other
memory intensive algorithms.
14. What is the difference between SQL and MapReduce? (CO2, K1)
15. What are the three important classes of MapReduce? ( CO2, K1)
● MapReduce Programming requires three things.
1. Driver Class: This class specifies Job Configuration details.
2. Mapper Class: This class overrides the Map Function based on the problem
statement.
3. Reducer Class: This class overrides the Reduce Function based on the problem
statement.
17. Which daemon is responsible for executing overall MapReduce job? ( CO2, K1)
• JobTracker provides connectivity between Hadoop and our application.
When we submit code to cluster, JobTracker creates the execution plan by
deciding which task to assign to which node. It also monitors all the running
tasks.
• JobTracker is a master daemon responsible for executing overall MapReduce
job.
2. Data Pipeline:
A client application writes a block to the first DataNode in the pipeline. Then this
DataNode takes over and forwards the data to the next node in the pipeline. This
process continues for all the data blocks, and subsequently all the replicas are
written to the disk.
11. PART A : Q & A : UNIT – II
6. Explain Managing Resources and Applications with Hadoop YARN. ( CO2, K1)
7. Write the Word Count MapReduce Programming using Java. ( CO2, K1)
8. How Does MapReduce Work? ( CO2, K1)
9. Write short notes on Working with HDFS Commands. ( CO2, K1)
13. SUPPORTIVE ONLINE CERTIFICATION COURSES
NPTEL : https://nptel.ac.in/courses/106104189
Coursera : https://www.coursera.org/learn/hadoop
Udemy : https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/
MOOC List : https://www.mooc-list.com/tags/hadoop
edX : https://www.edx.org/learn/hadoop
14. REAL TIME APPLICATIONS
1. Finance sector
Financial organizations use hadoop for fraud detection and prevention. They use
Apache Hadoop for reducing risk, identifying rogue traders, analyzing fraud patterns.
Hadoop helps them to precisely target their marketing campaigns on the basis of
customer segmentation.
2. Government and public sector
The US National Security Agency (NSA) uses Hadoop in order to prevent terrorist attacks
and to detect and prevent cyber-attacks. Big Data tools are used by police forces
for catching criminals and even predicting criminal activity. Hadoop is used by
different public sector fields such as defense, intelligence, research, cybersecurity,
etc.
3. Retail sector
Retailers, both online and offline, use Hadoop to improve their sales. Many
e-commerce companies use Hadoop to keep track of the products bought
together by customers. On the basis of this, they suggest that the customer buy
the other products from that group when the customer tries to buy one of the
relevant products in it.
For example, when a customer tries to buy a mobile phone, the site also suggests a
mobile back cover and a screen guard.
Also, Hadoop helps retailers to customize their stocks based on the predictions that
came from different sources such as Google search, social media websites, etc. Based
on these predictions retailers can make the best decision which helps them to
improve their business and maximize their profits.
4. Website tracking
Hadoop can analyze customer data in real time. It can track clickstream data, as it is
well suited for storing and processing high volumes of clickstream data. When a visitor
visits a website, Hadoop can capture information such as where the visitor originated
from before reaching the particular website and the search terms used for landing on
the website. Hadoop can also grab data about the other web pages in which the visitor
shows interest, the time spent by the visitor on each page, etc. This analysis helps in
assessing website performance and user engagement.
15. CONTENTS BEYOND SYLLABUS : UNIT – II
Apache Spark
Apache Spark is an open-source, distributed processing system used for big data
workloads. It utilizes in-memory caching and optimized query execution for fast
analytic queries against data of any size. It provides development APIs in Java,
Scala, Python and R, and supports code reuse across multiple workloads—batch
processing, interactive queries, real-time analytics, machine learning, and graph
processing.
Spark Core is the base of the whole project. It provides distributed task dispatching,
scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data
structure known as RDD (Resilient Distributed Datasets) that is a logical collection of
data partitioned across machines.
RDDs can be created in two ways: one is by referencing datasets in external storage
systems, and the second is by applying transformations (e.g. map, filter, reduce, join)
on existing RDDs.
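A minimal sketch of both creation paths using Spark's Java API, run in local mode purely for illustration; the sample values and the local[*] master setting are assumptions for a standalone demo.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkRddExample {
    public static void main(String[] args) {
        // Local-mode configuration for a quick demo (a real cluster would use a different master).
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Way 1: build an RDD from an existing collection
        // (or reference an external dataset, e.g. sc.textFile("hdfs://...")).
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

        // Way 2: derive new RDDs by applying transformations to existing RDDs.
        JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0);
        JavaRDD<Integer> squares = evens.map(n -> n * n);

        // Transformations are lazy; the collect() action triggers the computation.
        System.out.println(squares.collect());   // [4, 16, 36]
        sc.close();
    }
}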
Spark Shell
Spark also provides an interactive shell (spark-shell for Scala and pyspark for Python)
that lets users try out the API and analyze data interactively.
TEXT BOOKS:
1. Subhashini Chellappan, Seema Acharya, "Big Data and Analytics", 2nd edition,
Wiley Publications, 2019.
2. Suresh Kumar Mukhiya and Usman Ahmed, "Hands-on Exploratory Data Analysis
with Python", Packt Publishing, March 2020.
3. Jure Leskovec, Anand Rajaraman and Jeffrey Ullman, "Mining of Massive Datasets,
v2.1", Cambridge University Press, 2019.
4. Glenn J. Myatt, Wayne P. Johnson, "Making Sense of Data II: A Practical Guide to
Data Visualization, Advanced Data Mining Methods, and Applications", Wiley, 2009.
REFERENCES:
1. Nelli, F., "Python Data Analytics: with Pandas, NumPy and Matplotlib", Apress,
2018.
2. Bart Baesens, "Analytics in a Big Data World: The Essential Guide to Data Science
and its Applications", John Wiley & Sons, 2014.
3. Min Chen, Shiwen Mao, Yin Zhang, Victor C. M. Leung, "Big Data: Related
Technologies, Challenges and Future Prospects", Springer, 2014.
4. Michael Minelli, Michele Chambers, Ambiga Dhiraj, "Big Data, Big Analytics:
Emerging Business Intelligence and Analytic Trends", John Wiley & Sons, 2013.
5. Marcello Trovati, Richard Hill, Ashiq Anjum, Shao Ying Zhu, "Big Data Analytics
and Cloud Computing – Theory, Algorithms and Applications", Springer
International Publishing, 2016.
18. MINI PROJECT SUGGESTION
Mini Project:
The project should contain the following components
• Realtime dataset
• Data preparation & Transformation
• Handling missing Data
• Data Storage
• Algorithm for data analytics
• Data visualization: Charts, Heatmap, Crosstab, Treemap
Thank you
Disclaimer:
This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the respective
group / learning community as intended. If you are not the addressee you should not disseminate,
distribute or copy through e-mail. Please notify the sender immediately by e-mail if you have received
this document by mistake and delete this document from your system. If you are not the intended
recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the
contents of this information is strictly prohibited.