1 LECTURE 11
Zaeem Anwaar
Assistant Director IT
2 Introduction to Hadoop
Hadoop's history begins in the early 21st century (roughly 2001–2004),
when the internet was becoming popular and the number of users was growing daily.
Before that, data volumes were smaller and data was stored in rows and columns (mostly
documents).
Structured data – relational data (rows and columns) – was easy to store due to its low
volume/size.
A single processing unit was used, which was sufficient for structured data.
When the data changed, the first change was in its type:
data became semi-structured and unstructured (emails, audio, video, images, text, etc.).
A single storage and processing unit could no longer do the whole job.
The big data revolution followed (blogs, social media).
3 Big Data
A collection of datasets so large and complex that it becomes difficult to store,
maintain, access, process, and visualize them using on-hand database management systems or
traditional data processing applications.
Data is classified as big data using the 5 V's:
Volume (data should come in the form of huge datasets, e.g. petabytes; 1 petabyte = 1,024
terabytes (TB))
Variety (formats: structured (rows and columns), semi-structured (.CSV, .XML), and
unstructured (audio, video, text, etc.))
Velocity (new data is generated at an alarming rate), e.g. IoT, blogs, banks, social media
Value (finding the correct, meaningful data)
Veracity (uncertainty and inconsistency of the data, addressed by normalization, e.g. missing
data or NaN values)
4 Distributed Computing:
HDFS architecture (diagram): a 512 MB file is split into blocks of 128 MB (the default block
size) and the blocks are distributed across Data Node 1, Data Node 2, Data Node 3, and Data
Node 4, with replication for fault tolerance.
The Name Node (Master Node / "Boss" Node) holds the metadata: file name, size, and block
locations. Secondary Name Nodes keep a backup of the Name Node's metadata.
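A small arithmetic sketch (in Java) of the diagram above. The 512 MB file and the 128 MB default block size come from the slide; the replication factor of 3 is an assumption based on the usual HDFS default, and the class and variable names are only illustrative.

// Block math for the HDFS diagram: a 512 MB file with a 128 MB block size
// yields 4 blocks, each stored on a data node and replicated across the cluster.
public class HdfsBlockMath {
  public static void main(String[] args) {
    long fileSizeMb = 512;       // example file size from the slide
    long blockSizeMb = 128;      // HDFS default block size
    int replicationFactor = 3;   // assumed default replication factor

    long blocks = (long) Math.ceil((double) fileSizeMb / blockSizeMb);
    long rawStorageMb = fileSizeMb * replicationFactor;

    System.out.println("Blocks: " + blocks);                                  // 4
    System.out.println("Raw storage across the cluster (MB): " + rawStorageMb); // 1536
  }
}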
11 Goals of HDFS
Fault detection and recovery
Huge datasets
HDFS should have hundreds of nodes per cluster to manage applications
with huge datasets.
Hardware at data
A requested task can be done efficiently when the computation takes place
near the data. Especially where huge datasets are involved, this reduces
network traffic and increases throughput. (Sometimes data is saved
locally to avoid traffic and bandwidth problems.)
12 MapReduce
Used to access and process the data (the processing element of Hadoop).
Traditionally, data processing was done on a single machine with a single processor; it
consumed more time and was inefficient, especially when processing data of large variety and
volume.
MapReduce accesses the data using distributed and parallel computation algorithms, dividing it
into parts according to the user's query (divide and conquer).
MapReduce example (diagrams on the next two slides; a code sketch follows them):
13–14 MapReduce example (diagram slides)
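A minimal Java sketch of the Map and Reduce phases, modeled on the classic Hadoop word-count example; the class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  // Reduce phase: after the shuffle/sort, sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // emit (word, total count)
    }
  }
}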
15 How does Hadoop work?
Hadoop runs code across a cluster of computers. This process includes the
following core tasks that Hadoop performs:
Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128 MB or 64 MB (preferably 128 MB).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Master Slave Node Concept
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
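These steps come together in the job driver. A minimal sketch, assuming the word-count mapper and reducer from the example above; the use of a combiner and the command-line arguments (input path, output path) are illustrative choices, not part of the slide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCount.TokenizerMapper.class);
    // The combiner runs a local reduce on each mapper's output to cut network traffic.
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input files are split into blocks; Hadoop runs one map task per split,
    // sorts the map output, and sends it to the reducers.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}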
16 Why do we need Hadoop?
Apache Hadoop was born out of the need to process big data more quickly and
reliably.
Instead of using one large computer to store and process data, Hadoop uses
clusters of multiple computers to analyze massive datasets in parallel.
Hadoop can handle various forms of structured and unstructured data, which gives
companies greater speed and flexibility for:
collecting, storing, and accessing data
processing data
analyzing big data
17 What is Apache Hadoop used for/Examples?
Analytics and big data
Many companies and organizations use Hadoop for research, data processing, and analytics that require
processing terabytes or petabytes of big data, storing diverse datasets, and data parallel processing.
Vertical industries
Companies in vertical industries (e.g., technology, education, healthcare) rely on Hadoop for tasks
that share a common theme: high variety, volume, and velocity of structured and unstructured data.
AI and machine learning
Development of artificial intelligence and machine learning applications by using computational algorithms
in MapReduce.
Cloud computing
Companies often choose to run Hadoop clusters on public, private, or hybrid cloud resources versus
on-premises hardware to gain flexibility, availability, and cost control. Many cloud solution providers offer
fully managed services for Hadoop, such as Dataproc from Google Cloud.
https://cloud.google.com/dataproc
18 Hadoop: A Gamechanger
Facebook Data
IBM Data
eBay Data
Amazon Data
Applications of Hadoop:
Data Warehousing
Recommendation systems
Fraud detection
Sentiment analysis
19 Advantages/Benefits of Hadoop
Computing power
Hadoop has great computing power due to its distributed computing model and data nodes
Flexibility
Handles any kind of dataset (structured, semi-structured, unstructured) very
efficiently
Fault tolerance
Data is replicated across a cluster so that it can be recovered easily should disk, node, or
rack failures occur
Cost control
Hadoop is available freely (open source) https://hadoop.apache.org/
Open source framework innovation and Faster time to market
The collective power of an open source community delivers more ideas, quicker
development, and troubleshooting when issues arise, which translates into a faster time
to market. It is also compatible with many programming languages.
20 Disadvantages
Problem with small files/data
Hadoop performs efficiently over a small number of large files. Hadoop
stores files in the form of blocks of 128 MB in size (by default).
Hadoop struggles when it has to access a large number of small files (see the
rough metadata sketch after this list).
Vulnerability
Hadoop is a framework written in Java, and Java is one of the most commonly
used programming languages, which makes Hadoop easier to exploit through
cyber-criminal attacks.
No Continuous Real-time Data Processing
Apache Hadoop is built for batch processing: it takes a huge amount of data as
input, processes it, and produces the result.
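A rough Java sketch of why many small files strain the Name Node. The figure of roughly 150 bytes of Name Node memory per file/block object is a commonly cited estimate, assumed here only for illustration; the file sizes and counts are hypothetical.

// Compare Name Node metadata load: ~1 GB stored as one file vs. as 10,000 small files.
public class SmallFilesMath {
  public static void main(String[] args) {
    long bytesPerMetadataObject = 150;   // assumed rough estimate per file/block object
    long blockSizeMb = 128;              // default block size

    // Case 1: one 1 GB (1024 MB) file -> 1 file object + 8 block objects
    long largeFileObjects = 1 + (1024 / blockSizeMb);

    // Case 2: 10,000 files of ~100 KB each -> one file object and one block object per file
    long smallFileObjects = 10_000 * 2;

    System.out.println("Metadata for one large file (bytes): "
        + largeFileObjects * bytesPerMetadataObject);   // ~1.35 KB
    System.out.println("Metadata for 10,000 small files (bytes): "
        + smallFileObjects * bytesPerMetadataObject);   // ~3 MB
  }
}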
21 Activity:
Apply the MapReduce process step by step.
INPUT