1 LECTURE 11
Zaeem Anwaar
Assistant Director IT
2 Introduction to Hadoop
Hadoop's history begins in the early 21st century (roughly 2001–2004),
when the internet was becoming popular and the number of users was growing daily.
Before that, data volumes were smaller and data was stored in rows and columns (mostly
documents).
Structured data – relational data (rows and columns) – was easy to store due to its low
volume/size.
A single processing unit was used, which was sufficient for structured data.
When the data changed, the first change was in its type:
data became semi-structured and unstructured (emails, audio, video, images, text, etc.).
A single storage and processing unit could no longer do the whole job.
The big data revolution followed (blogs, social media).
3 Big Data
A collection of datasets so large and complex that it becomes difficult to store,
maintain, access, process, and visualize them using on-hand database management systems or
traditional data processing applications.
Data is classified as big data using the 5 V's:
Volume (data should come in the form of huge datasets, e.g. petabytes; 1 petabyte = 1,024
terabytes (TB))
Variety (formats: structured (rows and columns), semi-structured (.CSV, .XML), and
unstructured (audio, video, text, etc.))
Velocity (new data is generated at an alarming rate), e.g. IoT, blogs, banks, social media
Value (finding the correct, meaningful data)
Veracity (uncertainty and inconsistency of the data, addressed by normalization, e.g. missing
data or NaN values)
4 Distributed Computing:
HDFS architecture (diagram): a 512 MB file is split into blocks of 128 MB (the default block
size) and the blocks are distributed across Data Node 1, Data Node 2, Data Node 3, and Data
Node 4, with replication for fault tolerance.
The Name Node (Master Node / "Boss" Node) holds the metadata: file name, size, and block
locations. Secondary Name Nodes keep a backup of the Name Node's metadata.
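A small arithmetic sketch (in Java) of the diagram above. The 512 MB file and the 128 MB default block size come from the slide; the replication factor of 3 is an assumption based on the usual HDFS default, and the class and variable names are only illustrative.

// Block math for the HDFS diagram: a 512 MB file with a 128 MB block size
// yields 4 blocks, each stored on a data node and replicated across the cluster.
public class HdfsBlockMath {
  public static void main(String[] args) {
    long fileSizeMb = 512;       // example file size from the slide
    long blockSizeMb = 128;      // HDFS default block size
    int replicationFactor = 3;   // assumed default replication factor

    long blocks = (long) Math.ceil((double) fileSizeMb / blockSizeMb);
    long rawStorageMb = fileSizeMb * replicationFactor;

    System.out.println("Blocks: " + blocks);                                  // 4
    System.out.println("Raw storage across the cluster (MB): " + rawStorageMb); // 1536
  }
}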
11 Goals of HDFS
Fault detection and recovery
Huge datasets
HDFS should have hundreds of nodes per cluster to manage applications
with huge datasets.
Hardware at data
A requested task can be done efficiently when the computation takes place
near the data. Especially where huge datasets are involved, this reduces
network traffic and increases throughput. (Sometimes data is saved
locally to avoid traffic and bandwidth problems.)
12 MapReduce
Used to access and process the data (the processing element of Hadoop).
Traditionally, data processing was done on a single machine with a single processor; it
consumed more time and was inefficient, especially when processing data of large variety and
volume.
MapReduce accesses the data using distributed and parallel computation algorithms, dividing it
into parts according to the user's query (divide and conquer).
MapReduce example (diagrams on the next two slides; a code sketch follows them):
13–14 MapReduce example (diagram slides)
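A minimal Java sketch of the Map and Reduce phases, modeled on the classic Hadoop word-count example; the class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  // Reduce phase: after the shuffle/sort, sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // emit (word, total count)
    }
  }
}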
15 How does Hadoop work?
Hadoop runs code across a cluster of computers. This process includes the
following core tasks that Hadoop performs:
Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128 MB or 64 MB (preferably 128 MB).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Master Slave Node Concept
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
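These steps come together in the job driver. A minimal sketch, assuming the word-count mapper and reducer from the example above; the use of a combiner and the command-line arguments (input path, output path) are illustrative choices, not part of the slide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCount.TokenizerMapper.class);
    // The combiner runs a local reduce on each mapper's output to cut network traffic.
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input files are split into blocks; Hadoop runs one map task per split,
    // sorts the map output, and sends it to the reducers.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}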
16 Why do we need Hadoop?
Apache Hadoop was born out of the need to process big data more quickly and
reliably.
Instead of using one large computer to store and process data, Hadoop uses
clusters of multiple computers to analyze massive datasets in parallel.
Hadoop can handle various forms of structured and unstructured data, which gives
companies greater speed and flexibility for:
collecting, storing, and accessing data
processing data
analyzing big data
17 What is Apache Hadoop used for/Examples?
Analytics and big data
Many companies and organizations use Hadoop for research, data processing, and analytics that require
processing terabytes or petabytes of big data, storing diverse datasets, and data parallel processing.
Vertical industries
Companies in vertical industries (e.g., technology, education, healthcare) rely on Hadoop for tasks
that share a common theme: high variety, volume, and velocity of structured and unstructured data.
AI and machine learning
Development of artificial intelligence and machine learning applications by using computational algorithms
in MapReduce.
Cloud computing
Companies often choose to run Hadoop clusters on public, private, or hybrid cloud resources versus
on-premises hardware to gain flexibility, availability, and cost control. Many cloud solution providers offer
fully managed services for Hadoop, such as Dataproc from Google Cloud.
https://cloud.google.com/dataproc
18 Hadoop: A Gamechanger
Facebook Data
IBM Data
eBay Data
Amazon Data
Applications of Hadoop:
Data Warehousing
Recommendation systems
Fraud detection
Sentiment analysis
19 Advantages/Benefits of Hadoop
Computing power
Hadoop has great computing power due to its distributed computing model and data nodes
Flexibility
Handles any kind of dataset (structured, semi-structured, unstructured) very
efficiently
Fault tolerance
Data is replicated across a cluster so that it can be recovered easily should disk, node, or
rack failures occur
Cost control
Hadoop is available freely (open source) https://hadoop.apache.org/
Open source framework innovation and Faster time to market
The collective power of an open source community delivers more ideas, quicker
development, and troubleshooting when issues arise, which translates into a faster time
to market. It is also compatible with many programming languages.
20 Disadvantages
Problem with small files/data
Hadoop performs efficiently over a small number of large files. Hadoop
stores files in the form of blocks of 128 MB in size (by default).
Hadoop struggles when it has to access a large number of small files (see the
rough metadata sketch after this list).
Vulnerability
Hadoop is a framework written in Java, and Java is one of the most commonly
used programming languages, which makes Hadoop easier to exploit through
cyber-criminal attacks.
No Continuous Real-time Data Processing
Apache Hadoop is built for batch processing: it takes a huge amount of data as
input, processes it, and produces the result.
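A rough Java sketch of why many small files strain the Name Node. The figure of roughly 150 bytes of Name Node memory per file/block object is a commonly cited estimate, assumed here only for illustration; the file sizes and counts are hypothetical.

// Compare Name Node metadata load: ~1 GB stored as one file vs. as 10,000 small files.
public class SmallFilesMath {
  public static void main(String[] args) {
    long bytesPerMetadataObject = 150;   // assumed rough estimate per file/block object
    long blockSizeMb = 128;              // default block size

    // Case 1: one 1 GB (1024 MB) file -> 1 file object + 8 block objects
    long largeFileObjects = 1 + (1024 / blockSizeMb);

    // Case 2: 10,000 files of ~100 KB each -> one file object and one block object per file
    long smallFileObjects = 10_000 * 2;

    System.out.println("Metadata for one large file (bytes): "
        + largeFileObjects * bytesPerMetadataObject);   // ~1.35 KB
    System.out.println("Metadata for 10,000 small files (bytes): "
        + smallFileObjects * bytesPerMetadataObject);   // ~3 MB
  }
}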
21 Activity:
Apply the MapReduce process step by step.
INPUT