Lect 2 Big Data Lesson01

This lesson introduces the concepts of Big Data and Hadoop. It defines Big Data as large data sets that cannot be processed by traditional software. It describes the three V's of Big Data as volume, velocity, and variety. It also defines the different types of data as unstructured, semi-structured, and structured. Finally, it provides an overview of Hadoop as an open-source framework for distributed storage and processing of large data sets across clusters of computers.

Uploaded by

Paritosh Belekar
Copyright
© All Rights Reserved

Lesson 1

Objectives
By the end of this lesson, you will be
able to:
 Explain the need for Big Data
 Define the concept of Big Data
 Describe the basics and benefits of
Hadoop

2
Need for Big Data

90% of the data in the world today has been created in the last two years alone.
Traditional structured formats have limitations when handling such large
quantities of data, so a mechanism like Big Data is needed to handle these
increasing volumes. Big Data relies on three important aspects of data
complexity: volume, velocity, and variety (explained in the following image).

3
What is Big Data

Defining Big Data
Big Data is the term applied to data sets whose size is beyond the ability of the commonly used software tools to capture, manage, and process within a tolerable elapsed time.

Sources of Big Data
● Web logs
● Sensor networks
● Social media
● Internet text and documents
● Internet pages
● Search index data
● Atmospheric science, astronomy, biochemical, and medical records
● Scientific research
● Military surveillance
● Photography archives

4
Types of Data
Three types of data can be identified:

Unstructured Data
Data that does not have a pre-defined data model
E.g., text files

Semi-structured Data
Data that does not have a formal data model, though it carries some structure (such as tags)
E.g., XML files

Structured Data
Data that is represented in a tabular format
E.g., databases
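The three data types above can be illustrated with a short, self-contained sketch. The sample text, XML snippet, and table below are made up for illustration only:

```python
# Illustrative only: tiny examples of the three data types.
import xml.etree.ElementTree as ET
import sqlite3

# Unstructured: free text with no pre-defined data model.
text = "Server rebooted at noon. Logs show nothing unusual."
words = text.split()           # any structure must be inferred by the reader

# Semi-structured: XML carries tags, but no fixed relational schema.
xml_doc = "<user><name>Alice</name><age>30</age></user>"
user = ET.fromstring(xml_doc)
name = user.find("name").text  # tags give partial structure

# Structured: tabular data with a fixed schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, age INTEGER)")
db.execute("INSERT INTO users VALUES ('Alice', 30)")
row = db.execute("SELECT name, age FROM users").fetchone()

print(len(words), name, row)   # 8 Alice ('Alice', 30)
```

The same fact ("Alice is 30") appears in all three forms; only the structured form lets software query it without extra parsing or inference.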

5
Handling Limitations of Big Data

How to handle system uptime and downtime:
● Commodity hardware for data storage and analysis
● Maintaining a copy of the same data across clusters

How to combine accumulated data from all the systems:
● Analyzing data across different machines
● Merging of data
6
Introduction to Hadoop

What is Hadoop?
● A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment
● Based on the Google File System (GFS)

Why Hadoop?
● Runs applications on distributed systems with thousands of nodes involving petabytes of data
● Its distributed file system provides fast data transfers among the nodes

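Hadoop processes these large data sets with the MapReduce pattern: map emits (key, value) pairs, the framework groups them by key (the shuffle), and reduce aggregates each group. As a rough single-machine illustration only, not Hadoop's actual Java API:

```python
# A minimal, single-machine sketch of the MapReduce pattern that
# Hadoop distributes across nodes.
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "data everywhere"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'everywhere': 1}
```

In a real cluster, map tasks run on the nodes holding the input blocks and reducers pull their groups over the network; the program logic stays this simple.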
7
History and Milestones of Hadoop

Hadoop originated from Nutch, an open-source search engine project designed to
work over distributed network nodes. Yahoo was the first company to adopt
Hadoop as a core part of its system operations. Today, Hadoop is a core
component in the systems of Facebook, LinkedIn, Twitter, and others.
Hadoop Milestones

8
Organizations Using Hadoop
A9.com (Amazon)
Cluster specifications: Clusters vary from 1 to 100 nodes
Uses:
● To build Amazon's product search indices
● To process millions of sessions daily for analytics

Yahoo
Cluster specifications: More than 100,000 CPUs in approximately 20,000 computers running Hadoop; the biggest cluster has 2,000 nodes (2*4 CPU boxes with 4 TB disk each)
Uses:
● To support research for ad systems and web search

AOL
Cluster specifications: 50 machines, Intel Xeon, dual processors, dual core, each with 16 GB RAM and an 800 GB hard disk, giving a total of 37 TB HDFS capacity
Uses:
● For a variety of functions ranging from data generation to running advanced algorithms for behavioral analysis and targeting

Facebook
Cluster specifications: 320-machine cluster with 2,560 cores and about 1.3 PB raw storage
Uses:
● To store copies of internal log and dimension data sources
● As a source for reporting, analytics, and machine learning
9
10
Quiz 1

Which type of data is handled by Hadoop?

a. Structured data
b. Semi-structured data
c. Unstructured data
d. Flexible-structure data

11
Quiz 1

Which type of data is handled by Hadoop?

a. Structured data
b. Semi-structured data
c. Unstructured data
d. Flexible-structure data

Answer: c.

Explanation: Hadoop handles unstructured data for processing.

12
Quiz 2

Which of the following is unstructured data?

a. Collection of text files
b. Collection of XML files
c. Collection of tables in databases
d. Collection of tickets

13
Quiz 2

Which of the following is unstructured data?

a. Collection of text files
b. Collection of XML files
c. Collection of tables in databases
d. Collection of tickets

Answer: a.

Explanation: Text files are usually unstructured data.

14
Quiz 3

Which of the following is structured data?

a. Collection of text files
b. Collection of tickets
c. Collection of tables in databases
d. Collection of XML files

15
Quiz 3

Which of the following is structured data?

a. Collection of text files
b. Collection of tickets
c. Collection of tables in databases
d. Collection of XML files

Answer: c.

Explanation: Databases are usually structured data.

16
Quiz 4

Which of the following is semi-structured data?

a. Collection of tables in databases
b. Collection of text files
c. Collection of tickets
d. Collection of XML files

17
Quiz 4

Which of the following is semi-structured data?

a. Collection of tables in databases
b. Collection of text files
c. Collection of tickets
d. Collection of XML files

Answer: d.

Explanation: XML files are usually semi-structured data.

18
Quiz 5

Which of the following aspects of Big Data refers to data size?

a. Volume
b. Velocity
c. Variety
d. Value

19
Quiz 5

Which of the following aspects of Big Data refers to data size?

a. Volume
b. Velocity
c. Variety
d. Value

Answer: a.

Explanation: Volume in Big Data refers to the size of the data to be processed.

20
Quiz 6

Which of the following aspects of Big Data refers to the speed of responding to a data request generated by the user?

a. Variety
b. Value
c. Velocity
d. Volume

21
Quiz 6

Which of the following aspects of Big Data refers to the speed of responding to a data request generated by the user?

a. Variety
b. Value
c. Velocity
d. Volume

Answer: c.

Explanation: Velocity in Big Data refers to the speed of responding to a data request generated by the user.

22
Quiz 7

Which of the following aspects of Big Data refers to multiple data sources?

a. Variety
b. Value
c. Volume
d. Velocity

23
Quiz 7

Which of the following aspects of Big Data refers to multiple data sources?

a. Variety
b. Value
c. Volume
d. Velocity

Answer: a.

Explanation: Variety in Big Data refers to multiple data sources.

24
Summary
Let us summarize the topics covered in this lesson:
● Big Data is the term applied to data sets whose size is beyond the ability
of the commonly used software tools to capture, manage, and process
within a tolerable elapsed time.
● Big Data relies on volume, velocity, and variety with respect to
processing.
● Data can be divided into three types: unstructured data, semi-structured
data, and structured data.
● Hadoop is a free, Java-based programming framework that supports the
processing of large data sets in a distributed computing environment.
● Hadoop is a software framework used by organizations like Facebook,
Yahoo, Amazon, and AOL.
25
26
