Lect 2 Big Data Lesson01

This lesson introduces the concepts of Big Data and Hadoop. It defines Big Data as large data sets that cannot be processed by traditional software. It describes the three V's of Big Data as volume, velocity, and variety. It also defines the different types of data as unstructured, semi-structured, and structured. Finally, it provides an overview of Hadoop as an open-source framework for distributed storage and processing of large data sets across clusters of computers.

Uploaded by

Paritosh Belekar
Copyright
© All Rights Reserved

Lesson 1

Objectives
By the end of this lesson, you will be
able to:
 Explain the need for Big Data
 Define the concept of Big Data
 Describe the basics and benefits of
Hadoop

2
Need for Big Data

90% of the data in the world today has been created in the last two years alone.
Traditional structured formats have limitations when handling such large
quantities of data, so a mechanism like Big Data is needed to handle these
increasing volumes. Big Data relies on three important aspects of data
complexity: volume, velocity, and variety (explained in the following image).

3
What is Big Data

Defining Big Data
Big Data is the term applied to data sets whose size is beyond the ability of the commonly used software tools to capture, manage, and process within a tolerable elapsed time.

Sources of Big Data
● Web logs
● Sensor networks
● Social media
● Internet text and documents
● Internet pages
● Search index data
● Atmospheric science, astronomy, biochemical, and medical records
● Scientific research
● Military surveillance
● Photography archives

4
Types of Data
Three types of data can be identified:

Unstructured Data
Data that does not have a pre-defined data model
E.g., text files

Semi-structured Data
Data that does not have a formal data model, though it carries some structure (such as tags)
E.g., XML files

Structured Data
Data that is represented in a tabular format
E.g., databases
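The three data types above can be illustrated with a short, self-contained sketch. The sample text, XML snippet, and table below are made up for illustration only:

```python
# Illustrative only: tiny examples of the three data types.
import xml.etree.ElementTree as ET
import sqlite3

# Unstructured: free text with no pre-defined data model.
text = "Server rebooted at noon. Logs show nothing unusual."
words = text.split()           # any structure must be inferred by the reader

# Semi-structured: XML carries tags, but no fixed relational schema.
xml_doc = "<user><name>Alice</name><age>30</age></user>"
user = ET.fromstring(xml_doc)
name = user.find("name").text  # tags give partial structure

# Structured: tabular data with a fixed schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, age INTEGER)")
db.execute("INSERT INTO users VALUES ('Alice', 30)")
row = db.execute("SELECT name, age FROM users").fetchone()

print(len(words), name, row)   # 8 Alice ('Alice', 30)
```

The same fact ("Alice is 30") appears in all three forms; only the structured form lets software query it without extra parsing or inference.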

5
Handling Limitations of Big Data

How to handle system uptime and downtime:
● Commodity hardware for data storage and analysis
● Maintaining a copy of the same data across clusters

How to combine accumulated data from all the systems:
● Analyzing data across different machines
● Merging of data
6
Introduction to Hadoop

What is Hadoop?
● A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment
● Based on the Google File System (GFS)

Why Hadoop?
● Runs applications on distributed systems with thousands of nodes involving petabytes of data
● Its distributed file system provides fast data transfers among the nodes

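Hadoop processes these large data sets with the MapReduce pattern: map emits (key, value) pairs, the framework groups them by key (the shuffle), and reduce aggregates each group. As a rough single-machine illustration only, not Hadoop's actual Java API:

```python
# A minimal, single-machine sketch of the MapReduce pattern that
# Hadoop distributes across nodes.
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "data everywhere"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'everywhere': 1}
```

In a real cluster, map tasks run on the nodes holding the input blocks and reducers pull their groups over the network; the program logic stays this simple.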
7
History and Milestones of Hadoop

Hadoop originated from Nutch, an open-source search engine project designed to
work over distributed network nodes. Yahoo was the first company to adopt
Hadoop as a core part of its system operations. Today, Hadoop is a core
component in the systems of Facebook, LinkedIn, Twitter, and others.
Hadoop Milestones

8
Organizations Using Hadoop
A9.com (Amazon)
Cluster specifications: Clusters vary from 1 to 100 nodes
Uses:
● To build Amazon's product search indices
● To process millions of sessions daily for analytics

Yahoo
Cluster specifications: More than 100,000 CPUs in approximately 20,000 computers running Hadoop; the biggest cluster has 2,000 nodes (2*4 CPU boxes with 4 TB disk each)
Uses:
● To support research for ad systems and web search

AOL
Cluster specifications: 50 machines, Intel Xeon, dual processors, dual core, each with 16 GB RAM and an 800 GB hard disk, giving a total of 37 TB HDFS capacity
Uses:
● For a variety of functions ranging from data generation to running advanced algorithms for behavioral analysis and targeting

Facebook
Cluster specifications: 320-machine cluster with 2,560 cores and about 1.3 PB raw storage
Uses:
● To store copies of internal log and dimension data sources
● As a source for reporting, analytics, and machine learning
9
10
Quiz 1

Which type of data is handled by Hadoop?

a. Structured data
b. Semi-structured data
c. Unstructured data
d. Flexible-structure data

11
Quiz 1

Which type of data is handled by Hadoop?

a. Structured data
b. Semi-structured data
c. Unstructured data
d. Flexible-structure data

Answer: c.

Explanation: Hadoop handles unstructured data for processing.

12
Quiz 2

Which of the following is unstructured data?

a. Collection of text files
b. Collection of XML files
c. Collection of tables in databases
d. Collection of tickets

13
Quiz 2

Which of the following is unstructured data?

a. Collection of text files
b. Collection of XML files
c. Collection of tables in databases
d. Collection of tickets

Answer: a.

Explanation: Text files are usually unstructured data.

14
Quiz 3

Which of the following is structured data?

a. Collection of text files
b. Collection of tickets
c. Collection of tables in databases
d. Collection of XML files

15
Quiz 3

Which of the following is structured data?

a. Collection of text files
b. Collection of tickets
c. Collection of tables in databases
d. Collection of XML files

Answer: c.

Explanation: Databases are usually structured data.

16
Quiz 4

Which of the following is semi-structured data?

a. Collection of tables in databases
b. Collection of text files
c. Collection of tickets
d. Collection of XML files

17
Quiz 4

Which of the following is semi-structured data?

a. Collection of tables in databases
b. Collection of text files
c. Collection of tickets
d. Collection of XML files

Answer: d.

Explanation: XML files are usually semi-structured data.

18
Quiz 5

Which of the following aspects of Big Data refers to data size?

a. Volume
b. Velocity
c. Variety
d. Value

19
Quiz 5

Which of the following aspects of Big Data refers to data size?

a. Volume
b. Velocity
c. Variety
d. Value

Answer: a.

Explanation: Volume in Big Data refers to the size of the data to be processed.

20
Quiz 6

Which of the following aspects of Big Data refers to the speed of responding to a data request generated by the user?

a. Variety
b. Value
c. Velocity
d. Volume

21
Quiz 6

Which of the following aspects of Big Data refers to the speed of responding to a data request generated by the user?

a. Variety
b. Value
c. Velocity
d. Volume

Answer: c.

Explanation: Velocity in Big Data refers to the speed of responding to a data request generated by the user.

22
Quiz 7

Which of the following aspects of Big Data refers to multiple data sources?

a. Variety
b. Value
c. Volume
d. Velocity

23
Quiz 7

Which of the following aspects of Big Data refers to multiple data sources?

a. Variety
b. Value
c. Volume
d. Velocity

Answer: a.

Explanation: Variety in Big Data refers to multiple data sources.

24
Summary
Let us summarize the topics covered in this lesson:
● Big Data is the term applied to data sets whose size is beyond the ability
of the commonly used software tools to capture, manage, and process
within a tolerable elapsed time.
● Big Data relies on volume, velocity, and variety with respect to
processing.
● Data can be divided into three types: unstructured data, semi-structured
data, and structured data.
● Hadoop is a free, Java-based programming framework that supports the
processing of large data sets in a distributed computing environment.
● Hadoop is a software framework used by organizations like Facebook,
Yahoo, Amazon, and AOL.
25
26
