
CHAPTER TWO

Data Science

Page 2

Main Contents
 Overview of Data Science
 Data and Information
 Data Processing Cycle
 Data Types and their Representation
 Data Value Chain
 Basic Concepts of Big Data
 Clustered Computing and Hadoop Ecosystem

Overview of Data Science Page 3

 Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
 Let’s consider this idea by thinking about some of the data involved in buying a box of cereal from the store or supermarket:
Whatever your cereal preference (teff, wheat, or barley), you prepare for the purchase by writing “cereal” in your notebook. This planned purchase, even though it is only written in pencil, is a piece of data that you can read. (This is an example of data.)


Data and Information Page 4

Data
 Is the representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
 It can be described as unprocessed facts and figures.
 Can be represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
Data and Information … Page 5

Information
 Is the processed data on which decisions and actions are based.
 It is data that has been processed into a form that is meaningful to its receivers.
 Information is interpreted data: created from organized, structured, and processed data in a particular context.


Data Processing Cycle Page 6

 Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
 Data processing consists of the following basic steps:
 Input, processing, and output
 These three steps constitute the data processing cycle.

[Figure: the data processing cycle: Input → Processing → Output]
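
To make the cycle concrete, here is a minimal Python sketch of the three steps; the exam scores and the averaging step are made-up illustrative examples:

```python
# A minimal sketch of the data processing cycle in Python.
# The exam scores below are hypothetical example data.

def main():
    # Input: raw, unprocessed facts (data), prepared in a convenient form
    scores = [85, 92, 78, 90]

    # Processing: the data is changed into a more useful form
    average = sum(scores) / len(scores)

    # Output: the result (information) in a form suited to its use
    print(f"Average score: {average:.1f}")

if __name__ == "__main__":
    main()
```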
Data Processing Cycle… Page 7

Input
 In this step, the input data is prepared in some convenient form for processing.
 The form will depend on the processing machine.
 Any information that is provided to a computer or a software program is known as input.
 The input enables the computer to do what it is designed to do and produce an output.
Example: keyboard, mouse, ...
Data Processing Cycle… Page 8

Processing
In this step, the input data is changed to produce data in a more useful form.
Example: CPU, GPU, network interface cards, ...
Data Processing Cycle… Page 9

Output
At this stage, the result of the preceding processing step is collected.
The particular form of the output data depends on the use of the data.
Example: monitor, printer, projector, ...


Data Types and their Representation Page 10

 In computer programming, a data type is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
Data types from the computer programming perspective
 The common data types include:
 Integers (int): used to store whole numbers
 Booleans (bool): used to represent true or false
 Characters (char): used to store a single character like “A”
 Floating-point numbers (float): used to store real numbers
 Alphanumeric strings (string): used to store a combination of characters and numbers like “ddu01256”
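
As an illustration, the same common data types can be written in Python; note that Python has no separate char type, so a one-character string stands in, and all values here are made-up examples:

```python
# The common data types expressed in Python (illustrative values only).
count = 42               # integer (int): a whole number
passed = True            # boolean (bool): true or false
grade = "A"              # character: Python uses a 1-character string
price = 19.99            # floating-point number (float): a real number
student_id = "ddu01256"  # string (str): characters and digits combined

print(type(count), type(passed), type(grade), type(price), type(student_id))
```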


Data Types and their Representation Page 11

Data types from the data analytics perspective
 From a data analytics point of view, it is important to understand that there are three common data types or structures:
 Structured,
 Semi-structured, and
 Unstructured data types
 A fourth data type is metadata, which is data about data.
 The following figure describes the three types of data and metadata.
Data Types and their Representation… Page 12

Structured Data
 Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
 Structured data conforms to a tabular format with a relationship between the different rows and columns.
Example: Excel files, Comma-Separated Value files (.csv), and SQL database files.
 Each of these has structured rows and columns that can be sorted.
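
A small sketch of structured data using Python's standard csv module; the students.csv file and its columns are hypothetical:

```python
import csv

# Hypothetical structured data: every row follows the same pre-defined columns.
rows = [
    {"id": "1", "name": "Abebe", "score": "85"},
    {"id": "2", "name": "Sara", "score": "92"},
]

# Write the rows to a CSV file (a common structured format).
with open("students.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Because the structure is pre-defined, rows can be read back and sorted easily.
with open("students.csv", newline="") as f:
    by_score = sorted(csv.DictReader(f), key=lambda r: int(r["score"]), reverse=True)
print([r["name"] for r in by_score])  # ['Sara', 'Abebe']
```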
Data Types and their Representation… Page 13

Semi-structured Data
 Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables.
 It contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as a self-describing structure.
Examples: JSON (JavaScript Object Notation) and XML (Extensible Markup Language) are forms of semi-structured data.
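
To see the self-describing structure, here is a small JSON record parsed with Python's json module; the student record itself is an invented example:

```python
import json

# A hypothetical semi-structured JSON record: the keys act as tags that
# separate semantic elements, and nesting enforces a hierarchy of fields.
record = """
{
  "student": {
    "id": "ddu01256",
    "name": "Abebe",
    "courses": ["Data Science", "Artificial Intelligence"]
  }
}
"""

data = json.loads(record)
print(data["student"]["name"])        # Abebe
print(data["student"]["courses"][0])  # Data Science
```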
Data Types and their Representation… Page 14

Unstructured Data
 Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
 Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well.
 This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
Example: audio files, video files, and NoSQL (Not Only SQL) databases.
Data Types and their Representation… Page 15

Metadata (Data about Data)


 From a technical point of view, this is not a separate data structure, but it is one of the most important elements for Big Data analysis and big data solutions.
 Metadata is data about data.
 It provides additional information about a specific set of data.
 Metadata is frequently used by Big Data solutions for initial analysis.
 In a set of photographs, for example, metadata could describe when and where the photos were taken. The metadata then provides fields for dates and locations which, by themselves, can be considered structured data.
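
A brief sketch of the photo example: the metadata fields are structured even though the photos themselves are not (file names, dates, and locations below are invented):

```python
# Hypothetical metadata for a set of photos: the image content is
# unstructured, but the date and location fields are structured data.
photos = [
    {"file": "img_001.jpg", "taken": "2023-05-01", "location": "Dire Dawa"},
    {"file": "img_002.jpg", "taken": "2023-06-03", "location": "Addis Ababa"},
]

# Structured metadata can be queried directly, e.g. all photos taken in May 2023.
may_photos = [p["file"] for p in photos if p["taken"].startswith("2023-05")]
print(may_photos)  # ['img_001.jpg']
```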
Data Types and their Representation… Page 16

[Figure: Metadata]
Data Value Chain Page 17

 The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
 The data value chain describes the evolution of data from collection to analysis, dissemination, and the final impact of data on decision-making.
 The Big Data Value Chain identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
Data Value Chain… Page 18

Data Acquisition
 It is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
 Data acquisition is one of the major big data challenges in terms of infrastructure requirements.
 The infrastructure required to support the acquisition of big data must deliver low, predictable latency both in capturing data and in executing queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible and dynamic data structures.
Data Value Chain… Page 19

Data Analysis
 It is concerned with making the raw data acquired amenable to use in decision-making, as well as domain-specific usage.
 Data analysis involves:
 Exploring,
 Transforming, and
 Modeling data
 The main goal of data analysis is highlighting relevant data and synthesizing and extracting useful hidden information with high potential from a business point of view.
 Related areas include data mining, business intelligence, and machine learning.
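
A minimal sketch of the three activities (exploring, transforming, modeling) on hypothetical monthly sales figures; the linear_regression helper needs Python 3.10 or newer:

```python
import statistics

# Hypothetical monthly sales figures.
sales = [120, 135, 99, 180, 175, 160]

# Exploring: summarize the raw data.
print(min(sales), max(sales), statistics.mean(sales))

# Transforming: normalize the values into the 0-1 range.
lo, hi = min(sales), max(sales)
normalized = [(s - lo) / (hi - lo) for s in sales]

# Modeling: fit a simple linear trend of sales against the month index.
months = range(len(sales))
slope, intercept = statistics.linear_regression(months, sales)
print(f"trend: {slope:+.1f} units per month")
```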
Data Value Chain… Page 20

Data Curation
 It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
 Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
 Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
 Data curators (scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for purpose.
 A key trend for the curation of big data utilizes community and crowdsourcing approaches.
Data Value Chain… Page 21

Data Storage
 It is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
 Relational Database Management Systems (RDBMS) have been the main, and almost only, solution to the storage paradigm for nearly 40 years.
 Not Only SQL (NoSQL) technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
Data Value Chain… Page 22

Data Usage
 It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
 Data usage in business decision-making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
Basic Concepts of Big Data Page 23

 Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and gain insights from large datasets.
 While the problem of working with data that exceeds the computing power or storage of a single computer is not new, the pervasiveness, scale, and value of this type of computing have greatly expanded in recent years.
Basic Concepts of Big Data Page 24

What is Big Data?


 Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
 In this context, a “large dataset” means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
 Big data is characterized by the 3Vs and more: Volume, Velocity, and Variety, plus Veracity.
Basic Concepts of Big Data Page 25

Characteristics of Big Data


 Volume: large amounts of data (massive datasets)
 Velocity: data is live-streaming or in motion
 Variety: data comes in many different forms from diverse sources
 Veracity: can we trust the data? How accurate is it?


Clustered Computing and Hadoop Ecosystem Page 26

Clustered Computing
Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages.
To better address the high storage and computational needs of big data, computer clusters are a better fit.
Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
 Resource Pooling
 High Availability
 Easy Scalability
Clustered Computing and Hadoop Ecosystem… Page 27

Resource Pooling
 Combining the available storage space to hold data.
High Availability
 Clusters can provide availability guarantees that prevent hardware or software failures from affecting access to data and processing.
Easy Scalability
 Clusters make it easy to scale horizontally by adding additional machines to the group. This means the system can react to changes in resource requirements without expanding the physical resources on a machine.
Clustered Computing and Hadoop Ecosystem… Page 28

 Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.
 Cluster membership and resource allocation can be handled by software like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
 The assembled computing cluster often acts as a foundation that other software interfaces with to process the data.


Clustered Computing and Hadoop Ecosystem… Page 29

Hadoop and its Ecosystem


Hadoop is an open-source framework intended to make interaction with big data easier.
 It is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.
 It is inspired by a technical document published by Google. The four key characteristics of Hadoop are:
 Economical
 Reliable
 Scalable
 Flexible
Clustered Computing and Hadoop Ecosystem… Page 30

The key characteristics of Hadoop:
 Economical: its systems are highly economical, as ordinary computers can be used for data processing.
 Reliable: it is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
 Scalable: it is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
 Flexible: it is flexible, and you can store as much structured and unstructured data as you need and decide how to use it later.


Clustered Computing and Hadoop Ecosystem… Page 31

 Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
 It is continuously growing to meet the needs of Big Data.
 It comprises the following main components, among many others:

• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• ZooKeeper: cluster management
• Oozie: job scheduling
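
MapReduce expresses a computation as a map phase that emits key-value pairs, a shuffle that groups the pairs by key, and a reduce phase that aggregates each group. Below is a pure-Python sketch of the classic word-count pattern; it only mimics the model and is not actual Hadoop MapReduce code:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: lines of text, as a file in HDFS might be split.
lines = ["big data needs big clusters", "hadoop processes big data"]

# Map: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: sort and group the pairs by key (the word).
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# Reduce: sum the counts within each group.
counts = {word: sum(count for _, count in pairs) for word, pairs in grouped}
print(counts)  # {'big': 3, 'clusters': 1, 'data': 2, ...}
```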
Clustered Computing and Hadoop Ecosystem… Page 32
Big Data Life Cycle with Hadoop (Stages) Page 33

1. Ingesting data into the system:
 The first stage of Big Data processing is Ingest.
 The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files.
 Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2. Processing the data in storage:
 The second stage is Processing.
 In this stage, the data is stored and processed.
 The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.



Big Data Life Cycle with Hadoop… Page 34

3. Computing and analyzing data:
 The third stage is Analyze.
 Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
 Pig converts the data using MapReduce and then analyzes it.
 Hive is also based on MapReduce programming and is most suitable for structured data.
4. Visualizing the results:
 The fourth stage is Access, which is performed by tools such as Hue and Cloudera Search.
 In this stage, the analyzed data can be accessed by users.
Page 35

?
END OF CHAPTER TWO
Next: Chapter Three [Artificial Intelligence]
