Chapter 2 Data Science

Data science is a multi-disciplinary field focused on extracting knowledge and insights from various forms of data using scientific methods and algorithms. It offers significant advantages, such as fraud detection and improved decision-making, but also faces challenges like data variety and a lack of skilled professionals. The document also discusses data types, the data processing cycle, and the concept of big data, highlighting its characteristics and applications.

Uploaded by

daniel

Chapter Two

Data Science
2.1. Overview of Data Science

• Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
• Data science is the area of study that involves extracting insights from vast amounts of data using various scientific methods, algorithms, and processes. It helps you discover hidden patterns in raw data.
Overview of Data Science (I)

• Data science is an interdisciplinary field that allows you to extract knowledge from structured or unstructured data.
• Data science enables you to translate a business problem into a research project and then translate the results back into a practical solution.
Significant advantages of using Data Science

• Data is the oil of today's world. With the right tools, technologies, and algorithms, we can convert data into a distinctive business advantage.
• Data science can help you detect fraud using advanced machine learning algorithms.
• It helps you prevent significant monetary losses.
Significant advantages of using Data Science (II)

• It allows you to build intelligent capabilities into machines.
• You can perform sentiment analysis to gauge customer brand loyalty.
• It enables you to make better and faster decisions.
• It helps you recommend the right product to the right customer to enhance your business.
Challenges of Data Science

• A high variety of information and data is required for accurate analysis.
• The available data science talent pool is inadequate.
• Management does not provide financial support for a data science team.
• Data is unavailable or difficult to access.
Challenges of Data Science (I)

• Data science results are not used effectively by business decision makers.
• Explaining data science to others is difficult.
• Privacy issues.
• Lack of significant domain expertise.
• A very small organization may not be able to afford a data science team.
What are data and information?

• Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
• It can be described as unprocessed facts and figures.
• It is represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
What are data and information? (I)

• Information is processed data on which decisions and actions are based.
• Information is data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in the recipient's current or prospective actions or decisions.
• Furthermore, information is interpreted data, created from organized, structured, and processed data in a particular context.
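
The distinction between data and information can be sketched in a few lines of Python. The exam scores and the pass mark of 50 are hypothetical values chosen only for illustration; the point is that the raw list is data, while the computed average and pass rate are information a reader could act on.

```python
from statistics import mean

# Data: unprocessed facts and figures (hypothetical exam scores).
raw_scores = [72, 45, 88, 91, 38, 64]
PASS_MARK = 50  # assumed threshold, for illustration only

# Information: the same data processed into a form that supports a decision.
average_score = mean(raw_scores)
pass_rate = sum(score >= PASS_MARK for score in raw_scores) / len(raw_scores)

print(f"Average score: {average_score:.1f}")
print(f"Pass rate: {pass_rate:.0%}")
```
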
Data Processing Cycle

• Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• Data processing consists of three basic steps: input, processing, and output. These three steps constitute the data processing cycle.

Fig. 1. Data processing cycle
Data Processing Cycle (I)

• Input: in this step, the input data is prepared in some convenient form for processing.
  • The form depends on the processing machine.
  • For example, when electronic computers are used, the input data can be recorded on any of several types of storage media, such as a hard disk, CD, or flash disk.
• Processing: in this step, the input data is changed to produce data in a more useful form.
  • For example, interest can be calculated on a bank deposit, or a summary of sales for the month can be calculated from the sales orders.
Data Processing Cycle (II)

• Output: at this stage, the result of the preceding processing step is collected.
  • The particular form of the output data depends on the use of the data.
  • For example, output data may be the payroll for employees.
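
The three steps of the cycle can be sketched as a small pipeline, using the bank-interest example from the slides. The account records, the 5% rate, and the function names `process` and `output` are assumptions made for this illustration.

```python
# Input: records prepared in a convenient form (hypothetical deposits).
deposits = [
    {"account": "A-001", "balance": 1000.0},
    {"account": "A-002", "balance": 2500.0},
]
ANNUAL_RATE = 0.05  # assumed interest rate, for illustration only


def process(records, rate):
    """Processing: turn raw balances into a more useful form (interest due)."""
    return [
        {"account": r["account"], "interest": round(r["balance"] * rate, 2)}
        for r in records
    ]


def output(results):
    """Output: collect the results in the form the reader needs."""
    return [f"{r['account']}: interest {r['interest']:.2f}" for r in results]


report = output(process(deposits, ANNUAL_RATE))
for line in report:
    print(line)
```

Keeping the three steps as separate pieces makes it easy to change the input medium or the output form without touching the calculation.
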
Data types and their representation

• Data types can be described from diverse perspectives.
• In computer science and computer programming, for instance, a data type is simply an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
Data types from Computer programming perspective

• Almost all programming languages explicitly include the notion of data type, though different languages may use different terminology. Common data types include:
  • Integers (int): used to represent whole numbers, mathematically known as integers
  • Booleans (bool): used to represent values restricted to one of two options: true or false
  • Characters (char): used to represent a single character
  • Floating-point numbers (float): used to represent real numbers
  • Alphanumeric strings (string): used to represent a combination of characters and numbers
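
As a rough illustration, the common data types listed above map onto Python's built-in types as follows. The variable names and values are hypothetical. Python has no separate character type, so a one-character string stands in for `char` here.

```python
count = 42          # integer (int): a whole number
is_valid = True     # Boolean (bool): either True or False
grade = "A"         # character: Python has no char type, so a 1-character str is used
price = 19.99       # floating-point number (float): approximates a real number
label = "Room 101"  # alphanumeric string (str): letters and digits together

# The interpreter tracks the type of each value and uses it to decide
# which operations are allowed.
for value in (count, is_valid, grade, price, label):
    print(type(value).__name__, repr(value))
```
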
Data types from Data Analytics perspective

• From a data analytics point of view, it is important to understand that there are three common data types or structures:
  • Structured
  • Semi-structured
  • Unstructured
Structured Data

• Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
• Structured data conforms to a tabular format with a relationship between the different rows and columns.
• Common examples of structured data are Excel files or SQL databases.
• Each of these has structured rows and columns that can be sorted.
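
A minimal sketch of structured data, using Python's standard `sqlite3` module with an in-memory database. The `sales` table and its rows are hypothetical. Because every row follows the same pre-defined columns, computing and sorting by a derived value takes a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema is the pre-defined data model: every row has the same columns.
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("pen", 10, 1.5), ("notebook", 4, 3.0), ("bag", 1, 20.0)],
)

# Rows and columns can be combined and sorted directly.
rows = conn.execute(
    "SELECT product, quantity * price AS revenue FROM sales ORDER BY revenue DESC"
).fetchall()
conn.close()

print(rows)
```
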
Semi-structured Data

• Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
• Therefore, it is also known as a self-describing structure.
• JSON and XML are common examples of semi-structured data.
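
The "self-describing" property can be seen by parsing the same hypothetical student record in both formats with Python's standard `json` and `xml.etree.ElementTree` modules. The field names and values are invented for this sketch; the keys and tags travel with the data, so no external table schema is needed to locate a field.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records: the keys and tags describe the data they contain.
json_text = '{"name": "Sara", "courses": [{"code": "ET101", "grade": "A"}]}'
xml_text = "<student><name>Sara</name><course code='ET101'>A</course></student>"

record = json.loads(json_text)
root = ET.fromstring(xml_text)

# Navigate the hierarchy using the markers embedded in the data itself.
json_grade = record["courses"][0]["grade"]
xml_course = root.find("course")
xml_grade = xml_course.text
xml_code = xml_course.get("code")

print(json_grade, xml_grade, xml_code)
```
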
Unstructured Data

• Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner.
• Unstructured information is typically text-heavy but may also contain data such as dates, numbers, and facts.
• This results in irregularities and ambiguities that make it difficult to understand using traditional programs, compared with data stored in structured databases.
• Common examples of unstructured data include audio files, video files, and free-form content held in NoSQL databases.
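
To illustrate why unstructured text is harder to work with, the sketch below pulls dates and amounts out of a hypothetical free-text note using regular expressions. The note and the patterns are assumptions for this example. The patterns only work because the note happens to use one date format and one amount format, which is exactly the kind of irregularity real unstructured data does not guarantee.

```python
import re

# Hypothetical free-text note: facts are present but not organized into fields.
note = (
    "Customer called on 2024-03-15 about order 7731. "
    "Refund of 49.90 promised; follow up by 2024-03-22."
)

# Extract ISO-style dates and two-decimal amounts with simple patterns.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", note)
amounts = [float(m) for m in re.findall(r"\b\d+\.\d{2}\b", note)]

print(dates, amounts)
```
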
Metadata – Data about Data

• The last category of data type is metadata.
• From a technical point of view, this is not a separate data structure, but it is one of the most important elements for big data analysis and big data solutions.
• Metadata is data about data.
• It provides additional information about a specific set of data.
• In a set of photographs, for example, metadata could describe when and where the photos were taken.
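
As a small illustration, the file system already keeps metadata about every file. The sketch below writes a placeholder file (not a real photograph) to a temporary folder and reads back its name, size, and modification time with `os.stat`. The file name and contents are hypothetical; real photo metadata such as camera location would need an image library, which is outside this sketch.

```python
import os
import tempfile
from datetime import datetime, timezone

payload = b"placeholder bytes, not a real image"  # the data itself

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "photo.jpg")
    with open(path, "wb") as handle:
        handle.write(payload)

    # Metadata: facts about the file, separate from its contents.
    info = os.stat(path)
    metadata = {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }

print(metadata)
```
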
Data Value Chain

• The Data Value Chain is introduced to describe the information flow within a big data system as a series of steps needed to generate value and useful insights from data. The Big Data Value Chain identifies the following key high-level activities:

Fig. 2. Data Value Chain
1. Data Acquisition

• It is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
• Data acquisition is one of the major big data challenges in terms of infrastructure requirements.
• The infrastructure required to support the acquisition of big data must deliver low, predictable latency in both capturing data and executing queries; handle very high transaction volumes, often in a distributed environment; and support flexible and dynamic data structures.
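
The gather, filter, and clean steps can be sketched on a tiny scale. The sensor readings and the `clean` helper below are hypothetical; a real acquisition pipeline would read from files, queues, or APIs, but the shape of the work is similar.

```python
# Gather: raw readings as they might arrive from a source (hypothetical).
raw_readings = [
    {"sensor": "s1", "temp": " 21.5 "},
    {"sensor": "s2", "temp": "n/a"},
    {"sensor": "", "temp": "19.0"},
    {"sensor": "s3", "temp": "22.25"},
]


def clean(record):
    """Return a cleaned record, or None if it cannot be used."""
    sensor = record["sensor"].strip()
    try:
        temp = float(record["temp"])  # float() tolerates surrounding spaces
    except ValueError:
        return None  # filter: unparseable measurement
    if not sensor:
        return None  # filter: missing sensor identifier
    return {"sensor": sensor, "temp": temp}


# Only cleaned, usable records go on to storage and analysis.
cleaned = [c for c in map(clean, raw_readings) if c is not None]
print(cleaned)
```
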
2. Data Analysis

• It is concerned with making the acquired raw data amenable to use in decision-making as well as domain-specific usage.
• Data analysis involves exploring, transforming, and modeling data with the goal of highlighting relevant data and of synthesizing and extracting useful hidden information with high potential from a business point of view.
• Related areas include data mining, business intelligence, and machine learning.
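
A minimal sketch of the explore and transform steps: hypothetical order records are grouped by region, summarized, and the region with the highest total is highlighted. The data and names are invented for illustration, and no statistical modeling is attempted here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical acquired data: (region, order amount).
orders = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 200.0),
    ("east", 50.0),
    ("south", 100.0),
]

# Transform: group the raw records by region.
by_region = defaultdict(list)
for region, amount in orders:
    by_region[region].append(amount)

# Summarize each group so that relevant figures stand out.
summary = {
    region: {"orders": len(a), "total": sum(a), "average": mean(a)}
    for region, a in by_region.items()
}

# Highlight: the region contributing the most revenue.
top_region = max(summary, key=lambda r: summary[r]["total"])
print(top_region, summary[top_region])
```
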
3. Data Curation

• It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
• Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
• Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
• Data curators (also known as scientific curators or data annotators) are responsible for ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
• A key trend in the curation of big data is the use of community and crowdsourcing approaches.
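
Of the curation activities listed above, validation is the easiest to sketch in code. The required fields, the year range, and the catalog entries below are assumed quality rules and sample data, not rules taken from the slides.

```python
REQUIRED_FIELDS = ("id", "title", "year")  # assumed quality rules


def validate(record):
    """Return a list of quality problems found in one record."""
    problems = [f"missing {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    year = record.get("year")
    if year and not (1900 <= year <= 2100):
        problems.append("year out of range")
    return problems


# Hypothetical catalog entries awaiting curation.
catalog = [
    {"id": 1, "title": "Rainfall 2020", "year": 2020},
    {"id": 2, "title": "", "year": 2021},
    {"id": 3, "title": "Census", "year": 3021},
]

quality_report = {r["id"]: validate(r) for r in catalog}
print(quality_report)
```

A curator could use such a report to decide which records to fix, which to reject, and which are fit for their purpose.
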
4. Data Storage

• It is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
• Relational Database Management Systems (RDBMS) have been the main, and almost the only, solution to the storage paradigm for nearly 40 years.
• However, RDBMSs, which rely on the ACID (Atomicity, Consistency, Isolation, and Durability) properties to guarantee database transactions, lack flexibility with regard to schema changes, and their performance and fault tolerance degrade as data volumes and complexity grow, which makes them unsuitable for many big data scenarios.
• NoSQL technologies have been designed with scalability in mind and present a wide range of solutions based on alternative data models.
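
To make the ACID guarantees more concrete, the sketch below demonstrates atomicity with Python's standard `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are hypothetical. A transfer either applies both updates or neither: the second call would overdraw an account, violates the `CHECK` constraint, and is rolled back in full.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(connection, source, target, amount):
    """Move money atomically: both updates succeed or neither does."""
    try:
        with connection:  # commits on success, rolls back on an exception
            connection.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
conn.close()

print(ok, failed, balances)
```

Enforcing guarantees like this across many machines is part of what makes scaling relational systems difficult, which is the trade-off the NoSQL bullet above refers to.
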
5. Data Usage

• It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
• Data usage in business decision-making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
Basic concepts of big data

What Is Big Data?

• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
• In this context, a "large dataset" means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
• This means that the common scale of big datasets is constantly shifting and may vary significantly from organization to organization.
• Big data is characterized by the 4 Vs and more:
  • Volume: large amounts of data (zettabytes, massive datasets)
  • Velocity: data is live-streaming or in motion
  • Variety: data comes in many different forms from diverse sources
  • Veracity: can we trust the data? How accurate is it?

Fig. 3. Characteristics of Big Data
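
One practical consequence of volume and velocity is that data often has to be processed as a stream rather than loaded all at once. The sketch below is a toy illustration of that idea: `reading_stream` is a hypothetical stand-in for a large or live source, and `running_mean` computes an average in a single pass using constant memory. Real big data systems distribute this kind of work across many machines, which this sketch does not attempt.

```python
def reading_stream(n):
    """Stand-in for a large or live data source: yields one value at a time."""
    for i in range(n):
        yield (i % 10) * 1.5


def running_mean(stream):
    """Compute the mean in one pass without holding the data in memory."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
    return total / count if count else 0.0


result = running_mean(reading_stream(100_000))
print(result)
```
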

Sources of Big Data

• Mobile devices (tracking all objects all the time)
Areas of Applications of Big Data

• Health and well-being
• Policy making and public opinion
• Smart cities and a more efficient society
• New online educational models: MOOCs and student-teacher modeling
• Robotics and human-robot interaction
Areas of Applications of Big Data (I)

• Smarter healthcare
• Multi-channel sales
• Telecom
• Homeland security
• Trading analytics
• Traffic control
• Search quality
• Manufacturing
Big Data vs Data Science

Factors | Big Data | Data Science
Concept | Handling large data | Analyzing data
Responsibility | Processing huge volumes of data and generating insights | Understanding patterns within data and making decisions
Industry | E-commerce, security services, telecommunication | Sales, image recognition, advertisement, risk analytics
Tools | Hadoop | Python, R