
Understanding Data and Its Types-Lecture 1

The document discusses data and data analysis, noting that data is the outcome of random systems and contains information, while data analysis uses statistical tools and machine learning to understand random systems and make predictions. Data analysis aims to predict outcomes using statistical or machine learning models along with accounting for potential errors, allowing for analysis of larger datasets with greater accuracy. Machine learning techniques can incorporate heuristic understanding of data beyond exact statistical models.

Uploaded by

ikki123123

Data and Data Analysis

Ikram E Khuda
Content developed by Ikram E Khuda © 2023, ARCD / Statistics
Recognize the Data
• In mathematics there are two types of experiments:
  • Deterministic, and
  • Non-deterministic (also called random)
• Deterministic means that which can be determined. Hence, experiments whose outputs can be found using a derived mathematical formula, with no variation in them, are called deterministic systems.
• Non-deterministic means that which cannot be determined. Hence, experiments whose outputs cannot be found because there is no available mathematical formula are called non-deterministic or random systems.
• Outputs of random systems are associated with uncertainty (doubt and mistrust).
• Data is the outcome variable of a random system

• Data consists of facts and figures which contain information, i.e. data is the raw form of information

• Information is an entity that resolves problems containing uncertainty. Resolve is different from solve. To resolve means to bring the problem to an end or to its conclusion. To solve is the process of finding an answer.
Identify a random system?
Example 1

• Consider the following System A

Input X = 1.2, 1.2, .., 1.2 → [System A] → Output Y = 2.1, 2.1, …, 2.1

What is the output of System A for an input of 1.2?

It is 2.1
Example 2
• Consider the following System B

Input X = 1.2, 1.2, .., 1.2 → [System B] → Output Y = 2.1, 1.8, …, 2.3, 1.7, 2.2, 2.1..

What is the output of System B for an input of 1.2?

It is .. How do I know?
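
The contrast between System A and System B can be sketched in a few lines of Python. This is only an illustration: the slides do not say how System B produces its outputs, so the sketch assumes it picks uniformly from the six output values visible in Example 2 (the elided values are not guessed). The names system_a, system_b and OBSERVED_OUTPUTS are hypothetical.

```python
import random

# The six System B outputs that are visible on the slide (elided ones are left out).
OBSERVED_OUTPUTS = [2.1, 1.8, 2.3, 1.7, 2.2, 2.1]


def system_a(x: float) -> float:
    """Deterministic: the slide only shows x = 1.2, which always gives 2.1."""
    return 2.1


def system_b(x: float, rng: random.Random) -> float:
    """Random: the same input can give different outputs.

    ASSUMPTION: a uniform pick from the observed values stands in for the
    unknown mechanism inside System B.
    """
    return rng.choice(OBSERVED_OUTPUTS)


rng = random.Random(0)  # fixed seed so the run is repeatable
outputs_a = [system_a(1.2) for _ in range(5)]
outputs_b = [system_b(1.2, rng) for _ in range(5)]

print("System A:", outputs_a)  # the same value every time
print("System B:", outputs_b)  # values vary from call to call
```

For System A the answer to "what is the output for 1.2?" is a single number. For System B the best one can do is describe how the outputs are spread.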
Articulate Data Analysis
• How can we find the output variable Y in System B?

Approach 1:
Open up the system and trace the input-to-output flow through it to find the output.
This is an engineering approach.

Approach 2:
Use the data, or the random variable (short: rv), to understand System B to the extent that we are able to characterize the system and predict its output.

• Approach 2, in short, we call Data Analysis
• We do Data Analysis by applying statistical tools, or
• by using Machine Learning tools. Machine Learning uses data analysis (i.e. statistics) and computer algorithms to imitate the way humans would understand the working of a random system (e.g. System B)

• In any case, the whole purpose of data analysis is to make predictions and hence to support decision making.

• This whole process can be summarized in the following equation:


predicted outcome = statistical model ± error of the model

or

predicted outcome = machine learning model ± error of the model

statistical models + computer algorithms → Machine Learning
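
As a rough illustration of "predicted outcome = model ± error of the model", the sketch below uses the simplest possible statistical model for System B: the mean of the six outputs visible in Example 2, with the sample standard deviation as the error. Both choices are assumptions made for this example and are not prescribed by the slides.

```python
import statistics

# The six System B outputs visible on the slide for the input X = 1.2.
observed_y = [2.1, 1.8, 2.3, 1.7, 2.2, 2.1]

# ASSUMPTION: the "statistical model" is just the sample mean ...
model = statistics.mean(observed_y)
# ... and the "error of the model" is the sample standard deviation.
error = statistics.stdev(observed_y)

print(f"predicted outcome = {model:.2f} ± {error:.2f}")
```

A richer statistical or machine learning model would replace the mean, but the prediction would still be reported together with its error.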
Discuss Data Analysis
• With the use of computer algorithms, bigger data can be analysed with better accuracy in less time.
• It also makes it possible to include a heuristic understanding of data, which is not conceivable using the exact mathematical models used in statistics.
• This is the essence of human intelligence.
For example, while driving a car, a better driver is one who is trained in and experienced with the car and the road conditions, rather than one who is making calculations of road turns or angles of road deviation!
Compare Data Types
• There is no single definition of data types
• In terms of mathematical representation
  • Integer (no decimal values)
  • Continuous (with decimal values)
• In terms of content
  • Numeric or quantitative (numbers, integer or continuous)
  • String or qualitative (alphabetic or alphanumeric)
• In terms of count
  • Discrete (finite count), can be numeric or string
  • Continuous (infinite count), always numeric
• In terms of levels of measurement
  • Nominal (just labels; can be discrete, integer, quantitative or qualitative)
  • Ordinal (shows order or rank among data values; can be discrete, integer, quantitative or qualitative)
  • Interval (always continuous, but covers those variables where zero is not defined)
  • Ratio (always continuous, but covers those variables where zero is defined)
Interval and Ratio together are also called Scale.
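
One way to keep these classifications apart is to record each of them separately for every variable. The sketch below does this for four example variables. The Variable class and the example variables are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    name: str
    content: str  # "numeric" or "string"
    count: str    # "discrete" or "continuous"
    level: str    # "nominal", "ordinal", "interval" or "ratio"

    def is_scale(self) -> bool:
        """Interval and ratio variables together are called scale."""
        return self.level in ("interval", "ratio")


# Hypothetical example variables, one per level of measurement.
variables = [
    Variable("blood_group", "string", "discrete", "nominal"),
    Variable("education_level", "string", "discrete", "ordinal"),
    Variable("temperature_celsius", "numeric", "continuous", "interval"),
    Variable("height_cm", "numeric", "continuous", "ratio"),
]

scale_names = [v.name for v in variables if v.is_scale()]
print(scale_names)
```

Temperature in Celsius is interval because 0 °C does not mean "no temperature", whereas a height of 0 cm does mean "no height", which makes height a ratio variable.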
Compare Data Types
• Data types can also be described by the way they are analysed
  • Sample Data
  • Population Data
• If the whole data set is used for data analysis, then it is called population data
• If a fraction of the data from some data set is used, then it is called sample data
[Figure: Population Data and Sample Data]
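
The difference between population data and sample data can be sketched as follows. The population here is simulated, since the slides give no real data set; its size, mean and spread are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# ASSUMPTION: a made-up population of 10,000 values centred on 2.0.
population = [rng.gauss(2.0, 0.25) for _ in range(10_000)]

# Sample data: a fraction (here 100 values) drawn from the population.
sample = rng.sample(population, 100)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean = {population_mean:.3f}")
print(f"sample mean     = {sample_mean:.3f}")
```

The sample mean is usually close to the population mean but not identical to it, which is why results based on a sample carry an error.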
Compare Data Types
• Data is also classified in terms of its dimensions
  • Low-dimensional data has few features
  • High-dimensional data has many features

High-dimensional data are defined as data in which the number of features (variables observed), p, is close to or larger than the number of observations (or data points), n.

The opposite is low-dimensional data, in which the number of observations, n, far outnumbers the number of features, p.

A related concept is wide data, which refers to data with numerous features irrespective of the number of observations (similarly, tall data is often used to denote data with a large number of observations).

Analyses of high-dimensional data require consideration of potential problems that come from having more features than observations.
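
A minimal sketch of this rule compares p with n directly. The text says "close to or larger than" without giving a number, so the closeness parameter below is a hypothetical knob: with the default of 1.0 the data counts as high-dimensional only when p is at least n.

```python
def describe_dimensionality(n_observations: int, n_features: int,
                            closeness: float = 1.0) -> str:
    """Label a data set by comparing features (p) with observations (n).

    ASSUMPTION: "close to or larger than" is approximated by
    p >= closeness * n; the slides give no exact threshold.
    """
    if n_features >= closeness * n_observations:
        return "high-dimensional"
    return "low-dimensional"


print(describe_dimensionality(n_observations=100, n_features=5))
print(describe_dimensionality(n_observations=20, n_features=500))
```

Lowering closeness (for example to 0.8) would also flag data sets where p is merely close to n.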
Example 2 (cont’d)
• In Example 2, the output variable Y is called the data or random variable (rv)
• Every single value of Y is called an event.
• Every event has a chance of occurrence.
• This chance of occurrence is called the Probability of an event, or P(E)
• Mathematically this probability can be calculated as:

P(E) = (Total No. of Favourable Cases of an Event) / (Total Sample Space)
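
The formula can be applied directly to the System B outputs from Example 2. The sketch below treats the six visible outputs as the whole sample space, which is an assumption because the slide elides some values. The helper name probability is hypothetical.

```python
from fractions import Fraction


def probability(event, outcomes):
    """P(E) = favourable cases of the event / total sample space."""
    favourable = sum(1 for outcome in outcomes if outcome == event)
    return Fraction(favourable, len(outcomes))


# The six System B outputs visible on the slide.
observed_y = [2.1, 1.8, 2.3, 1.7, 2.2, 2.1]

p_21 = probability(2.1, observed_y)  # 2.1 occurs twice in six outputs
p_18 = probability(1.8, observed_y)  # 1.8 occurs once in six outputs
print(p_21, p_18)
```

Using Fraction keeps the result exact, so 2 favourable cases out of 6 is reported as 1/3 rather than a rounded decimal.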
Classification of Probability
The way a sample space is defined determines whether the probability is calculated theoretically or experimentally.

There are two different types of probability that we often talk about: theoretical probability and experimental probability.

Theoretical probability describes how likely an event is to occur. We know that a coin is equally likely to land heads or tails, so the theoretical probability of getting heads is 1/2.

Experimental probability describes how frequently an event actually occurred in an experiment. So if you tossed a coin 20 times and got heads 8 times, the experimental probability of getting heads would be 8/20, which is the same as 2/5, or 0.4, or 40%.

The theoretical probability of an event will always be the same, but the experimental probability is affected by chance, so it can be different for different experiments.

The more trials you carry out (for example, the more times you toss the coin), the closer the experimental probability is likely to be to the theoretical probability.
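
This behaviour can be checked with a small simulation of a fair coin. The sketch is illustrative only: the seed and the numbers of tosses are arbitrary choices, and experimental_probability is a hypothetical helper name.

```python
import random

THEORETICAL_P_HEADS = 0.5  # fair coin: heads and tails are equally likely


def experimental_probability(n_tosses: int, rng: random.Random) -> float:
    """Toss a simulated fair coin n_tosses times; return the share of heads."""
    heads = sum(rng.random() < THEORETICAL_P_HEADS for _ in range(n_tosses))
    return heads / n_tosses


rng = random.Random(1)  # fixed seed so the run is repeatable
for n in (20, 200, 2_000, 20_000):
    p = experimental_probability(n, rng)
    print(f"{n:>6} tosses: experimental P(heads) = {p:.3f}")
```

With only 20 tosses the result can be well away from 0.5, as in the 8/20 = 0.4 example above, while with many tosses it tends to settle near the theoretical value.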
Rules of Probability
• The probability of an event, i.e. P(E), is always a real number between 0 and 1
• The sum of the probabilities of all the mutually exclusive events of a sample space is 1
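
Both rules can be checked on a small example. The sketch below builds the probability of every distinct output among the six System B values visible in Example 2, treating them as the whole sample space (an assumption, since the slide elides some values).

```python
from collections import Counter
from fractions import Fraction

# The six System B outputs visible on the slide.
observed_y = [2.1, 1.8, 2.3, 1.7, 2.2, 2.1]

# Probability of each distinct (mutually exclusive) output value.
counts = Counter(observed_y)
distribution = {value: Fraction(count, len(observed_y))
                for value, count in counts.items()}

# Rule 1: every probability lies between 0 and 1.
rule_1_holds = all(0 <= p <= 1 for p in distribution.values())
# Rule 2: the probabilities of all mutually exclusive events sum to 1.
rule_2_holds = sum(distribution.values()) == 1

print(rule_1_holds, rule_2_holds)
```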
Example 3
(Extrapolate what we have discussed so far)
Data Types
[Figure labels: string, discrete and nominal; string and qualitative; string, discrete and nominal; string, discrete and ordinal]
Data Types
[Figure labels: string, discrete and ordinal; numeric, continuous and ratio/scale]
Data Types
[Figure labels: integers, discrete and ordinal; string and qualitative]
Data Types
[Figure labels: 5 variables; string, discrete and ordinal]
Example of Data in a Data Editor
low dimensional data

No. of variables/features < no. of observations
high dimensional data

No. of variables/features > no. of observations
Recognize Measurement Error and Accuracy
• Measurement error is the difference between the true value of something and the number used to represent that value.
• Accuracy is the degree to which the value being measured is close to the object's actual measurement.
• It is the degree to which the measured value is similar to a reference or genuine value.
Recognize Accuracy Formula

• The accuracy formula helps one to understand measurement errors. A measurement is considered highly accurate and error-free if the measured value is equal to the real value. Error rate and accuracy are complementary: the higher the error rate, the lower the accuracy.
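
The formula itself did not survive on this slide, so the sketch below relies on a commonly used definition and should be read as an assumption rather than as the slide's own formula: error rate = |measured value - true value| / |true value| × 100, and accuracy = 100 - error rate (both in percent). The function names are hypothetical.

```python
def error_rate_percent(measured: float, true_value: float) -> float:
    """ASSUMED definition: relative measurement error as a percentage."""
    if true_value == 0:
        raise ValueError("relative error is undefined for a true value of 0")
    return abs(measured - true_value) / abs(true_value) * 100


def accuracy_percent(measured: float, true_value: float) -> float:
    """ASSUMED definition: accuracy is what remains after the error rate."""
    return 100 - error_rate_percent(measured, true_value)


# A measurement equal to the real value is error-free.
print(accuracy_percent(10.0, 10.0))
# A reading of 9.8 for a true value of 10.0 is off by about 2 percent.
print(accuracy_percent(9.8, 10.0))
```

Under this definition the two quantities always add up to 100%, so raising one necessarily lowers the other.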
Comparison between accuracy and precision
Discuss The Process of Data Analysis
The process of data analysis, or alternately, data analysis steps, involves gathering all the information, processing it, exploring the data, and using it to find patterns and other insights.
Review/ Assessment
• Kindly go through Practice Problems 1, available on Blackboard, for questions related to the topic.
Feedback
Q/A
