Understanding Data and Its Types-Lecture 1
Content developed by Ikram E Khuda ©2023, ARCD/Statistics
Recognize the Data
• In mathematics there are two types of experiments
• Deterministic and
• Non deterministic (also called random)
• By deterministic it is meant that which can be determined. Hence, experiments whose outputs can be found using a derived mathematical formula, with no variation in them, are called deterministic systems.
• By non-deterministic it is meant that which cannot be determined. Hence, experiments whose outputs cannot be found, because no mathematical formula is available, are called non-deterministic or random systems.
• Outputs of random systems are associated with uncertainty (doubt and mistrust).
• Data is the outcome variable of a random system
• Data denotes the facts and figures that contain information, i.e. data is the raw form of information
• Information is an entity that resolves problems containing uncertainty. Resolve is different from solve: to resolve means to bring the problem to an end or to its conclusion, whereas to solve is the process of finding an answer.
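The contrast between the two kinds of system can be sketched in a few lines of Python. Both functions and the formula inside them are hypothetical, chosen only for illustration: the deterministic one returns the same output for the same input every time, while the random one adds noise that no formula predicts.

```python
import random


def deterministic_system(x: float) -> float:
    """Hypothetical deterministic system: the output follows a fixed formula."""
    return 2.0 * x - 0.3


def random_system(x: float, rng: random.Random) -> float:
    """Hypothetical random system: the same formula plus unpredictable noise."""
    return 2.0 * x - 0.3 + rng.gauss(0.0, 0.2)


rng = random.Random(0)  # seeded only so that the sketch is repeatable

deterministic_outputs = [deterministic_system(1.2) for _ in range(5)]
random_outputs = [random_system(1.2, rng) for _ in range(5)]

print(deterministic_outputs)  # one value, repeated
print(random_outputs)         # values that scatter from run to run of the system
```

The scattered outputs of the second system are what this lecture calls data.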
Identify a random system?
Example 1
It is 2.1
Example 2
• Consider the following System B
[Figure: System B, with input X = 1.2, 1.2, .., 1.2 and output Y = 2.1, 1.8, …, 2.3, 1.7, 2.2, 2.1, ..]
What is the output of System B for an input of 1.2?
It is .. How do I know?
Articulate Data Analysis
• How can we find the output variable Y in System B?
Approach 1:
Open up the system and trace the input-to-output flow through it to find the output.
This is an engineering approach.
Approach 2:
Use the data or the random variable (short: rv) to understand System B to an extent that we are able to characterize the system and predict its output.
• Approach 2, in short, we call Data Analysis
• We do Data Analysis by applying statistical tools, or by using Machine Learning tools. Machine Learning uses data analysis (i.e. statistics) and computer algorithms to imitate the way humans would understand the working of a random system (e.g. System B)
• In any case, the whole purpose of data analysis is to make predictions and hence to support decision making.
predicted outcome = machine learning model ± error of the model
[Figure: statistical models + computer algorithms = Machine Learning]
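A minimal sketch of "predicted outcome = model ± error" for System B, using only the output values that Example 2 lists explicitly (the elided values are not filled in). Taking the sample mean as the model and the sample standard deviation as its error is an assumption made here for illustration; it is the simplest statistical model, not necessarily the one intended in the lecture.

```python
from statistics import mean, stdev

# Outputs of System B listed explicitly in Example 2 for the input X = 1.2.
observed_y = [2.1, 1.8, 2.3, 1.7, 2.2, 2.1]

# Assumption: the "model" is the sample mean of the data and the
# "error of the model" is the sample standard deviation.
model_prediction = mean(observed_y)
model_error = stdev(observed_y)

print(f"predicted outcome = {model_prediction:.2f} ± {model_error:.2f}")
```

The prediction is a centre with a spread around it, not a single certain number. That is the answer to "How do I know?" for a random system.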
Discuss Data Analysis
• With the use of computer algorithms, bigger data can be analysed with better accuracy in less time.
• It also enables a heuristic understanding of data to be included, which is not conceivable using the exact mathematical models used in statistics.
• This is the essence of human intelligence.
For example, while driving a car, a better driver is one who is trained in or experienced with the car and the road conditions, rather than one who is making calculations with road turns or angles of road deviation!
Compare Data Types
• There is no one definition of data types
• In terms of mathematical representations
• Integer (no decimal values)
• Continuous (with decimal values)
• In terms of content
• Numeric or quantitative (numbers, integer or continuous)
• String or qualitative (alphabetic or alphanumeric)
• In terms of count
• Discrete (finite count), can be numeric or string
• Continuous (infinite count), always numeric
• In terms of levels of measurements
• Nominal (just labels, can be discrete, integers, quantitative or qualitative)
• Ordinal (showing order or ranks among data values, can be discrete, integers, quantitative or qualitative)
• Interval (always continuous, but includes those variables where a true zero is not defined)
• Ratio (always continuous, but includes those variables where a true zero is defined)
Interval and Ratio together are also called Scale.
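The first two classifications (by content and by mathematical representation) can be read off the values themselves, as the sketch below shows. The function name and the example columns are hypothetical. The level of measurement is deliberately left out: whether a variable is nominal, ordinal, interval or ratio depends on what the values mean, so it cannot be inferred from the values alone.

```python
def describe_values(values):
    """Classify raw values by content and by mathematical representation."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if not is_numeric:
        # Alphabetic or alphanumeric labels.
        return "string (qualitative)", None
    # Numeric content: integer if no value has a decimal part, else continuous.
    all_whole = all(float(v).is_integer() for v in values)
    return "numeric (quantitative)", "integer" if all_whole else "continuous"


# Hypothetical columns, for illustration only.
print(describe_values([21, 35, 42]))         # whole numbers
print(describe_values([1.62, 1.80, 1.75]))   # values with decimals
print(describe_values(["A", "B+", "C"]))     # text labels
```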
Compare Data Types
• Data types can also be described by the way they are analysed
• Sample Data
• Population Data
• If the whole data set is used for data analysis, then it is called population data
• If a fraction of the data from some data set is used, then it is called sample data
[Figure: Population Data and Sample Data]
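A short sketch of the difference, assuming a hypothetical population of 1,000 simulated measurements: the population mean uses every value, while the sample mean uses only a randomly chosen fraction of them.

```python
import random
from statistics import mean

rng = random.Random(42)  # seeded only so that the sketch is repeatable

# Hypothetical population: 1,000 measurements from some random system.
population = [rng.gauss(2.0, 0.25) for _ in range(1000)]

# Sample data: a fraction of the population (here 50 values, drawn
# at random without replacement).
sample = rng.sample(population, k=50)

population_mean = mean(population)  # uses the whole data
sample_mean = mean(sample)          # uses only the sampled fraction

print(f"population mean = {population_mean:.3f}")
print(f"sample mean     = {sample_mean:.3f}")
```

The two means are usually close but not identical, because the sample is itself the outcome of a random selection.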
Compare Data Types
• Data is also classified in terms of its dimensions
• Low-dimensional data is data with few features
• High-dimensional data is data with many features
High-dimensional data are defined as data in which the number of features (variables observed), p, is close to or larger than the number of observations (or data points), n.
The opposite is low-dimensional data, in which the number of observations, n, far outnumbers the number of features, p.
A related concept is wide data, which refers to data with numerous features irrespective of the number of observations (similarly, tall data is often used to denote data with a large number of observations).
Analyses of high-dimensional data require consideration of potential problems that come from having more features than observations.
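The definition compares p with n, which can be sketched as a simple check. The function and its `closeness` parameter are hypothetical: the text says only that p is "close to or larger than" n, so the cut-off used here (p at least half of n) is an assumption, and real data sets near the boundary need judgement.

```python
def dimensionality(n_observations: int, n_features: int, closeness: float = 0.5) -> str:
    """Label a data set as high- or low-dimensional from n and p.

    `closeness` is a hypothetical threshold: p counts as "close to" n
    when p >= closeness * n.
    """
    if n_features >= closeness * n_observations:
        return "high-dimensional"
    return "low-dimensional"


# n far outnumbers p.
print(dimensionality(n_observations=1000, n_features=5))
# p is larger than n.
print(dimensionality(n_observations=50, n_features=2000))
```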
Example 2 (cont’d)
• In Example 2, the output variable Y is called the data or random variable (rv)
• Every single value of Y is called an event.
• Every event has a chance of occurrence.
• This chance of occurrence is called the Probability of an event, or P(E)
• Mathematically this probability can be calculated as:
$$P(E) = \frac{\text{Total no. of favourable cases of an event}}{\text{Total sample space}}$$
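The formula can be applied to the outputs of System B that Example 2 lists explicitly. Treating those six listed values as the whole sample space is an assumption made for this sketch; the elided values are not filled in.

```python
from collections import Counter
from fractions import Fraction

# Outputs of System B listed explicitly in Example 2.
observed_y = [2.1, 1.8, 2.3, 1.7, 2.2, 2.1]


def probability(event, sample_space):
    """P(E) = favourable cases of the event / total size of the sample space."""
    favourable = Counter(sample_space)[event]
    return Fraction(favourable, len(sample_space))


print(probability(2.1, observed_y))  # 2 favourable cases out of 6
print(probability(1.8, observed_y))  # 1 favourable case out of 6
```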
Classification of Probability
The way a sample space is defined determines whether the probability is calculated theoretically or experimentally.
There are two different types of probability that we often talk about: theoretical probability and experimental probability.
Theoretical probability describes how likely an event is to occur. We know that a coin is equally likely to land heads or tails, so the theoretical probability of getting heads is 1/2.
Experimental probability describes how frequently an event actually occurred in an experiment. So if you tossed a coin 20 times and got heads 8 times, the experimental probability of getting heads would be 8/20, which is the same as 2/5, or 0.4, or 40%.
The theoretical probability of an event will always be the same, but the experimental
probability is affected by chance, so it can be different for different experiments.
The more trials you carry out (for example, the more times you toss the coin), the closer
the experimental probability is likely to be to the theoretical probability.
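The coin example can be simulated. In the sketch below the fair coin is modelled with Python's pseudo-random generator, which is an assumption standing in for real tosses; the 8-heads-in-20 figures are the ones quoted above.

```python
import random
from fractions import Fraction

theoretical_p_heads = Fraction(1, 2)

# The worked example from the text: 8 heads in 20 tosses.
experimental_from_text = Fraction(8, 20)
print(experimental_from_text, float(experimental_from_text))


def experimental_p_heads(n_tosses: int, rng: random.Random) -> float:
    """Toss a simulated fair coin n_tosses times; return the fraction of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


rng = random.Random(7)  # seeded only so that the sketch is repeatable
for n in (20, 200, 20_000):
    # With more trials the result tends to settle near the theoretical 1/2.
    print(n, experimental_p_heads(n, rng))
```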
Rules of Probability
• Probability of an event, i.e. P(E), is always a real number between 0 and 1
• Sum of the probabilities of all the mutually exclusive events that make up the sample space is 1
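Both rules can be checked mechanically for any proposed set of probabilities. The helper name is hypothetical, and the check assumes the listed events are mutually exclusive and together cover the whole sample space, so that their probabilities must sum to 1.

```python
import math


def is_valid_distribution(probabilities) -> bool:
    """Check that each P(E) lies in [0, 1] and that all P(E) sum to 1."""
    in_range = all(0.0 <= p <= 1.0 for p in probabilities)
    # isclose tolerates the tiny rounding error of floating-point sums.
    sums_to_one = math.isclose(sum(probabilities), 1.0)
    return in_range and sums_to_one


fair_coin = [0.5, 0.5]    # heads, tails
fair_die = [1 / 6] * 6    # six equally likely faces
broken = [0.7, 0.6]       # each value is in range, but the sum is not 1

print(is_valid_distribution(fair_coin))
print(is_valid_distribution(fair_die))
print(is_valid_distribution(broken))
```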
Example 3
(Extrapolate what we have discussed so far)
Data Types
[Figure: example data annotated as: string, discrete and nominal; string, discrete and nominal; numeric, continuous and ratio/scale]
Data Types
[Figure: example data annotated as: integers, discrete and ordinal; string and qualitative]
Data Types
[Figure: example data with 5 variables, annotated as: string, discrete and ordinal]
Example of Data in a Data Editor
[Figure: example data annotated as: low-dimensional data]
Recognize Measurement Error and Accuracy
• Measurement error is the difference between the true value of something and the numbers used to represent that value.
• Accuracy is the degree to which the value being measured is close to the object’s actual measurement.
• It is the degree to which the measured value is similar to a reference or genuine value.
Recognize Accuracy Formula
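A hedged sketch of how measurement error and accuracy can be computed. The error follows the definition of measurement error as the difference between the recorded number and the true value. The percent-accuracy convention, 100 × (1 − |error| / |true value|), is one common choice and is an assumption here, not necessarily the formula used in this lecture. The function names and the example reading are hypothetical.

```python
def measurement_error(measured: float, true_value: float) -> float:
    """Difference between the recorded number and the true value."""
    return measured - true_value


def percent_accuracy(measured: float, true_value: float) -> float:
    """Assumed convention: 100 * (1 - |error| / |true value|)."""
    if true_value == 0:
        raise ValueError("percent accuracy is undefined for a true value of zero")
    relative_error = abs(measurement_error(measured, true_value)) / abs(true_value)
    return 100.0 * (1.0 - relative_error)


# Hypothetical reading: a 2.00 kg reference mass is measured as 1.95 kg.
print(measurement_error(1.95, 2.00))  # negative: the reading is too low
print(percent_accuracy(1.95, 2.00))
```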
Discuss The Process of Data Analysis
The process of data analysis, or alternately, data analysis steps, involves gathering all the information, processing it, exploring the data, and using it to find patterns and other insights.
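The four steps can be sketched on a tiny, hypothetical set of raw readings. The data, the cleaning rule and the "more than one standard deviation from the mean" rule for spotting unusual values are all assumptions chosen for illustration, not a prescribed method.

```python
from statistics import mean, stdev

# 1. Gather: hypothetical raw readings, as they might arrive from a source.
raw_readings = ["2.1", "1.8", "n/a", "2.3", "1.7", "", "2.2", "2.1"]


# 2. Process: keep only the entries that can be read as numbers.
def to_number(text):
    try:
        return float(text)
    except ValueError:
        return None


clean = [v for v in map(to_number, raw_readings) if v is not None]

# 3. Explore: summarise the cleaned data.
centre = mean(clean)
spread = stdev(clean)

# 4. Find patterns: flag readings more than one standard deviation from the centre.
unusual = [v for v in clean if abs(v - centre) > spread]

print(f"{len(clean)} usable readings, centre {centre:.2f}, spread {spread:.2f}")
print("unusual readings:", unusual)
```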
Review/ Assessment
• Kindly go through the Practice Problems 1 available on Blackboard for the questions related to the topic.
Feedback
Q/A