
Lecture 1 2024

The document outlines a course on data analysis for engineers using SPSS, covering fundamental concepts in statistics, levels of measurement, and basic statistical methods. It details descriptive and inferential statistics, the importance of sampling, and various statistical techniques including hypothesis testing and ANOVA. Additionally, it discusses the significance of understanding data types and measurement levels for effective statistical analysis.

Uploaded by

fortunaokala

Introduction to Data Analysis for Engineers

WITH SPSS

Prof. rer. Nat. tech. Innocent Ndoh Mbue


(Eco-informatics and Environmental Management)

14-Jan-25 1
Course Outline
A. Overview of the basic concepts in Statistics
   • Some Key Definitions
   • Descriptive and Inferential Statistics
   • Data Presentation

B. Levels of Measurement

C. Levels of Measurement
D. Basic Statistics for Data Science
   • Measures of central tendency and of dispersion
   • The Normal Distribution
   • Hypothesis Testing
   • Contingency Analysis / Chi-squared statistics
   • Correlation and Regression (inferences)
   • One-way ANOVA
E. Data conception, Collection, Preparation, Entry, and Analysis using SPSS
The circular process of research:

1. Questions arise about a phenomenon.
2. A decision is made to collect data.
3. A decision is made as to how to collect the data.
4. The data is collected.
5. The data is summarized and analyzed.
6. Conclusions are drawn from the analysis (and the cycle starts again).
Descriptive & Inferential Statistics

Descriptive Statistics (describing data)
• Organize
• Summarize
• Simplify
• Presentation of data

Inferential Statistics (making predictions)
• Generalize from samples to populations
• Hypothesis testing
• Relationships among variables

Descriptive Statistics

3 Types

1. Frequency Distributions
2. Graphical Representations: graphs and tables
3. Summary Statistics: describe data in just one number

When to Use Descriptive Statistics

• When you describe your data sets or the relationships between your variables.
• When a researcher wants to gain a better understanding of a topic.
• The idea behind this type of research is to study frequencies, averages, and other statistical calculations.
• In medical and nursing situations.
Terminology
Populations & Samples

Population: the complete set of individuals, objects or scores of interest.
• Often too large to sample in its entirety.
• It may be real or hypothetical (e.g. the results from an experiment repeated ad infinitum).

Sample: a subset of the population.
• A sample may be classified as random (each member has an equal chance of being selected from the population) or convenience (what's available).
• Random selection attempts to ensure the sample is representative of the population.
Sample vs. Population

[Figure: Population and Sample]
In statistics:

• One draws conclusions about the population based on data collected from a sample.
Reasons:

• Cost
  It is less costly to collect data from a sample than from the entire population.

• Accuracy
  Data from a sample sometimes lead to more accurate conclusions than data from the entire population.
  Costs saved by using a sample can be directed to obtaining more accurate observations on each case in the sample.
Lecture I

Levels of Measurement

What is Measurement?
• The assignment of numerals to objects or events according to rules.
• Numerals are labels that have no inherent meaning, for example zip codes or automobile license plates.
• Numbers are numerals that have quantitative meaning and can be analyzed, for example age.
• The rules for assigning labels to properties of variables are the most important component of measurement, because poor rules produce meaningless outcomes.
• Concepts often cannot be measured directly (e.g., “intelligence”), so what is usually measured are indicators of constructs, such as speed, logic, verbal skill, etc.
Levels of Measurement
Four levels of measurement have been identified:

• Nominal
• Ordinal
• Interval
• Ratio

These levels differ in how closely they approach the structure of the number system we use.

• Understanding the level of measurement of the variables used in research is important because the level of measurement determines the types of statistical analyses that can be conducted.
• The conclusions that can be drawn from research depend on the statistical analysis used.
Possible data types and levels of measurement.

The type of data you have dictates the type of analysis you will perform.
Nominal Scale

• In nominal measurement, all observations in one category are alike on some property and differ from the members of the other categories on that property (e.g., sex, marital status).
• There is no ordering of the categories. We cannot say that one category is better or worse, or more or less, than another.

Nominal: Numbers Used as Names

• Basic empirical operations
  – Determination of equality
• Permissible statistics
  – Number of cases
  – Mode
  – Contingency correlation
• Examples
  – Numbers on football jerseys
  – Assignment of type or model numbers to classes
Ordinal Scales: they measure a variable in terms of magnitude, or rank.

Examples:
• socioeconomic class
• grades
• preferences

• Ordinal scales tell us relative order, but give us no information about the size of the differences between the categories.
• For example, runners in the 100 meter dash finish 1st, 2nd, 3rd, etc. Is the number of seconds between 1st and 2nd place the same as between 2nd and 3rd place? Not necessarily.

… Ordinal Scale: Rank-order data

• Most questionnaires use Likert-type items. For example, we may ask teachers about their job satisfaction.
• Asking whether a teacher is very satisfied, satisfied, neutral, dissatisfied, or very dissatisfied uses an ordinal scale of measurement.
Interval Scales
• This scale has the properties of the nominal and ordinal scales, but here the distances between consecutive points on the scale are equal. Temperature is the example usually given to illustrate an interval scale.
• The distance between attributes has meaning: for temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80.
• Interval scales do not have a true zero: 0 degrees does not mean the absence of heat (although it might feel like it).

If a change from 1 to 2 has the same strength as a change from 4 to 5, then we would call it an interval-level measurement (if not, then it is just an ordinal qualitative measurement).
Ratio Scales
• Ratio scales have all of the characteristics of the nominal, ordinal and interval scales. In addition, however, ratio scales have a true zero.
• There are true ratios. One can use all mathematical operations on this scale.
• Examples: weight, height, time, distance.
  10 miles is twice as long as 5 miles. 0 miles is no distance.

• In our descriptions of data in this course, we will assume that we are using ratio scales most of the time. We call these PARAMETRIC STATISTICS.
• However, there will be times when all we have to work with are ordinal scales. When we use these scales, our data will be rank ordered. We will call these NONPARAMETRIC STATISTICS.
Types of Variables
• A variable is a characteristic that changes or varies over time and/or for different individuals or objects under consideration.
• Variables are the quantities measured in a sample. They may be classified as:

Qualitative (Categorical)
  – Nominal, e.g. gender, blood group, hair color
  – Ordinal (ranked), e.g. mild, moderate or severe weather

Quantitative (Numerical)
  – Discrete, e.g. number of level II students in 2022, age, income category
  – Continuous, e.g. pH of a sample, elevation
Variables
• Variables can be further classified as:
  – Dependent/Response: the variable of primary interest (e.g. blood pressure in an antihypertensive drug trial). Not controlled by the experimenter.
  – Independent/Predictor:
    • called a Factor when controlled by the experimenter. It is often nominal (e.g. treatment).
    • called a Covariate when not controlled.

• If the value of a variable cannot be predicted in advance, then the variable is referred to as a random variable.
Example – Passing an exam
• Suppose we are interested in how Success is influenced by the following factors:
  – Study time
  – Punctuality in lectures
  – Participation during lectures
  – Sex
  – Age
Then
• Success is the dependent (response) variable, and
• Study time (ST), Punctuality in lectures (P@L), Participation during lectures (Partn), Sex and Age are the independent (predictor) variables.
Basic Statistics for Data
Science

Parameters & Statistics

• Parameters: quantities that describe a population characteristic. They are usually unknown, and we wish to make statistical inferences about them. E.g., μ, σ.
• Statistic: a quantity that describes a sample (mean, median, …).
• Descriptive statistics: quantities and techniques used to describe a sample characteristic or illustrate the sample data, e.g. mean, standard deviation, box-plot.
How many variables have you measured?

• Univariate data: one variable is measured on a single experimental unit.
• Bivariate data: two variables are measured on a single experimental unit.
• Multivariate data: more than two variables are measured on a single experimental unit.

[Figure: A Taxonomy of Statistics]
Measures of Location (or Central Tendency)

a. Describing the center

Let y denote a quantitative variable, with observations $y_1, y_2, y_3, \dots, y_n$.

Median: the middle measurement of the ordered sample.

Mean:
$$\bar{y} = \frac{y_1 + y_2 + \dots + y_n}{n} = \frac{\sum y_i}{n}$$
Sample Mean

The arithmetic mean (or, simply, mean) is computed by summing all the observations in the sample and dividing the sum by the number of observations.

For a sample of five household incomes, 6,000, 10,000, 10,000, 14,000 and 50,000, the sample mean is

$$\bar{X} = \frac{6000 + 10000 + 10000 + 14000 + 50000}{5} = 18000$$
Median

In a list ranked from the smallest measurement to the largest, the median is the middle value.

In our example of five household incomes, we first rank the measurements:

6,000 10,000 10,000 14,000 50,000

The sample median is 10,000.
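The household-income example can be checked with a few lines of Python. This is a minimal sketch using the standard-library `statistics` module; the variable names are chosen here for illustration and are not part of the course material.

```python
from statistics import mean, median

# The five household incomes from the example
incomes = [6000, 10000, 10000, 14000, 50000]

sample_mean = mean(incomes)      # sum of the observations divided by their number
sample_median = median(incomes)  # middle value of the ranked list

print(sample_mean, sample_median)
```

The single large income (50,000) pulls the mean well above the median, which is the point the later slides on outliers make.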
Example of Median

Measurements (x):         3  5  5  1  7  2  6  7  0  4
Measurements ranked (x):  0  1  2  3  4  5  5  6  7  7

• Median: (4 + 5)/2 = 4.5
• Notice that only the two central values are used in the computation.
• The median is not sensitive to extreme values: adding the extreme measurement 40 to the list moves the median only from 4.5 to 5.
Example of Mode

Measurements (x): 3  5  5  1  7  2  6  7  0  4

• In this case the data have two modes: 5 and 7.
• Both measurements are repeated twice.
Example of Mode

Measurements (x): 3  5  1  1  4  7  3  8  3

• Mode: 3
• Notice that it is possible for a data set not to have any mode.
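The two mode examples can be reproduced with `statistics.multimode` (Python 3.8 or later), which returns every value that attains the highest count. This is a small sketch only; the list names are assumptions made for the illustration.

```python
from statistics import multimode

# Measurements from the two "Example of Mode" slides
first_sample = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
second_sample = [3, 5, 1, 1, 4, 7, 3, 8, 3]

modes_first = multimode(first_sample)    # 5 and 7 each occur twice
modes_second = multimode(second_sample)  # 3 occurs three times

print(modes_first, modes_second)
```

Note that when no value repeats, `multimode` returns every value, which corresponds to the "no mode" case mentioned above.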
Properties of mean and median

• For symmetric distributions, mean = median.
• For skewed distributions, the mean is drawn in the direction of the longer tail, relative to the median.
• The mean is valid for interval scales, the median for interval or ordinal scales.
• The mean is sensitive to “outliers” (the median is often preferred for highly skewed distributions).
• When the distribution is symmetric or mildly skewed, or discrete with few values, the mean is preferred because it uses the numerical values of the observations.
In The Presence Of Outliers
Q: Do outliers affect the mean and the median?

Consider the list of numbers from 1 through 9:

1, 2, 3, 4, 5, 6, 7, 8, 9

The mean is 5. The median is 5.

What if we put the number 100 at the end of the list?

1, 2, 3, 4, 5, 6, 7, 8, 9, 100

The mean is 14.5. The median is 5.5.

Conclusion: Outliers affect the mean much more than the median!
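The effect of the outlier can be verified directly. The sketch below recomputes both measures before and after appending 100, using Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

numbers = list(range(1, 10))    # 1, 2, ..., 9
with_outlier = numbers + [100]  # same list with the outlier appended

mean_before, median_before = mean(numbers), median(numbers)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# The mean shifts by 9.5 while the median shifts by only 0.5
print(mean_before, median_before)
print(mean_after, median_after)
```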
Measures of Central Tendency: Which Measure to Choose?

The mean is generally used, unless extreme values (outliers) exist.

In that case the median is often used, since it is not sensitive to extreme values. For example, median home prices may be reported for a region.

In some situations it makes sense to report both the mean and the median.
Common Distributional Shapes:

• A symmetric distribution is one where both sides about the center line are approximately mirror images of each other.
• A skewed distribution is one where one side of the center line contains more data than the other.
  – Skewed to the right: the right side of the histogram extends much farther than the left side.
  – Skewed to the left: the left side of the histogram extends much farther than the right side.
Mean, Median, Mode

[Figure: three distribution curves (negatively skewed, symmetric (not skewed), positively skewed) with the positions of the mean, median and mode marked]

• Negatively skewed distribution: mean < median < mode
• Symmetric distribution: mean, median and mode coincide
• Positively skewed distribution: mean > median > mode
…Skewness
• Measures asymmetry of data
  – Positive or right skewed: longer right tail
  – Negative or left skewed: longer left tail

Let $x_1, x_2, \dots, x_n$ be n observations. Then,

$$\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{3/2}}$$
Interpretation

• Skewness: indicator used in distribution analysis as a sign of asymmetry and deviation from a normal distribution.

Interpretation:

• Skewness > 0: right-skewed distribution. Most values are concentrated to the left of the mean, with extreme values to the right.
• Skewness < 0: left-skewed distribution. Most values are concentrated to the right of the mean, with extreme values to the left.
• Skewness = 0: mean = median, the distribution is symmetrical around the mean.
Kurtosis

• Measures the peakedness of the distribution of data. With the formula below, which subtracts 3, the kurtosis of a normal distribution is 0.

Let $x_1, x_2, \dots, x_n$ be n observations. Then,

$$\text{Kurtosis} = \frac{n\,\sum_{i=1}^{n} (x_i - \bar{x})^4}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{2}} - 3$$

Interpretation

The thresholds below are stated for the raw kurtosis (before subtracting 3). For the formula above, which already subtracts 3, compare the result with 0 instead.

• Kurtosis > 3 (formula value > 0): leptokurtic distribution, sharper than a normal distribution, with values concentrated around the mean and thicker tails. This means a high probability of extreme values.
• Kurtosis < 3 (formula value < 0): platykurtic distribution, flatter than a normal distribution, with a wider peak. The probability of extreme values is lower than for a normal distribution, and the values are spread more widely around the mean.
• Kurtosis = 3 (formula value = 0): mesokurtic distribution, for example the normal distribution.
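As an illustration of how these two shape measures behave, the sketch below implements the moment-based forms $\sqrt{n}\,\sum(x_i-\bar{x})^3 / [\sum(x_i-\bar{x})^2]^{3/2}$ for skewness and $n\sum(x_i-\bar{x})^4 / [\sum(x_i-\bar{x})^2]^2 - 3$ for kurtosis. The function names are made up for this example, and statistical packages such as SPSS may apply small-sample corrections, so their output can differ slightly from these values.

```python
import math

def skewness(values):
    """Moment-based skewness: sqrt(n) * sum(d^3) / (sum(d^2))^(3/2)."""
    n = len(values)
    centre = sum(values) / n
    sum_sq = sum((x - centre) ** 2 for x in values)
    sum_cube = sum((x - centre) ** 3 for x in values)
    return math.sqrt(n) * sum_cube / sum_sq ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    n = len(values)
    centre = sum(values) / n
    sum_sq = sum((x - centre) ** 2 for x in values)
    sum_fourth = sum((x - centre) ** 4 for x in values)
    return n * sum_fourth / sum_sq ** 2 - 3

symmetric = list(range(1, 10))   # 1..9, symmetric about 5
right_tail = symmetric + [100]   # one extreme value on the right

print(skewness(symmetric))         # zero for symmetric data
print(skewness(right_tail))        # positive: longer right tail
print(excess_kurtosis(symmetric))  # negative: flatter than a normal curve
```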
Interpreting Graphs: Outliers

[Figure: two plots, one labelled "No Outliers" and one labelled "Outlier"]

• Are there any strange or unusual measurements that stand out in the data set?
…Descriptive Statistics

Measures of Dispersion or Variability

• Range (present the highest and lowest values in a distribution; the difference between these values is the range)
• Variance
• Standard deviation (the square root of the variance)
Range

The spread, or the distance, between the lowest and highest values of a variable.

To get the range for a variable, you subtract its lowest value from its highest value.

Class A: IQs of 13 students      Class B: IQs of 13 students
102   115                        127   162
128   109                        131   103
131    89                         96   111
 98   106                         80   109
140   119                         93    87
 93    97                        120   105
110                              109

Class A Range = 140 - 89 = 51    Class B Range = 162 - 80 = 82
The sample variance of the n observations is

$$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{(y_1 - \bar{y})^2 + \dots + (y_n - \bar{y})^2}{n - 1}$$

The standard deviation s is the square root of the variance,

$$s = \sqrt{s^2}$$
Variance and standard deviation

Variance:
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{SS}{n - 1} = \frac{SS}{df}$$
• ‘Sum of Squares’ = SS
• degrees of freedom (df) = n - 1

Standard deviation of a sample:
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{SS}{n - 1}} = \sqrt{\frac{SS}{df}}$$

Standard deviation for the whole population:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
Example
• For those in the student sample who attend religious services at least once a week (n = 9 of the 60),
• y = 2, 3, 7, 5, 6, 7, 5, 6, 4

$$\bar{y} = 5.0$$
$$s^2 = \frac{(2 - 5)^2 + (3 - 5)^2 + \dots + (4 - 5)^2}{9 - 1} = \frac{24}{8} = 3.0$$
$$s = \sqrt{3.0} = 1.7$$

For the entire sample (n = 60), mean = 3.0 and standard deviation = 1.6: the entire sample has similar variability but tends to be more liberal.
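The worked example can be reproduced step by step. The sketch below computes the sum of squares by hand and then cross-checks it against `statistics.variance` and `statistics.stdev`, both of which divide by n - 1; the variable names are illustrative.

```python
from statistics import mean, variance, stdev

# Scores of the nine students from the example
y = [2, 3, 7, 5, 6, 7, 5, 6, 4]

y_bar = mean(y)
sum_of_squares = sum((value - y_bar) ** 2 for value in y)  # SS
df = len(y) - 1                                            # degrees of freedom

s_squared = sum_of_squares / df  # sample variance, SS / df
s = s_squared ** 0.5             # sample standard deviation

print(y_bar, sum_of_squares, s_squared, round(s, 1))
print(variance(y), round(stdev(y), 1))  # library versions, same n - 1 divisor
```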
• Properties of the standard deviation:
  – s ≥ 0, and s = 0 only if all observations are equal
  – s increases with the amount of variation around the mean
  – Division by n - 1 (not n) is due to technical reasons (later)
  – s depends on the units of the data (e.g. measure in euro vs $)
  – Like the mean, it is affected by outliers
Measures of position

pth percentile: p percent of the observations lie below it, (100 - p)% above it.

• p = 50: median
• p = 25: lower quartile (LQ)
• p = 75: upper quartile (UQ)
• Interquartile range: IQR = UQ - LQ
Measures of Dispersion

Computing the range and the interquartile range

The lower quartile, or first quartile (Q1), is the value below which 25% of the data lie when they are arranged in increasing order.

The upper quartile, or third quartile (Q3), is the value below which 75% of the data lie when arranged in increasing order.

The median is regarded as the second quartile (Q2). The interquartile range is the difference between the upper quartile and the lower quartile.

The semi-interquartile range is half of the interquartile range.

… Computing the range and the interquartile range

Example 1: Range and interquartile range of a data set

Identify the quartiles of the following data set:

6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36.

To begin, arrange the values in increasing order. In doing so, you can assign a rank to each data point.

The point with the smallest value gets rank 1, the point with the second smallest value gets rank 2, and so on.
Rank   Value
1      6
2      7
3      15
4      36
5      39
6      41
7      41
8      43
9      43
10     47
11     49

Next, find the rank of the median. As seen in the section on the median, when the number of points is odd, the median is the value of the point with rank

(n + 1) ÷ 2 = (11 + 1) ÷ 2 = 6

The median is the data point of rank 6, that is 41. There are therefore 5 values on each side.

Split the half below the median in two. The lower quartile is the value of the point of rank (5 + 1) ÷ 2 = 3, which gives Q1 = 15. The half above the median is likewise split in two. The upper quartile is the value of the point of rank 6 + 3 = 9, which gives Q3 = 43.

Once the quartiles are found, measuring the dispersion is easy. The interquartile range is Q3 - Q1, which gives 28 (43 - 15). The semi-interquartile range is 14 (28 ÷ 2) and the range is 43 (49 - 6).
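The procedure used in this example (take the median, then take the median of each half on either side of it) can be sketched as a small function. The name `quartiles_by_halves` is an assumption made here; other quartile conventions exist and can give slightly different values on other data sets.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1, Q2, Q3 as medians of the lower half, the whole data and the upper half.

    When the number of values is odd, the median itself is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

data = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]
q1, q2, q3 = quartiles_by_halves(data)

iqr = q3 - q1                      # interquartile range
semi_iqr = iqr / 2                 # semi-interquartile range
data_range = max(data) - min(data)

print(q1, q2, q3, iqr, semi_iqr, data_range)
```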
Exercise: The heights in centimetres (cm) of 24 pupils in a secondary-school class were recorded.

Height (cm)   151   153   155   158   160   165
Frequency       2     5     8     5     3     1

Complete the table of increasing cumulative frequencies.

Complete the following statements:

a) The third value of the series of heights is ….
b) The median of the series is ……
c) The first quartile of the series is ……
d) The third quartile of the series is …….
e) The largest value for which 25% of the values of the series are greater than or equal to it is ……
f) The smallest value for which 25% of the values of the series are less than or equal to it is …..
g) The interquartile range is equal to …….
Solution

a) The third value of the series of heights is 153
b) The median of the series is 155
c) The first quartile of the series is 153
d) The third quartile of the series is 158
e) The largest value for which 25% of the values of the series are greater than or equal to it is 158
f) The smallest value for which 25% of the values of the series are less than or equal to it is 153
g) The interquartile range is equal to 5
Inter-quartile range

• The median divides a distribution into two halves.
• The first and third quartiles (denoted Q1 and Q3) are defined as follows:
  – 25% of the data lie below Q1 (and 75% lie above Q1),
  – 25% of the data lie above Q3 (and 75% lie below Q3).
• The inter-quartile range (IQR) is the difference between the third and first quartiles, i.e.
  IQR = Q3 - Q1
• Quartile deviation or semi-interquartile range = IQR/2
• Coefficient of quartile deviation = (Q3 - Q1) / (Q3 + Q1)
Methods of Variability Measurement

An example with 15 numbers:

3 6 7 11 13 22 30 40 44 50 52 61 68 80 94

Min = 3, Q1 = 11, Q2 = 40, Q3 = 61, Max = 94

In this example Q1 is the ((15 + 1)/4) × 1 = 4th observation of the data. The 4th observation is 11, so Q1 of this data set is 11.

The first quartile is Q1 = 11. The second quartile is Q2 = 40 (this is also the median). The third quartile is Q3 = 61.

Inter-quartile range (IQR): the difference between Q3 and Q1. The inter-quartile range of this example is 61 - 11 = 50. The middle half of the ordered data lie between 11 and 61.
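The position rule used here, where the k-th quartile is the observation at rank k(n + 1)/4, can be sketched as follows. The helper `quartile_by_position` is hypothetical and deliberately handles only the case where that rank is a whole number, as it is for n = 15; other sample sizes would need interpolation between neighbouring observations.

```python
def quartile_by_position(sorted_data, k):
    """Return the k-th quartile as the observation at rank k * (n + 1) / 4."""
    n = len(sorted_data)
    position, remainder = divmod(k * (n + 1), 4)
    if remainder:
        raise ValueError("rank is not a whole number; interpolation would be needed")
    return sorted_data[position - 1]  # ranks start at 1, list indices at 0

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]

q1 = quartile_by_position(data, 1)  # 4th observation
q2 = quartile_by_position(data, 2)  # 8th observation (the median)
q3 = quartile_by_position(data, 3)  # 12th observation
iqr = q3 - q1

print(q1, q2, q3, iqr)
```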
Which Measure To Use ?
Q: When is the mean better than median? When is the five number summary
better than the standard deviation?

Rules Of Thumb
A1: If outliers appear, or if your distribution is skewed, then the mean could be
affected, so use the median and the five number summary.

A2: If the distribution is reasonably symmetric and is free of outliers, then the
mean and standard deviation should be used.

Coefficient of Variation

• The coefficient of variation (CV) or relative standard deviation (RSD) is the sample standard deviation expressed as a percentage of the mean, i.e.

$$CV = \frac{s}{\bar{x}} \times 100\%$$

• The CV is not affected by multiplicative changes in scale.
• Consequently, it is a useful way of comparing the dispersion of variables measured on different scales.
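To see why the CV is unaffected by a multiplicative change of scale, the sketch below computes it for the same measurements expressed in centimetres and in metres. The heights are hypothetical values invented for this illustration, and `coefficient_of_variation` is a helper name chosen here.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

# Hypothetical heights, invented for illustration only
heights_cm = [150.0, 160.0, 170.0, 180.0]
heights_m = [h / 100 for h in heights_cm]  # same data on a different scale

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(round(cv_cm, 2), round(cv_m, 2))  # identical up to rounding error
```

Rescaling multiplies both the standard deviation and the mean by the same factor, so the ratio is unchanged.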
Accuracy

• Accuracy: the closeness of the measurements to the “actual” or “real” value of the physical quantity.
  – Statistically this is estimated using the standard error of the mean.
Descriptive Statistics: Tables and Graphs

Summarizing Data:

• Central Tendency (or Groups’ “Middle Values”)
  – Mean
  – Median
  – Mode
• Variation (or Summary of Differences Within Groups)
  – Range
  – Interquartile Range
  – Variance
  – Standard Deviation

Descriptive Statistics: Tables and Graphs

Frequency Distributions
• An (empirical) frequency distribution or histogram for a continuous variable presents the counts of observations grouped within pre-specified classes or groups.
• A relative frequency distribution presents the corresponding proportions of observations within the classes.
• A bar chart presents the frequencies for a categorical variable.
…Descriptive Statistics

Frequency Table
• Generally, the first approach to examining your data.
• Identifies distribution of variables overall
• Identifies potential outliers
– Investigate outliers as possible data entry errors
– Investigate a sample of others for data entry errors

Example
• A bag contains 25 candies.
• Raw data: [Figure: the 25 candies, one icon per candy]
• Statistical table:

Color     Frequency   Relative Frequency   Percent
Red       5           5/25 = .20           20%
Blue      3           3/25 = .12           12%
Green     2           2/25 = .08           8%
Orange    3           3/25 = .12           12%
Brown     8           8/25 = .32           32%
Yellow    4           4/25 = .16           16%
Table 1 in a paper

Describe your study population in a frequency table.

Table 1: Table Title

Name of variable (units of variable)   Frequency (n)   %   Mean (SD)
- Categories
- …
Total
Example

The ages of 50 tenured faculty at a state university:

34 48 70 63 52 52 35 50 37 43 53 43 52 44
42 31 36 48 43 26 58 62 49 34 48 53 39 45
34 59 34 66 40 59 36 41 35 36 62 34 38 28
43 50 30 43 32 44 58 53

• We choose to use 6 intervals.
• Minimum class width = (70 - 26)/6 = 7.33
• Convenient class width = 8
• Use 6 classes of width 8, starting at 25.
ESTABLISHING CLASSES
Sturges' rule:

Yule's rule:

To determine the class intervals and class limits:

• The lower limit of a class is the smallest value admitted in the class.
• The upper limit of a class is, conversely, the largest value admitted in the class.
• The class interval is computed approximately with the following formula:

The class mark is the central value of the class.
Age          Frequency   Relative Frequency   Percent
25 to < 33   5           5/50 = .10           10%
33 to < 41   14          14/50 = .28          28%
41 to < 49   13          13/50 = .26          26%
49 to < 57   9           9/50 = .18           18%
57 to < 65   7           7/50 = .14           14%
65 to < 73   2           2/50 = .04           4%
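The frequency table for the faculty ages can be rebuilt programmatically. This is a minimal sketch in plain Python: it forms the six classes of width 8 starting at 25 and counts the ages falling in each half-open interval [lower, upper). The variable names are illustrative.

```python
ages = [
    34, 48, 70, 63, 52, 52, 35, 50, 37, 43, 53, 43, 52, 44,
    42, 31, 36, 48, 43, 26, 58, 62, 49, 34, 48, 53, 39, 45,
    34, 59, 34, 66, 40, 59, 36, 41, 35, 36, 62, 34, 38, 28,
    43, 50, 30, 43, 32, 44, 58, 53,
]

# Class limits: 25, 33, 41, 49, 57, 65, 73 (six classes of width 8)
edges = list(range(25, 74, 8))

# Count the ages in each class [lower, upper)
counts = [
    sum(lower <= age < upper for age in ages)
    for lower, upper in zip(edges, edges[1:])
]

for (lower, upper), count in zip(zip(edges, edges[1:]), counts):
    relative = count / len(ages)
    print(f"{lower} to < {upper}: {count:2d}  {relative:.2f}  {relative:.0%}")
```

In SPSS the same table would normally come from recoding the variable into classes and requesting frequencies; the Python version is only meant to make the counting rule explicit.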
Describing the Distribution

• Shape? Skewed right.
• Outliers? No.
• What proportion of the tenured faculty are younger than 41? (14 + 5)/50 = 19/50 = .38
• What is the probability that a randomly selected faculty member is 49 or older? (9 + 7 + 2)/50 = 18/50 = .36
EXAMPLE
Take the total skull length (mm) for a subsample of 60 adult deer mice (I, II and III), drawn from a sample of 122 mice from Landry (2000). The sample size is n = 60.
• How many classes?
  – According to the Sturges and Yule rules, we should therefore define 7 classes.
• What will the class width be?
  – The range of variation of the variable is 0.5 mm.
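The formulas for the two rules did not survive in the text, so the sketch below assumes their commonly quoted forms: Sturges' rule as k = 1 + 3.322 log10(n) and Yule's rule as k = 2.5 × n^(1/4). Under that assumption both give about 7 classes for n = 60, which agrees with the example above. The function names are chosen for this illustration.

```python
import math

def sturges_classes(n):
    """Number of classes by Sturges' rule (assumed form: 1 + 3.322 * log10(n))."""
    return 1 + 3.322 * math.log10(n)

def yule_classes(n):
    """Number of classes by Yule's rule (assumed form: 2.5 * n ** (1/4))."""
    return 2.5 * n ** 0.25

n = 60  # sample size from the skull-length example
k_sturges = round(sturges_classes(n))
k_yule = round(yule_classes(n))

print(k_sturges, k_yule)
```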
… Frequency Distributions

Categorize on the basis of more than one variable at the same time

CROSS-TABULATION

                           Total
Democrats     24     1     25
Republican    19     6     25
Total         43     7     50
Graphs: Organizing Data
Pie charts (sector diagrams)

[Figures: Bar Chart; Pie Chart]
Pie Charts For Qualitative Data

A pie chart (also called a sector diagram) is a graph consisting of a circle divided into sectors whose areas are proportional to the various parts into which the whole quantity is divided.

[Figure: pie chart "Expenditure (in 100 FCFA)" with sectors Food, Clothing, Rent, Fuel, Misc.]
Pie Charts For Qualitative Data

Example: Represent the expenditures on various items of a family by a pie chart.

Items      Expenditure (in 100 FCFA)
Food       50
Clothing   30
Rent       20
Fuel       15
Misc.      35
Total      150
… Pie Charts For Qualitative Data

Steps for Constructing a Pie Chart:

Step 1: Draw a circle of any radius.

Step 2: Find the angle of each sector corresponding to the share of each component.
Angle of sector = (component part / whole quantity) × 360°

Items      Expenditure (in 100 FCFA)   Angle of sector (in degrees)
Food       50                          (50/150) × 360 = 120°
Clothing   30                          (30/150) × 360 = 72°
Rent       20                          (20/150) × 360 = 48°
Fuel       15                          (15/150) × 360 = 36°
Misc.      35                          (35/150) × 360 = 84°
Total      150                         360°

Step 3: Divide the circle into the various sectors by measuring the corresponding angles with a protractor.

[Figure: pie chart "Expenditure (in 100 FCFA)" with sectors Food (50), Clothing (30), Rent (20), Fuel (15), Misc. (35)]
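The sector angles follow from the rule angle = (component part / whole quantity) × 360. A minimal sketch of that calculation in plain Python is given below; the dictionary name is illustrative, and drawing the chart itself is left to SPSS or a plotting library.

```python
# Family expenditure from the example (in 100 FCFA)
expenditure = {"Food": 50, "Clothing": 30, "Rent": 20, "Fuel": 15, "Misc.": 35}

total = sum(expenditure.values())

# Angle of each sector in degrees: share of the total times 360
angles = {item: amount * 360 / total for item, amount in expenditure.items()}

for item, angle in angles.items():
    print(f"{item:<9}{expenditure[item]:>4}{angle:>8.0f}")
print(f"{'Total':<9}{total:>4}{sum(angles.values()):>8.0f}")
```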
Bar charts
Bar charts are a type of graph used to display and compare the number, frequency or other measure (e.g. mean) for different discrete categories of data.
Multiple Bar Chart

Multiple bar chart / grouped bar chart

A multiple bar chart shows two or more characteristics corresponding to the values of a common variable in the form of grouped bars, whose lengths are proportional to the values of the characteristics.

Example: Draw a multiple bar chart to show the area and production of cotton in Punjab for the following data:

Year      Area (000 acres)   Production (000 bales)
1965-66   2866               1588
1970-71   3233               2229
1975-76   3420               1937
Multiple Bar Charts
Example (Area and Production of Cotton):

[Figure: grouped bar chart "Area and Production of Cotton in Punjab"; x-axis: Years (1965-66, 1970-71, 1975-76); series: Area (000 acres) and Production (000 bales)]
Stacked or Component Bar Chart

Stacked bar charts

Stacked bar charts are similar to grouped bar charts in that they are used to display information about the sub-groups that make up the different categories. In stacked bar charts the bars representing the sub-groups are placed on top of each other to make a single column, or side by side to make a single bar. The overall height or length of the bar shows the total size of the category, whilst different colours or shadings indicate the relative contribution of the different sub-groups.
Example: Draw a component bar chart of the students’ enrollment data:

Classes   Total   Male   Female
BBA       65      33     32
MBA       60      32     28
MS/PHD    40      21     19
Component Bar Chart
Students’ Enrollment Data

[Figure: component (stacked) bar chart; x-axis: Classes (BBA, MBA, MS/PHD); y-axis: No of Students; series: Male and Female]
Box-Plots (Box-and-Whisker Plots)

A way to graphically portray almost all the descriptive statistics at once is the box-plot.

A box-plot shows:
• Upper and lower quartiles
• Mean
• Median
• Range
• Outliers (1.5 IQR)
Box plots have a box from LQ to UQ, with the median marked.
They portray a five-number summary of the data:
Minimum, LQ, Median, UQ, Maximum
except for outliers, which are identified separately.

Outlier = observation falling
below LQ - 1.5(IQR)
or above UQ + 1.5(IQR)

Ex. If LQ = 2, UQ = 10, then IQR = 8 and outliers lie above 10 + 1.5(8) = 22 (or below 2 - 1.5(8) = -10).
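The 1.5 × IQR rule can be sketched as a small helper that returns the two cut-offs and flags observations outside them. The function name and the sample values passed to it are hypothetical; only LQ = 2 and UQ = 10 come from the example above.

```python
def outlier_fences(lq, uq):
    """Lower and upper cut-offs given by the 1.5 * IQR rule."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr

# Quartiles from the example: LQ = 2, UQ = 10, so IQR = 8
lower_fence, upper_fence = outlier_fences(2, 10)

# Hypothetical observations, invented to show the rule in use
observations = [-12, 1, 4, 7, 9, 15, 22, 30]
outliers = [x for x in observations if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outliers)
```

An observation exactly on a fence (22 here) is not flagged, since the rule asks for values strictly below or above the cut-offs.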
Example 1: Box-plot
A box-and-whisker plot is based on the minimum and maximum values, the upper and lower quartiles and the median. This type of plot provides a good way to compare two or more samples.
Outliers

• An outlier is an observation which does not appear to belong with the other data.
• Outliers can arise because of a measurement or recording error, or because of equipment failure during an experiment, etc.
• An outlier might be indicative of a sub-population, e.g. an abnormally low or high value in a medical test could indicate the presence of an illness in the patient.
Outlier Boxplot
• Re-define the upper and lower limits of the boxplot (the whisker lines) as:
Lower limit = Q1 – 1.5(IQR), and
Upper limit = Q3 + 1.5(IQR)

• Note that the whisker lines may not go as far as these limits: each whisker ends at the most extreme data point that still lies within its limit.

• If a data point is < lower limit or > upper limit, the data point is considered to be an outlier.
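As a rough illustration of the limits, the whisker ends and the outliers, here is a small Python sketch. The data set `sample` is made up for the example, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`; SPSS and other packages may use slightly different quartile conventions, so the numbers can differ a little between tools.

```python
from statistics import quantiles


def boxplot_limits(data):
    """Limits, whisker ends and outliers for an outlier boxplot."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points inside the limits
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    return {
        "limits": (lower_limit, upper_limit),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


sample = [3, 5, 6, 7, 8, 9, 10, 12, 40]   # hypothetical measurements
print(boxplot_limits(sample))
```

For this sample the limits are 0 and 16, but the whiskers only reach 3 and 12, and the value 40 is plotted separately as an outlier.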

Example
A gardener collected data on two types of tomato. The box and whisker plot below
shows data for the masses in grams of the tomatoes in the two samples. Compare
and contrast the two types and advise the gardener which type of tomato he should
grow in future.

Proposed Solution

                      Type A      Type B
Median                52 grams    52 grams
Lower Quartile        49 grams    51 grams
Upper Quartile        57 grams    54 grams
Range                 14 grams    8 grams
Interquartile Range   8 grams     3 grams
Discussion and Conclusion of Results

From this table we can see that both types of tomato have the same average mass
because their medians are the same.

Comparing the ranges and interquartile ranges shows that there is far more
variation in the masses of the type A tomatoes, which means that the masses of
type B are more consistent than those of type A.

However, comparing the two box and whisker plots, and the upper quartiles, shows
that type A tomatoes will generally have a larger mass than those of type B.
Nevertheless, there will be some type A tomatoes that are lighter than any of type
B.

Taking all this together, the gardener would be best advised to plant type A
tomatoes in future as he is likely to get a better yield from them than from type B.

Exercises
Exercise 1

Exercise 2

Dotplots and Scatterplots
• A dotplot is the simplest graph for quantitative data. It plots the measurements as points on a horizontal axis, stacking the points that duplicate existing points.
• A scatterplot displays the relationship between two continuous variables. It is useful in the early stage of analysis when exploring data and determining whether a linear regression analysis is appropriate.
• Both may show outliers in your data.

• Example (dotplot): The set 4, 5, 5, 7, 6

    •
•   •   •   •
4   5   6   7
Stem and Leaf Plot
METHOD:
• Sort the data series
• Separate the sorted data series into leading digits (the
stem) and the trailing digits (the leaves)
e.g. In 13, the leading digit (stem) is 1 and trailing digit
(leaf) is 3 and in 21, the leading digit (stem) is 2 and trailing
digit (leaf) is 1.
• List all stems in a column from low to high
• For each stem, list all associated leaves
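The method above can be sketched in Python as follows. This is a minimal illustration for non-negative whole numbers, with the last digit as the leaf; the function names `stem_and_leaf` and `render` and the sample values are assumptions made for the example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: [leaves]} using the last digit of each value as its leaf."""
    plot = defaultdict(list)
    for v in sorted(values):            # step 1: sort the data series
        stem, leaf = divmod(v, 10)      # step 2: split into stem and leaf
        plot[stem].append(leaf)
    return dict(plot)


def render(plot):
    """List all stems from low to high, each followed by its leaves."""
    lines = []
    for stem in range(min(plot), max(plot) + 1):
        leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return "\n".join(lines)


print(render(stem_and_leaf([21, 13, 35, 24, 12])))
```

Stems with no data are still printed (with no leaves), so that gaps in the distribution stay visible.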

Example 1: Consider the temperature data example.
The sorted data from low to high is shown below:
12, 13, 17, 21, 24, 24, 26, 27, 28, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Here, use the 10's digit for the stem unit:

                 Stem | Leaf
13 is shown as      1 | 3
21 is shown as      2 | 1
35 is shown as      3 | 5
… Stem and Leaf Plot

Sorted data is: 12, 13, 17, 21, 24, 24, 26, 27, 28, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Completed stem-and-leaf diagram:

Stem | Leaf
   1 | 2 3 7
   2 | 1 4 4 6 7 8
   3 | 0 2 5 7 8
   4 | 1 3 4 6
   5 | 3 8
Stem-and-Leaf Plots

• Suppose a teacher wants to study the distribution of the scores for a 100-point unit exam given in his first-period biology class. The scores of the 35 students in the class are listed below.

82 77 49 84 44 98 93
71 76 65 89 95 78 69
89 64 88 54 87 91 80
44 85 93 89 55 62 79
90 86 75 74 99 62 96
To make a stem-and-leaf plot

First, make a vertical list of the stems. Since the test scores range from 44 to 99, the stems range from 4 to 9. Then, plot each number by placing the units digit (leaf) to the right of its correct stem. Thus, the score 82 is plotted by placing leaf 2 to the right of the stem 8. The complete stem-and-leaf plot is shown below.

Stem | Leaf
   4 | 9 4 4
   5 | 4 5
   6 | 5 9 4 2 2
   7 | 7 1 6 8 9 5 4
   8 | 2 4 9 9 8 7 0 5 9 6
   9 | 8 3 5 1 3 0 9 6

Note: a stem may have one or more digits. A leaf always has just one digit. 8|2 represents a score of 82.
Ex. 1: Use the information in the stem-and-leaf plot (shown here with the leaves sorted) to answer each question.

Stem | Leaf
   4 | 4 4 9
   5 | 4 5
   6 | 2 2 4 5 9
   7 | 1 4 5 6 7 8 9
   8 | 0 2 4 5 6 7 8 9 9 9
   9 | 0 1 3 3 5 6 8 9

1. What were the highest and the lowest scores? 99 and 44
2. Which test score occurred most frequently? 89 (3 times)
3. In which 10-point interval did the most students score? 80-89 (10 students)
4. How many students received a score of 70 or better? 25 students
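The four answers to Ex. 1 can be cross-checked directly from the 35 raw scores. The short Python sketch below is just one way to do it; the variable names are illustrative.

```python
from collections import Counter

scores = [82, 77, 49, 84, 44, 98, 93,
          71, 76, 65, 89, 95, 78, 69,
          89, 64, 88, 54, 87, 91, 80,
          44, 85, 93, 89, 55, 62, 79,
          90, 86, 75, 74, 99, 62, 96]

# 1. Highest and lowest scores
highest, lowest = max(scores), min(scores)

# 2. Most frequent score and how often it occurs
most_common_score, times = Counter(scores).most_common(1)[0]

# 3. 10-point interval (identified by its lower bound) with the most students
by_interval = Counter(s // 10 * 10 for s in scores)
busiest_interval, n_in_interval = by_interval.most_common(1)[0]

# 4. Number of students scoring 70 or better
n_70_or_better = sum(s >= 70 for s in scores)

print(highest, lowest)                   # 99 44
print(most_common_score, times)          # 89 3
print(busiest_interval, n_in_interval)   # 80 10
print(n_70_or_better)                    # 25
```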
Ex 2: Construct a stem-and-leaf display for the following data:

105 221 183 186 121 181 180 143
 97 154 153 174 120 168 167 141
245 228 174 199 181 158 176 110
163 131 154 115 160 208 158 133
207 180 190 193 194 133 156 123
134 178  76 167 184 135 229 146
218 157 101 171 165 172 158 169
199 151 142 163 145 171 148 158
150 175 149  87 160 237 150 135
196 201 200 176 150 170 118 149
SOLUTION
We will select as stem values the numbers 7, 8, 9, 10, 11, …, 24.
The resulting stem-and-leaf diagram is presented below.

Stem | Leaf                    | Frequency
   7 | 6                       | 1
   8 | 7                       | 1
   9 | 7                       | 1
  10 | 5 1                     | 2
  11 | 5 8 0                   | 3
  12 | 1 0 3                   | 3
  13 | 4 1 3 5 3 5             | 6
  14 | 2 9 5 8 3 1 6 9         | 8
  15 | 4 7 1 3 4 0 8 8 6 8 0 8 | 12
  16 | 3 0 7 3 0 5 0 8 7 9     | 10
  17 | 8 5 4 4 1 6 2 1 0 6     | 10
  18 | 0 3 6 1 4 1 0           | 7
  19 | 9 6 0 9 3 4             | 6
  20 | 7 1 0 8                 | 4
  21 | 8                       | 1
  22 | 1 8 9                   | 3
  23 | 7                       | 1
  24 | 5                       | 1

Inspection of this display immediately reveals that most of the data lie between 110 and 200 and that a central value is somewhere between 150 and 160.

Furthermore, the data are distributed approximately symmetrically about the central value.

The stem-and-leaf diagram enables us to determine quickly some important features of the data that were not immediately obvious in the original table.
Back-To-Back Stem and Leaf Plot

• A back-to-back stem and leaf plot is sometimes used to compare two sets of data or rounded and truncated values of the same data.

• In a back-to-back plot, the same stem is used for the leaves of both plots.

Note: Data with more than two digits can be rounded to two digits before plotting or can be truncated to two digits.
To truncate means to cut off. For a stem and leaf plot, you would truncate everything after the second digit.
• The number 355 would round to 36
• The number 355 would truncate to 35
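The difference between rounding and truncating to two digits can be sketched in Python. The helper name `two_digit_forms` is hypothetical, and the sketch assumes positive whole numbers with at least two digits and the usual "round half up" rule.

```python
def two_digit_forms(n):
    """Return (rounded, truncated) two-digit versions of an integer n >= 10."""
    if n < 10:
        raise ValueError("n must have at least two digits")
    scale = 10 ** (len(str(n)) - 2)       # 355 -> 10, 1342 -> 100
    truncated = n // scale                # cut off everything after the 2nd digit
    rounded = (n + scale // 2) // scale   # round half up
    return rounded, truncated


print(two_digit_forms(355))    # (36, 35)
```

One edge case to keep in mind: a value such as 995 rounds up to 100, which no longer fits in two digits, so it would need an extra stem.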
… Back-To-Back Stem and Leaf Plot

Example: The enrollments of eight colleges are listed below. Make a back-to-back stem and leaf plot of enrollments comparing rounded values and truncated values.

College   Enrollment
1         1342
2         1685
3         1013
4         2350
5         3781
6         1096
7         1960
8         3243
… Back-To-Back Stem and Leaf Plot

Step 1: Put data into order. Then round and truncate to two digits.

Step 2: Construct the back-to-back stem and leaf plot by using a single stem.
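The two steps can be sketched in Python for the eight enrollments. This is an illustrative sketch only: the helper names are made up, rounding is "half up", and putting the rounded values on the left and the truncated values on the right is an assumption, since the original figure is not reproduced here.

```python
enrollments = [1342, 1685, 1013, 2350, 3781, 1096, 1960, 3243]


def two_digits(n, rounded):
    """Reduce n to two digits, by rounding (half up) or by truncating."""
    scale = 10 ** (len(str(n)) - 2)
    return (n + scale // 2) // scale if rounded else n // scale


def leaves_by_stem(values):
    """Group sorted two-digit values as {stem: [leaves]}."""
    plot = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot.setdefault(stem, []).append(leaf)
    return plot


# Step 1: order the data, then round and truncate to two digits
rounded = leaves_by_stem(two_digits(n, True) for n in enrollments)
truncated = leaves_by_stem(two_digits(n, False) for n in enrollments)

# Step 2: one shared stem, rounded leaves on the left, truncated on the right
for stem in sorted(set(rounded) | set(truncated)):
    left = " ".join(str(leaf) for leaf in reversed(rounded.get(stem, [])))
    right = " ".join(str(leaf) for leaf in truncated.get(stem, []))
    print(f"{left:>9} | {stem} | {right}")
```

The left-hand leaves are printed in reverse so that, on both sides, the leaves grow as they move away from the stem.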
NOTE

Complete a stem-and-leaf plot for the following list of times:
7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2

Solution:
The stem-and-leaf plot uses only the last digit of each value for the leaf and all the digits before it for the stem. Here the ones digits will be the stem values, and the tenths will be the leaves.
First, reorder the list:

5.8, 5.9, 6.1, 6.2, 6.8, 7.3, 7.4, 7.6, 7.7, 8.1, 8.1, 8.2, 8.8, 9.2
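With the list ordered, the plot follows by splitting each time into its ones digit (stem) and tenths digit (leaf). A minimal Python sketch, assuming every value has exactly one decimal place:

```python
times = [7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2]

plot = {}
for t in sorted(times):
    # Work in tenths (7.6 -> 76) to avoid floating-point surprises
    stem, leaf = divmod(round(t * 10), 10)
    plot.setdefault(stem, []).append(leaf)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
# 5 | 8 9
# 6 | 1 2 8
# 7 | 3 4 6 7
# 8 | 1 1 2 8
# 9 | 2
```

Here 7 | 6 represents a time of 7.6.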

Exercises:
1. Complete a stem-and-leaf plot for the following list of values:
23.25, 24.13, 24.76, 24.81, 24.98, 25.31, 25.57, 25.89, 26.28, 26.34, 27.09

Using the last digit (the hundredths digit) of these numbers as the leaf, the stem-and-leaf plot would be enormously long, because these values are so spread out. It is therefore reasonable, instead of working with the given numbers, to round each to the nearest tenth and then use those new values for the plot.

Rounding gives the following list:

23.3, 24.1, 24.8, 24.8, 25.0, 25.3, 25.6, 25.9, 26.3, 26.3, 27.1

Then the plot looks like this (23 | 3 represents 23.3):

Stem | Leaf
  23 | 3
  24 | 1 8 8
  25 | 0 3 6 9
  26 | 3 3
  27 | 1
Exercise
The following scores represent the final examination grade for an elementary
statistics course:

(a) Construct a stem-and-leaf plot for the examination grades in which the stems are 1, 2, 3, ..., 9.
(b) Set up a relative frequency distribution.
(c) Construct a relative frequency histogram, draw an estimate of the graph of the
distribution and discuss the skewness of the distribution.
(d) Compute the sample mean, sample median, and sample standard deviation.

Inferential Statistics: uses sample data to evaluate the credibility of a hypothesis about a population

NULL Hypothesis:

NULL (nullus, Latin): "not any" → no differences between means

H₀: μ₁ = μ₂ ("H-naught")

Always testing the null hypothesis
Hypothesis: Scientific or alternative hypothesis

Predicts that there are differences between the groups

H₁: μ₁ ≠ μ₂
Hypothesis
A statement about what findings are expected

null hypothesis
"the two groups will not differ"

alternative hypothesis
"group A will do better than group B"
"group A and B will not perform the same"
Inferential Statistics

When making comparisons between two sample means, there are two possibilities for the true state of affairs:
• Null hypothesis is false
• Null hypothesis is true

and two possible decisions:
• Reject the null hypothesis
• Do not reject the null hypothesis
Chain of Reasoning for Inferential Statistics

Population → (selection) → Sample → (measure) → Data → (inference, based on probability) → Population

Are our inferences valid? The best we can do is to calculate probabilities about our inferences.
Possible Outcomes in Hypothesis Testing (Decision)

Decision   Null is True       Null is False
Accept     Correct decision   Error (Type II)
Reject     Error (Type I)     Correct decision

Type I Error: Rejecting a true null hypothesis
Type II Error: Accepting (failing to reject) a false null hypothesis
ALPHA (α)
The probability of making a Type I error. It depends on the criterion you use to accept or reject the null hypothesis, i.e. the significance level: the smaller you make alpha, the less likely you are to commit a Type I error. With α = 0.05, a Type I error will occur 5% of the time when the null hypothesis is true (5 chances in 100 of declaring a difference that was really due to sampling error).

Possible Outcomes in Hypothesis Testing

Decision   Null is True                    Null is False
Accept     Correct decision                Error (Type II)
Reject     Error (Type I), probability α   Correct decision

Alpha (α) is the probability of a Type I error: the null is rejected although the difference observed is really just sampling error.
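The meaning of α can be illustrated with a small simulation: draw two samples from the same population (so the null hypothesis is true by construction), test for a difference in means, and count how often the null is wrongly rejected. The sketch below uses a two-sided two-sample z-test with known standard deviation; the test choice, sample size, number of trials and seed are all assumptions made for the illustration, not something prescribed by the lecture.

```python
import random
from statistics import NormalDist, fmean


def type_i_error_rate(alpha=0.05, n=20, trials=4000, sigma=1.0, seed=1):
    """Share of simulated tests that reject H0 although H0 is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    std_err = sigma * (2 / n) ** 0.5               # SE of the difference in means
    rejections = 0
    for _ in range(trials):
        # Both groups come from the same population: no real difference
        a = [rng.gauss(0, sigma) for _ in range(n)]
        b = [rng.gauss(0, sigma) for _ in range(n)]
        z = (fmean(a) - fmean(b)) / std_err
        if abs(z) > z_crit:
            rejections += 1                        # a Type I error
    return rejections / trials


print(type_i_error_rate(alpha=0.05))   # should come out near 0.05
print(type_i_error_rate(alpha=0.01))   # a smaller alpha gives fewer Type I errors
```

The observed rejection rate fluctuates around α because of simulation noise, but it shrinks as α is made smaller.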
BETA (β)
The probability of making a Type II error, which occurs when we fail to reject the null hypothesis when we should have rejected it.

Possible Outcomes in Hypothesis Testing

Decision   Null is True       Null is False
Accept     Correct decision   Error (Type II), probability β
Reject     Error (Type I)     Correct decision

Beta (β) is the probability of a Type II error: the difference observed is real, but we failed to reject the null.

POWER: ability to reduce Type II error
POWER: ability to reduce Type II error
(1 – β) – Power Analysis

The power to find an effect if an effect is present. Power can be increased by:

1. Increasing our n (sample size)

2. Decreasing variability

3. Making more precise measurements

Effect Size: measure of the size of the difference between means attributed to the treatment
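How sample size, variability and effect size drive power can be illustrated with a textbook formula. The sketch below assumes a one-sided two-sample z-test with known, equal standard deviation sigma and n observations per group, for which power = 1 – Φ(z₁₋α – δ / (σ·√(2/n))). The function name and the numbers used are hypothetical; a real power analysis (for example for a t-test) would normally be run in dedicated software.

```python
from statistics import NormalDist


def power_two_sample_z(delta, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided two-sample z-test, n per group.

    delta: true difference between the two population means (effect size
    in raw units); sigma: common, known standard deviation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)                # rejection threshold
    shift = delta / (sigma * (2 / n) ** 0.5)     # true difference in SE units
    return 1 - z.cdf(z_crit - shift)


# 1. Power grows with n
for n in (10, 20, 40, 80):
    print(n, round(power_two_sample_z(delta=0.5, sigma=1.0, n=n), 3))

# 2./3. Power grows when variability (or measurement noise) shrinks
print(round(power_two_sample_z(delta=0.5, sigma=2.0, n=20), 3))
print(round(power_two_sample_z(delta=0.5, sigma=1.0, n=20), 3))
```

When delta is 0 there is no effect to find, and the "power" reduces to α, the Type I error rate.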
Brief Concept of Statistical Software

There are many software packages for statistical analysis and visualization of data.
Some of them are SAS (Statistical Analysis System), S-plus, R, Python, Matlab,
Minitab, BMDP, Stata, SPSS, StatXact, Statistica, LISREL, JMP, GLIM, HIL, MS
Excel, etc. We will discuss MS Excel and SPSS in brief.

Some useful websites for more information on statistical software:

http://www.galaxy.gmu.edu/papers/astr1.html
http://ourworld.compuserve.com/homepages/Rainer_Wuerlaender/statsoft.htm#archiv
http://www.R-project.org
• Now you are qualified to use descriptive statistics!
• Questions?
