Lecture 1 2024
WITH SPSS
14-Jan-25 1
Course Outline
A. Overview of the basic concepts in Statistics
Some Key Definitions
Descriptive and Inferential Statistics
Data Presentation
B. Levels of Measurement
C. Levels of Measurement
D. Basic Statistics for Data Science
• Measures of central tendency and of dispersion
• The Normal Distribution
• Hypothesis Testing
• Contingency Analysis / Chi-squared statistics
• Correlation and Regression (inferences)
• One-way ANOVA
E. Data conception, Collection, Preparation, Entry, and Analysis using SPSS
The circular process of research:
• A decision is made as to how to collect the data
• The data is collected
• The data is summarized and analyzed
Descriptive & Inferential Statistics
Descriptive Statistics Inferential Statistics
3 Types
2. Graphical Representations
Terminology
Populations & Samples
Population Sample
1/14/2025 Prof. Dr. Ndoh Mbue
In statistics:
Reasons:
• Cost
• Accuracy
Accuracy
Data from a sample sometimes leads
to more accurate conclusions than data
from the entire population
Levels of Measurement
What is Measurement?
The assignment of numerals to objects or events according to rules.
Numerals are labels that have no inherent meaning, for example zip
codes, or automobile license plates.
■ The rules for assigning labels to properties of variables are the most
important components of measurement, because the result of poor rules is
meaningless outcomes.
Nominal
Ordinal
Interval
Ratio
These levels differ in how closely they approach the structure of the number
system we use.
The conclusions that can be drawn from research depend on the statistical
analysis used.
Possible data types and levels of measure.
***The type of data you have dictates the type of analysis you will perform.
Nominal Scale
■ Ordering of categories does not exist. We cannot say one category is better or
worse, or more or less than another.
Example:
socioeconomic
class
grades
preferences
• For example, runners in the 100 meter dash finish 1st, 2nd, 3rd etc. Is the
number of seconds between 1st and 2nd place the same as that between 2nd
and 3rd place? Not necessarily.
Most questionnaires use Likert type items. For example, we may ask teachers
about their job satisfaction.
Asking whether a teacher is very satisfied, satisfied, neutral, dissatisfied, or very
dissatisfied is using an ordinal scale of measurement.
• If a change from 1 to 2 has the same strength as a change from 4 to 5, then we would call it an
interval level measurement (if not, then it’s just an ordinal qualitative
measurement).
Ratio Scales
Ratio scales have all of the characteristics of the nominal,
ordinal and interval scales. In addition, however, ratio scales
have a true zero.
There are true ratios. One can use all mathematical
operations on this scale.
Examples:
weight
height
time
distance
* 10 miles is twice as long as 5 miles. 0 miles is no distance.
• In our descriptions of data in this course, we will assume that we are using
ratio scales most of the time. We call these PARAMETRIC STATISTICS.
• However, there will be times when all we have to work with are ordinal
scales. When we use these scales, our data will be rank ordered. We will
call these NONPARAMETRIC STATISTICS.
Types of Variables
• A variable is a characteristic that changes or varies over time and/or for
different individuals or objects under consideration.
• Variables are the quantities measured in a sample. They may be classified as:
Qualitative (Categorical)
– Nominal: e.g. gender, blood group, hair color
– Ordinal (ranked): e.g. mild, moderate or severe weather
Quantitative (Numerical)
– Discrete
– Continuous
– Independent/Predictor
• called a Factor when controlled by experimenter. It is often nominal
(e.g. treatment)
• Covariate when not controlled.
Basic Statistics for Data
Science
Parameters & Statistics
How many variables have you measured?
Measures of location (or of central tendency)
Mean
Let y denote a quantitative variable, with
observations y1 , y2 , y3 , … , yn
Mean:
$$\bar{y} = \frac{y_1 + y_2 + \dots + y_n}{n} = \frac{\sum y_i}{n}$$
Sample Mean
The arithmetic mean (or, simply, mean) is computed by summing all the
observations in the sample and dividing the sum by the number of
observations.
Median
Example of Median
Measurements (x): 3, 5, 5, 1, 7, 2, 6, 7, 0, 4 (sum = 40)
Measurements ranked: 0, 1, 2, 3, 4, 5, 5, 6, 7, 7 (sum = 40)
• Median: (4+5)/2 = 4.5
• Notice that only the two central values are used in the computation.
• The median is not sensitive to extreme values.
Example of Mode
Measurements (x): 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
• In this case the data have two modes: 5 and 7
• Both measurements are repeated twice
Example of Mode
Measurements (x): 3, 5, 1, 1, 4, 7, 3, 8, 3
• Mode: 3
• Notice that it is possible for a data set not to have any mode.
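The mean, median and mode examples above can be checked with a short sketch using Python's standard `statistics` module. The variable names are illustrative only; the data are the ten measurements from the median example.

```python
from statistics import mean, median, multimode

# Ten measurements from the median and mode examples
measurements = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

sample_mean = mean(measurements)        # sum of observations / number of observations
sample_median = median(measurements)    # n is even, so the two central ranked values are averaged
sample_modes = multimode(measurements)  # every value tied for the highest frequency

print("mean:", sample_mean)
print("median:", sample_median)
print("modes:", sample_modes)
```

`multimode` is used rather than `mode` because it returns all tied values, which matches the two-mode case on the slide.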
Properties of mean and median
• Mean valid for interval scales, median for interval or ordinal scales
In The Presence Of Outliers
Q: Do outliers affect the Mean and Median?
Measures of Central Tendency: Which Measure to Choose?
The median is often preferred when the data contain extreme values, because
it is not sensitive to them. For example, home prices for a region are
usually reported as a median.
In some situations it makes sense to report both the mean and the
median.
Common Distributional Shapes:
Mean, Median, Mode
• Negatively skewed: Mean < Median < Mode
• Symmetric (not skewed): Mean = Median = Mode
• Positively skewed: Mode < Median < Mean
…Skewness
• Measures asymmetry of data
– Positive or right skewed: Longer right tail
– Negative or left skewed: Longer left tail
Interpretation:
Kurtosis
Interpreting Graphs: Outliers
No Outliers Outlier
…Descriptive Statistics
• Variance
Range
To get the range for a variable, you subtract its lowest value from
its highest value.
Class A: IQs of 13 Students
102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110
Class B: IQs of 13 Students
127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109
Class A Range = 140 - 89 = 51
Class B Range = 162 - 80 = 82
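As a quick check of the two ranges, here is a minimal sketch. The helper name `value_range` is an assumption made for illustration, not something from the slides.

```python
# IQ scores for the two classes of 13 students
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

print("Class A range:", value_range(class_a))
print("Class B range:", value_range(class_b))
```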
The sample variance of the n observations is
$$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{(y_1 - \bar{y})^2 + \dots + (y_n - \bar{y})^2}{n - 1}$$
and the sample standard deviation is
$$s = \sqrt{s^2}$$
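Variance and standard deviation
Variance:
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{SS}{n - 1} = \frac{SS}{df}$$
• ‘Sum of Squares’ = SS
• degrees of freedom (df) = n-1
Standard deviation of sample:
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{SS}{n - 1}} = \sqrt{\frac{SS}{df}}$$
Standard deviation for whole population:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$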
Example
• For those in the student sample who attend religious
services at least once a week (n = 9 of the 60),
• y = 2, 3, 7, 5, 6, 7, 5, 6, 4
$$\bar{y} = 5.0, \qquad s^2 = \frac{(2 - 5)^2 + (3 - 5)^2 + \dots + (4 - 5)^2}{9 - 1} = \frac{24}{8} = 3.0, \qquad s = \sqrt{3.0} \approx 1.7$$
For the entire sample (n = 60), mean = 3.0 and standard deviation = 1.6: the
entire sample tends to have similar variability but to be more liberal.
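The worked example can be reproduced step by step. This is a sketch with illustrative variable names; it computes the sum of squares by hand and then compares the result with the standard library's `variance` and `stdev`, which also divide by n - 1.

```python
from statistics import mean, stdev, variance

# Scores of the n = 9 students who attend religious services at least once a week
y = [2, 3, 7, 5, 6, 7, 5, 6, 4]

y_bar = mean(y)                              # sample mean
ss = sum((yi - y_bar) ** 2 for yi in y)      # sum of squared deviations (SS)
df = len(y) - 1                              # degrees of freedom, n - 1
s2 = ss / df                                 # sample variance
s = s2 ** 0.5                                # sample standard deviation

print("mean:", y_bar, "SS:", ss, "variance:", s2, "sd:", round(s, 1))
# Library versions, for comparison with the hand computation
print("statistics.variance:", variance(y), "statistics.stdev:", round(stdev(y), 1))
```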
• Properties of the standard deviation:
• s ≥ 0, and s = 0 only if all observations are equal
• s increases with the amount of variation around the mean
• Division by n - 1 (not n) is due to technical reasons (later)
• s depends on the units of the data (e.g. measure euro vs $)
• Like the mean, s is affected by outliers
Measures of position
p = 50: median
p = 25: lower quartile (LQ)
p = 75: upper quartile (UQ)
The median is regarded as the second quartile (Q2). The interquartile range is
the difference between the upper quartile and the lower quartile.
… Calculating the range and the interquartile range
To begin, arrange the values in ascending order. Doing so lets you assign a rank
to each data point.
Rank  Value
1     6
2     7
3     15
4     36
5     39
6     41
7     41
8     43
9     43
10    47
11    49
Next, find the rank of the median. As seen in the section on the median, when the
number of data points is odd, the median is the value of the point with rank
(n + 1) ÷ 2 = (11 + 1) ÷ 2 = 6.
The median is the data point of rank 6, so Q2 = 41. There are therefore 5 values
on each side.
Split the half below the median in two. The lower quartile is the value of the
point with rank (5 + 1) ÷ 2 = 3, which gives Q1 = 15. The half above the median is
split in two in the same way. The upper quartile is the value of the point with
rank 6 + 3 = 9, which gives Q3 = 43.
Once the quartiles are found, measuring the dispersion is easy. The interquartile
range is Q3 - Q1 = 43 - 15 = 28. The semi-interquartile range is 28 ÷ 2 = 14, and
the range is 49 - 6 = 43.
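The quartile procedure above (take the median, then the median of each half) can be sketched as follows. The function name `quartiles_by_halves` is an assumption for illustration. Other software, SPSS included, may use a different quartile convention and return slightly different values for the same data.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    When n is odd, the median itself is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

values = [6, 7, 15, 36, 39, 41, 41, 43, 43, 47, 49]
q1, q2, q3 = quartiles_by_halves(values)
iqr = q3 - q1                            # interquartile range
semi_iqr = iqr / 2                       # semi-interquartile range
data_range = max(values) - min(values)   # range

print("Q1:", q1, "Q2:", q2, "Q3:", q3)
print("IQR:", iqr, "semi-IQR:", semi_iqr, "range:", data_range)
```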
Exercise: The heights, in centimetres (cm), of 24 pupils in a secondary-school
class were recorded.
Height (cm)  151  153  155  158  160  165
Frequency      2    5    8    5    3    1
Solution
Inter-quartile range
Methods of Variability Measurement
Which Measure To Use ?
Q: When is the mean better than median? When is the five number summary
better than the standard deviation?
Rules Of Thumb
A1: If outliers appear, or if your distribution is skewed, then the mean could be
affected, so use the median and the five number summary.
A2: If the distribution is reasonably symmetric and is free of outliers, then the
mean and standard deviation should be used.
Coefficient of Variation
$$CV = \frac{s}{\bar{x}} \times 100\%$$
• The CV is not affected by multiplicative changes in scale
• Consequently, a useful way of comparing the dispersion of
variables measured on different scales
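A small sketch of the scale-invariance property. The data are reused from the standard deviation example purely for illustration, and the factor of 100 is a hypothetical change of units.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = (s / mean) * 100, expressed as a percentage."""
    return stdev(values) / mean(values) * 100

y = [2, 3, 7, 5, 6, 7, 5, 6, 4]
y_rescaled = [100 * v for v in y]   # hypothetical change of units

cv_original = coefficient_of_variation(y)
cv_rescaled = coefficient_of_variation(y_rescaled)

print(round(cv_original, 1), round(cv_rescaled, 1))
```

Both the standard deviation and the mean are multiplied by the same factor, so the ratio is unchanged.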
Accuracy
Descriptive Statistics: Tables and Graphs
Summarizing Data:
Frequency Distributions
• An (Empirical) Frequency Distribution or Histogram
for a continuous variable presents the counts of
observations grouped within pre-specified classes or
groups
…Descriptive Statistics
Frequency Table
• Generally, the first approach to examining your data.
• Identifies distribution of variables overall
• Identifies potential outliers
– Investigate outliers as possible data entry errors
– Investigate a sample of others for data entry errors
Example
• A bag contains 25 candies:
• Raw Data:
m m m m m m m m m m
m m m m m m m m m m
m m m m m
• Statistical Table:
-
- Categories
-
Total
Example
CHOOSING THE CLASSES
Sturges' rule:
Yule's rule:
Age Tally Frequency Relative Frequency Percent
25 to < 33 1111 5 5/50 = .10 10%
33 to < 41 1111 1111 1111 14 14/50 = .28 28%
41 to < 49 1111 1111 111 13 13/50 = .26 26%
49 to < 57 1111 1111 9 9/50 = .18 18%
57 to < 65 1111 11 7 7/50 = .14 14%
65 to < 73 11 2 2/50 = .04 4%
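The relative frequency and percent columns follow directly from the class frequencies. A minimal sketch, with illustrative variable names:

```python
# Class labels and frequencies from the age table (n = 50)
age_classes = ["25 to < 33", "33 to < 41", "41 to < 49",
               "49 to < 57", "57 to < 65", "65 to < 73"]
frequencies = [5, 14, 13, 9, 7, 2]

total = sum(frequencies)
relative = [f / total for f in frequencies]   # relative frequency = frequency / n

for label, f, rel in zip(age_classes, frequencies, relative):
    print(f"{label:<12}{f:>4}{rel:>8.2f}{rel:>7.0%}")
```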
Describing the
Distribution
EXAMPLE
Take the total skull length (mm) for a subsample of 60 adult deer mice (I, II and
III), drawn from a sample of 122 mice from Landry (2000).
The sample size is n=60.
How many classes?
According to Sturges' and Yule's rules, we should therefore define 7 classes.
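The formulas for the two rules did not survive on the slide. The sketch below assumes their commonly quoted forms, k = 1 + log2(n) for Sturges and k = 2.5 × n^(1/4) for Yule. These are assumptions, not taken from the lecture, but both agree with the 7 classes stated for n = 60.

```python
import math

def sturges_classes(n):
    """Sturges' rule (assumed form): k = 1 + log2(n)."""
    return 1 + math.log2(n)

def yule_classes(n):
    """Yule's rule (assumed form): k = 2.5 * n ** (1/4)."""
    return 2.5 * n ** 0.25

n = 60
k_sturges = round(sturges_classes(n))
k_yule = round(yule_classes(n))
print("Sturges:", k_sturges, "Yule:", k_yule)
```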
… Frequency Distributions
total
Democrats 24 1 25
Republican 19 6 25
Total 43 7 50
Graphs: Organizing Data
Sector diagrams (pie charts)
Bar Chart
Pie Chart
Pie Charts For Qualitative Data
Pie Chart
Expenditure (in 100 rupees)
Food
Clothing
Rent
Fuel
Misc.
Pie Charts For Qualitative Data
Items Expenditure
(in 100 FCFA)
Food 50
Clothing 30
Rent 20
Fuel 15
Misc. 35
Total 150
… Pie Charts For Qualitative Data
… Pie Charts For Qualitative Data
Items     Expenditure (in 100 FCFA)  Angle of sector (in degrees)
Food      50                         120°
Clothing  30                         72°
Rent      20                         48°
Fuel      15                         36°
Misc.     35                         84°
Total     150                        360°
[Figure: pie chart of expenditure (in 100 FCFA) with sectors Food 50, Clothing 30, Rent 20, Fuel 15, Misc. 35]
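Each sector angle is the item's share of the total, scaled to 360 degrees. A short sketch with illustrative variable names:

```python
# Expenditure by item, in units of 100 FCFA
expenditure = {"Food": 50, "Clothing": 30, "Rent": 20, "Fuel": 15, "Misc.": 35}

total = sum(expenditure.values())

# angle of sector = (item expenditure / total expenditure) * 360 degrees
angles = {item: amount * 360 / total for item, amount in expenditure.items()}

for item, angle in angles.items():
    print(f"{item:<10}{expenditure[item]:>5}{angle:>8.0f}°")
print(f"{'Total':<10}{total:>5}{sum(angles.values()):>8.0f}°")
```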
Bar Charts
Bar charts are a type of graph that are used to display and compare the number,
frequency or other measure (e.g. mean) for different discrete categories of data
Multiple Bar Chart
[Figure: multiple bar chart of Area (000 acres) and Production (000 bales) by year, for 1965-66, 1970-71 and 1975-76]
Stacked or Component Bar Chart
Example: Draw a component bar chart of the students’ enrollment data:
Classes  Total  Male  Female
BBA      65     33    32
MBA      60     32    28
MS/PHD   40     21    19
Component Bar Chart
Students’ Enrollment Data
Classes  Total  Male  Female
BBA      65     33    32
MBA      60     32    28
MS/PHD   40     21    19
[Figure: component bar chart of enrollment by class (BBA, MBA, MS/PHD), each bar split into Male and Female]
Box Plots (Box-and-Whisker Plots)
A box plot has a box from LQ to UQ, with the median marked.
They portray a five-number summary of the data:
Minimum, LQ, Median, UQ, Maximum
except for outliers identified separately
Example 1: Box-plot
A box and whisker plot is based on the minimum and maximum values, the upper and lower
quartiles and the median. This type of plot provides a good way to compare two or more
samples.
Outliers
Outlier Boxplot
• Re-define the upper and lower limits of the boxplot (the whisker lines) as:
Lower limit = Q1 - 1.5 × IQR, and
Upper limit = Q3 + 1.5 × IQR
• Observations beyond these limits are plotted individually as outliers.
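The outlier limits can be sketched as below. The quartiles are computed with a median-of-halves convention, which is an assumption (statistical packages may differ). The data reuse the eleven values from the quartile example plus one hypothetical extreme value, 120, added only to show an outlier being flagged.

```python
from statistics import median

def lower_upper_quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves (assumed convention)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(upper)

def outlier_limits(values):
    """Lower limit = Q1 - 1.5*IQR, upper limit = Q3 + 1.5*IQR."""
    q1, q3 = lower_upper_quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# The value 120 is hypothetical, added for illustration only
values = [6, 7, 15, 36, 39, 41, 41, 43, 43, 47, 49, 120]
lower_limit, upper_limit = outlier_limits(values)
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print("limits:", lower_limit, upper_limit)
print("outliers:", outliers)
```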
Example
A gardener collected data on two types of tomato. The box and whisker plot below
shows data for the masses in grams of the tomatoes in the two samples. Compare
and contrast the two types and advise the gardener which type of tomato he should
grow in future.
Proposed Solution
Type A Type B
Median 52 grams 52 grams
Lower Quartile 49 grams 51 grams
Upper Quartile 57 grams 54 grams
Range 14 grams 8 grams
Interquartile Range 8 grams 3 grams
Discussion and Conclusion of Results
From this table we can see that both types of tomato have the same average mass
because their medians are the same.
Comparing the ranges and interquartile ranges shows that there is far more
variation in the masses of the type A tomatoes, which means that the masses of
type B are more consistent than those of type A.
However, comparing the two box and whisker plots, and the upper quartiles, shows
that type A tomatoes will generally have a larger mass than those of type B.
Nevertheless, there will be some type A tomatoes that are lighter than any of type
B.
Taking all this together, the gardener would be best advised to plant type A
tomatoes in future as he is likely to get a better yield from them than from type B.
Exercises
Exercise 1
Exercise 2
Scatterplots
• Display the relationship between two continuous variables
• Useful in the early stage of analysis when exploring data and determining if a
linear regression analysis is appropriate
• By contrast, a dotplot is the simplest graph for a single quantitative variable: it
plots the measurements as points on a horizontal axis, stacking the points
that duplicate existing points.
Stem and Leaf Plot
METHOD:
• Sort the data series
• Separate the sorted data series into leading digits (the
stem) and the trailing digits (the leaves)
e.g. In 13, the leading digit (stem) is 1 and trailing digit
(leaf) is 3 and in 21, the leading digit (stem) is 2 and trailing
digit (leaf) is 1.
• List all stems in a column from low to high
• For each stem, list all associated leaves
Example 1: Consider the temperature data example.
The sorted data from low to high is shown below:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53,
58
Here, use the 10’s digit for the stem unit:
Stem | Leaf
13 is shown as 1 | 3
21 is shown as 2 | 1
35 is shown as 3 | 5
… Stem and Leaf Plot
Sorted data is:12, 13, 17, 21, 24, 24, 26, 27, 28, 30, 32,
35, 37, 38, 41, 43, 44, 46, 53, 58
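The method can be sketched for two-digit data as follows, using the tens digit as the stem and the units digit as the leaf. The function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's units digit (leaf) under its tens digit (stem)."""
    plot = defaultdict(list)
    for v in sorted(values):          # sort first so leaves appear in order
        stem, leaf = divmod(v, 10)    # e.g. 13 -> stem 1, leaf 3
        plot[stem].append(leaf)
    return dict(plot)

temperatures = [12, 13, 17, 21, 24, 24, 26, 27, 28, 30, 32,
                35, 37, 38, 41, 43, 44, 46, 53, 58]

plot = stem_and_leaf(temperatures)
print("Stem | Leaf")
for stem in sorted(plot):
    print(f"{stem:>4} | {' '.join(str(leaf) for leaf in plot[stem])}")
```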
Stem-and-Leaf Plots
82 77 49 84 44 98 93
71 76 65 89 95 78 69
89 64 88 54 87 91 80
44 85 93 89 55 62 79
90 86 75 74 99 62 96
To make a stem-and-leaf plot
First, make a vertical list of the stems. Since the test scores range from 44 to
99, the stems range from 4 to 9. Then, plot each number by placing the units digit
(leaf) to the right of its correct stem. Thus, the score 82 is plotted by placing
leaf 2 to the right of the stem 8. The complete stem-and-leaf plot is shown below.
Stem | Leaf
4 | 9 4 4
5 | 4 5
6 | 5 9 4 2 2
7 | 7 1 6 8 9 5 4
8 | 2 4 9 9 8 7 0 5 9 6
9 | 8 3 5 1 3 0 9 6
Note: a stem may have one or more digits.
A leaf always has just one digit. 8|2 represents a score of 82.
Ex. 1: Use the information in the stem-and-leaf plots above to answer each
question.
SOLUTION
We will select as stem values the numbers 7, 8, 9, 10, 11, …, 24.
The resulting stem-and-leaf diagram is presented in the following figure.
Step 1: Put data into order. Then round and truncate to two digits.
Step 2:
Construct Back-to-back stem
and leaf plot by using a single
stem.
NOTE
Complete a stem-and-leaf plot for the following list of times:
7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.8, 7.4, 7.7, 8.2
Solution:
The stem-and-leaf plot only looks at the last digit (for the leaves) and all the digits
before (for the stem). The ones digits will be the stem values, and the tenths will be
the leaves.
Now, first, reorder this list:
5.8, 5.9, 6.1, 6.2, 6.8, 7.3, 7.4, 7.6, 7.7, 8.1, 8.1, 8.2, 8.8, 9.2
Exercises:
1. Complete a stem-and-leaf plot for the following list of values:
23.25, 24.13, 24.76, 24.81, 24.98, 25.31, 25.57, 25.89, 26.28, 26.34, 27.09
Using the last digit, the hundredths digit, for these numbers, the stem-and-leaf plot
will be enormously long, because these values are so spread out. It is therefore
reasonable if, instead of working with the given numbers, we rather round each to
the nearest tenth, and then use those new values for the plot.
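The rounding step can be sketched as follows. `Decimal` with `ROUND_HALF_UP` is used here as a design choice, because Python's built-in `round` rounds exact halves to the nearest even digit (23.25 would become 23.2, not 23.3). Variable names are illustrative.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

raw = ["23.25", "24.13", "24.76", "24.81", "24.98", "25.31",
       "25.57", "25.89", "26.28", "26.34", "27.09"]

# Round each value to the nearest tenth, with halves rounded up
rounded = [Decimal(v).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP) for v in raw]

plot = defaultdict(list)
for v in sorted(rounded):
    stem = int(v)              # whole-number part, e.g. 24
    leaf = int(v * 10) % 10    # tenths digit, e.g. 8 for 24.8
    plot[stem].append(leaf)

print("Stem | Leaf")
for stem in sorted(plot):
    print(f"{stem:>4} | {' '.join(str(leaf) for leaf in plot[stem])}")
```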
Exercise
The following scores represent the final examination grade for an elementary
statistics course:
(a) Construct a stem-and-leaf plot for the examination grades in which the stems are
1, 2, 3, ..., 9.
(b) Set up a relative frequency distribution.
(c) Construct a relative frequency histogram, draw an estimate of the graph of the
distribution and discuss the skewness of the distribution.
(d) Compute the sample mean, sample median, and sample standard deviation.
Inferential Statistics: uses sample data
to evaluate the credibility of a hypothesis
about a population
NULL Hypothesis (“H-naught”):
H0 : μ1 = μ2
Always testing the null hypothesis
ALTERNATIVE Hypothesis:
H1 : μ1 ≠ μ2
Hypothesis
A statement about what findings are expected
null hypothesis
"the two groups will not differ"
alternative hypothesis
"group A will do better than group B"
"group A and B will not perform the same"
Inferential Statistics
[Figure: a sample is selected from the population and measured; the sample data are used, via probability, to make inferences about the population]
Possible Outcomes in Hypothesis Testing
                                                    Reject the Null     Failed to reject the Null
Difference observed is really just sampling error   Type I Error        Correct Decision
Difference observed is real                         Correct Decision    Type II Error
1. Increase our n
2. Decrease variability
Many software packages are available for statistical analysis and visualization of
data. Some of them are SAS (Statistical Analysis System), S-plus, R, Python, Matlab,
Minitab, BMDP, Stata, SPSS, StatXact, Statistica, LISREL, JMP, GLIM, HIL, MS
Excel etc. We will discuss MS Excel and SPSS in brief.
http://www.galaxy.gmu.edu/papers/astr1.html
http://ourworld.compuserve.com/homepages/Rainer_Wuerlaender/statsoft.htm#archiv
http://www.R-project.org
• Now you are qualified to use descriptive statistics!
• Questions?