Lec 7 8

Uploaded by

moumenabuzaid
Copyright
© All Rights Reserved
Data exploration

Content

Measures of central tendency

A- Measures of dispersion

B- Methods of testing data normality in SPSS

1-Frequency distribution curve
2-Boxplot
3-Test of normality

C- Management of outliers

D- Assignments

Measures of central tendency


Mean= the sum of data values divided by the number of data records.
Median= the middle data point (or average of the two middle data
points if there is an even number of observations) when all data
points are lined up in either ascending or descending order.

Mode= The most frequently occurring data point value in a dataset.


Example 1: Calculate the mean, median, and mode for the
following sample dataset:
2, 4, 8, 9, 5, 2, 6, 12

Mean = (2+4+8+9+5+2+6+12)/8 = 48/8 = 6

Median: rank the data: 2, 2, 4, 5, 6, 8, 9, 12. Since n = 8 is even,
the median is the average of the values in positions n/2 and (n/2)+1:
Median = (4th + 5th)/2 = (5+6)/2 = 5.5

Mode = 2. This value occurs twice while the other values occur only once.
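The hand calculation in Example 1 can be cross-checked outside SPSS. The following is a minimal sketch that assumes Python and its standard `statistics` module (neither is part of the original lecture); the variable names are illustrative only.

```python
import statistics

# Sample dataset from Example 1
data = [2, 4, 8, 9, 5, 2, 6, 12]

mean_value = statistics.mean(data)      # sum of values / number of values
median_value = statistics.median(data)  # average of the two middle values (n is even)
mode_value = statistics.mode(data)      # most frequently occurring value

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```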
Measures of dispersion
-Range = the highest measurement minus the lowest measurement in the
dataset.

-Variance = a measure of dispersion around the mean, equal to the
sum of squared deviations from the mean divided by one less than
the number of cases.

-Standard deviation (SD) = the square root of the variance.
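As an illustration of these three definitions, the sketch below applies them to the dataset of Example 1. It assumes Python (not used in the original lecture) and follows the definitions literally, dividing by n - 1 for the variance.

```python
import math

# Sample dataset from Example 1
data = [2, 4, 8, 9, 5, 2, 6, 12]

# Range: highest measurement minus lowest measurement
data_range = max(data) - min(data)

# Variance: sum of squared deviations from the mean, divided by (n - 1)
mean_value = sum(data) / len(data)
sum_squared_dev = sum((x - mean_value) ** 2 for x in data)
variance = sum_squared_dev / (len(data) - 1)

# Standard deviation: square root of the variance
sd = math.sqrt(variance)

print("Range:", data_range)
print("Variance:", round(variance, 3))
print("SD:", round(sd, 3))
```

Here the squared deviations add up to 86, so the variance is 86/7 and the SD is its square root.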


Methods to assess data normality via SPSS
(data exploration)

1-Frequency distribution curve

2-Boxplot

3-Test of normality

Assessment of data normality by frequency distribution curve


• A bell-shaped curve
• Measures of central tendency are equal (mean=median=mode)
• The far left and right tails are symmetrical, with skewness and kurtosis equal to zero
• Data are considered normally distributed if skewness divided by the standard error
of skewness and kurtosis divided by the standard error of kurtosis both give values
between -2 and 2.
• The shape of the frequency distribution curve and its statistics are very sensitive
to sample size, so do not depend on it alone to test data normality.
A- positively skewed distribution (skewness> 0 ):
-data are clustered to the left, with the tail extending to the right (right skewed).
-mean > median > mode
B- negatively skewed distribution (skewness < 0 ):
-data are clustered to the right, with the tail extending to the left (left skewed).
-mode> median > mean
[Figure: positively skewed (right-skewed) curve and negatively skewed (left-skewed) curve]
Positive kurtosis (leptokurtic):
-indicated by a sharp peak.
-observations are more clustered about the center of the distribution, and the tails are heavier (thicker) than those of a normal curve.

Negative kurtosis (platykurtic):
-indicated by a flat distribution.
-observations are less clustered about the center of the distribution, and the tails are lighter (thinner) than those of a normal curve.

[Figure: positive kurtosis (peaked curve) and negative kurtosis (flat curve)]
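The rule of dividing skewness and kurtosis by their standard errors can be sketched in code. The example below assumes Python and uses the common bias-adjusted sample formulas for skewness, kurtosis, and their standard errors; these are assumed here to correspond to the values SPSS reports, which the lecture itself does not state. The function names are hypothetical.

```python
import math

def skewness_and_se(values):
    """Bias-adjusted sample skewness and its standard error (needs n >= 3)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    third = sum((x - mean) ** 3 for x in values)
    skew = n * third / ((n - 1) * (n - 2) * sd ** 3)
    se = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return skew, se

def kurtosis_and_se(values):
    """Bias-adjusted excess kurtosis and its standard error (needs n >= 4)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)
    fourth = sum((x - mean) ** 4 for x in values)
    kurt = (n * (n + 1) * fourth / ((n - 1) * (n - 2) * (n - 3) * var ** 2)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    se = math.sqrt(24 * n * (n - 1) ** 2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
    return kurt, se

def ratios_within_limits(values, limit=2.0):
    """True if both statistic/SE ratios fall between -limit and +limit."""
    skew, se_skew = skewness_and_se(values)
    kurt, se_kurt = kurtosis_and_se(values)
    return abs(skew / se_skew) < limit and abs(kurt / se_kurt) < limit

# Usage with the small dataset of Example 1
sample = [2, 4, 8, 9, 5, 2, 6, 12]
print(skewness_and_se(sample))
print(kurtosis_and_se(sample))
print(ratios_within_limits(sample))
```

The standard errors depend only on the sample size, which is why every variable with the same n shows the same standard error in an SPSS statistics table.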
Steps of obtaining frequency distribution curve in SPSS:

1-Graphs Method:
1. Click “Graphs,” then “Legacy Dialogs,” and select “Histogram.”
2. Identify the relevant variable by moving its name from the box on
the left to the box labeled “Variable.”
3. Click “Display normal curve,” then “OK.”

2-Analyze Method:
1. From the Analyze menu, select “Descriptive Statistics,” then select “Frequencies.”
2. Move the measured variable to the box named “Variable(s).”
3. Select mean, standard deviation, skewness, kurtosis, standard error
of skewness, and standard error of kurtosis, then click “Continue.”
4. Select “Charts,” then “Histograms,” click “Show normal curve on histogram,” click
“Continue,” then “OK.”
Example 2: Using SPSS, find the frequency distribution curve for the marks of 20
students in a Maths exam, then assess data normality, noting that the total score of
the Maths exam is 100:
2,2,2,55,55,59,60,61,61,66,71,72,72,74,93,93,97,99,100,100

Class      Frequency
0-20       3
20-40      0
40-60      3
60-80      8
80-100     4
100-120    2
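The class frequencies in the table can be reproduced with a short tally. This is a sketch in Python (an assumption, since the lecture uses SPSS) in which each class includes its lower limit and excludes its upper limit, which is the convention that matches the table.

```python
# Marks of the 20 students from Example 2
marks = [2, 2, 2, 55, 55, 59, 60, 61, 61, 66,
         71, 72, 72, 74, 93, 93, 97, 99, 100, 100]

class_width = 20
n_classes = 6  # 0-20, 20-40, 40-60, 60-80, 80-100, 100-120

# Each mark falls in the class whose lower limit is (mark // 20) * 20
counts = [0] * n_classes
for mark in marks:
    counts[mark // class_width] += 1

# Print the frequency table with a simple text histogram
for i, count in enumerate(counts):
    lower = i * class_width
    upper = lower + class_width
    print(f"{lower}-{upper}: {count} {'*' * count}")
```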
Interpretation of results
-The frequency distribution curve shows that the left and right tails are not
symmetrical, denoting negative skewness.

-The statistics table shows that skewness divided by the standard error of
skewness is about -2 (at the limit of the accepted range) and kurtosis divided
by the standard error of kurtosis gives a value of 0.49.

-The curve shape and its statistics give signs of a non-normal data
distribution.

-This is confirmed by the other two methods: boxplot and test of
normality.
BOXPLOT
-It is a graphical summary of data that requires few numerical calculations.
-Seven numbers summarize the data of boxplot:
1. lower fence
2. smallest value (minimum)
3. first quartile (Q1)
4. median (second quartile, Q2)
5. third quartile (Q3)
6. greatest value (maximum)
7. upper fence

How to calculate quartiles
(each formula gives the position of the quartile in the ranked data)

               Odd number (n)    Even number (n)
Q1             (n+1)/4           (n+2)/4
Q2 (median)    (n+1)/2           n/2 , (n/2)+1
Q3             3(n+1)/4          (3n+2)/4
Relating a box plot to distribution shape
-For a symmetric distribution, the box plot is also symmetric.
The median is generally in the middle of the box and the
whiskers are approximately equal in length.

-Positively skewed distributions are characterised by a cluster
of data values at the left-hand end of the distribution with a
gradual tailing off to the right.
The median is off-centre and generally to the left. The
left-hand whisker will be short, while the right-hand whisker
will be long, reflecting the gradual tailing off of data values to
the right and their clustering to the left.

-Negatively skewed distributions are characterised by a
clustering of data values at the right-hand side of the
distribution, with a gradual tailing off to the left.
The median is off-centre and generally to the right. The
right-hand whisker will be short, while the left-hand whisker
will be long, reflecting the gradual tailing off of data values to
the left and their clustering to the right.
Positive kurtosis:
-indicated by a peak.
-data are more clustered about the center of the distribution, with heavier (thicker) tails
in the frequency distribution curve and long whiskers in the boxplot.

Negative kurtosis:
-indicated by a flat distribution.
-data are less clustered about the center of the distribution, with lighter (thinner) tails
in the frequency distribution curve and short whiskers in the boxplot.
Example 2: Using the following data set, draw a boxplot representing its
distribution:
13,15,20,18,17,26,19,12,24,28,16,11,25,22,14

Data ranking:
11,12,13,14,15,16,17,18,19,20,22,24,25,26,28
Q1 = the value at position (n+1)/4 = 4th value = 14
Q2 = the value at position (n+1)/2 = 8th value = 18
Q3 = the value at position 3(n+1)/4 = 12th value = 24
Minimum = 11
Maximum = 28
Lower fence = Q1 - 1.5(Q3-Q1) = 14 - 1.5(24-14) = -1
Upper fence = Q3 + 1.5(Q3-Q1) = 24 + 1.5(24-14) = 39
If the minimum and maximum values lie between the
lower and upper fences, no outliers are present.
In this example, the minimum and maximum are 11 and 28,
which lie between -1 and 39.

So no outliers are detected. This is confirmed by the SPSS boxplot (no outliers)
and by non-significant tests of normality, obtained as follows:
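The quartile and fence calculation of this example can be sketched as follows. The code assumes Python and its standard `statistics.quantiles` function, whose default "exclusive" method places quartile i at position i(n+1)/4 in the ranked data; that agrees with the odd-n rule used above, but it is an assumption that it matches every quartile rule in the lecture (the even-n rule differs).

```python
import statistics

# Data set from the boxplot example (n = 15)
data = [13, 15, 20, 18, 17, 26, 19, 12, 24, 28, 16, 11, 25, 22, 14]

# Quartiles at positions (n+1)/4, (n+1)/2 and 3(n+1)/4 of the ranked data
q1, q2, q3 = statistics.quantiles(data, n=4)

# Fences are placed 1.5 interquartile ranges beyond the quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value outside the fences is flagged as an outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, Q2, Q3:", q1, q2, q3)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```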
Steps of obtaining boxplot and test of normality in SPSS
1. Select Analyze --> Descriptive Statistics --> Explore.
2. Move all variables into the “Variable(s)” window.
3. Click “Statistics”, select “Outliers”, then click “Continue”.
4. Click “Plots”, select “Normality plots with tests”, then click “Continue”.
5. Click OK.
Example 3: Fifteen subjects suffering from knee osteoarthritis
volunteered to participate in this study. They were assessed for
isokinetic peak torque value of knee extensors before and 3 months
after an exercise program. Examine the normality of the following
collected data via SPSS:

Pre-training knee ext.    Post-training knee ext.
PT (N.m)                  PT (N.m)
37                        54
22                        43
45                        42
29                        35
28                        57
33                        31
35                        58
40                        49
39                        48
46                        56
43                        48
22                        57
20                        43
47                        56
42                        38
Output of SPSS
Statistics
                           pre training    post training
                           peak torque     peak torque
N    Valid                 15              15
     Missing               0               0
Mean                       35.2000         47.6667
Std. Deviation             9.15891         8.73962
Skewness                   -.422           -.470
Std. Error of Skewness     .580            .580
Kurtosis                   -1.160          -.917
Std. Error of Kurtosis     1.121           1.121
Frequency distribution curve
Tests of Normality

                             Kolmogorov-Smirnov         Shapiro-Wilk
                             Statistic  df   Sig.       Statistic  df   Sig.
pre training peak torque     .128       15   0.20       .924       15   .222
post training peak torque    .166       15   0.20       .917       15   .175
17/02/2015 31
Conclusion: it can be concluded that the data are
normally distributed, based on the following:
1-Frequency distribution curve: the statistics table shows
that skewness divided by the standard error of skewness
and kurtosis divided by the standard error of kurtosis give
values between -2 and 2.

2-Test of normality (p > 0.05).

3-Boxplot: no outliers are detected.


Example 4: The following are three datasets of English test marks at the
beginning of term, mid term and end of term. Explore these data using SPSS and
make a conclusion.

Beginning of term    Mid term    End of term
50.00                44.00       6.00
49.00                32.00       12.00
37.00                2.00        36.00
49.00                50.00       1.00
48.00                4.00        50.00
44.00                44.00       38.00
43.00                49.00       50.00
42.00                50.00       47.00
36.00                50.00       45.00
40.00                44.00       11.00
Output of SPSS
Statistics
                           Beginning of term    Mid term    End of term
N    Valid                 10                   10          10
     Missing               0                    0           0
Skewness                   -.266                -1.465      -.390
Std. Error of Skewness     .687                 .687        .687
Kurtosis                   -1.374               .643        -1.912
Std. Error of Kurtosis     1.334                1.334       1.334
Percentiles    25          39.2500              25.0000     9.7500
               50          43.5000              44.0000     37.0000
               75          49.0000              50.0000     47.7500
Tests of Normality

                     Kolmogorov-Smirnov         Shapiro-Wilk
                     Statistic  df   Sig.       Statistic  df   Sig.
Beginning of term    .194       10   .200*      .915       10   .320
Mid term             .348       10   .001       .708       10   .001
End of term          .227       10   .155       .843       10   .048
Conclusion:
-Concerning the results of beginning of term, it can be
concluded that the data are normally distributed, based on
the following:
1-Frequency distribution curve: the statistics table shows
that skewness divided by the standard error of skewness
and kurtosis divided by the standard error of kurtosis give
values between -2 and 2.

2-Test of normality: Shapiro-Wilk test (p > 0.05).

3-Boxplot: no outliers are detected.


Conclusion:
-Concerning the results of mid term, it can be concluded
that the data are not normally distributed, based on the
following:
1-Frequency distribution curve: the statistics table shows
that skewness divided by the standard error of skewness
gives a value less than -2.

2-Test of normality: Shapiro-Wilk test (p < 0.05).

3-Boxplot: two outliers are detected; their values are 2
and 4.
-Concerning the results of end of term, data exploration
gives contradicting results:
1-Frequency distribution curve: the statistics table shows
that skewness divided by the standard error of skewness
(-.390/.687 ≈ -0.57) and kurtosis divided by the standard
error of kurtosis (-1.912/1.334 ≈ -1.43) give values
between -2 and 2, which suggests normality.

2-Test of normality: Shapiro-Wilk test (p < 0.05), which suggests
non-normality.

3-Boxplot: no outliers are detected.
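The three conclusions of Example 4 apply the same three checks to each dataset. The sketch below, written in Python as an assumption (the lecture works in SPSS), takes the skewness, kurtosis, standard errors and Shapiro-Wilk significance directly from the SPSS tables above and the outlier counts from the conclusions; the function name `explore` and the three verdict labels are hypothetical.

```python
# Values copied from the SPSS output of Example 4;
# outlier counts are those stated in the conclusions.
results = {
    "Beginning of term": {"skew": -0.266, "se_skew": 0.687, "kurt": -1.374,
                          "se_kurt": 1.334, "shapiro_p": 0.320, "outliers": 0},
    "Mid term": {"skew": -1.465, "se_skew": 0.687, "kurt": 0.643,
                 "se_kurt": 1.334, "shapiro_p": 0.001, "outliers": 2},
    "End of term": {"skew": -0.390, "se_skew": 0.687, "kurt": -1.912,
                    "se_kurt": 1.334, "shapiro_p": 0.048, "outliers": 0},
}

def explore(r, limit=2.0, alpha=0.05):
    """Combine the three checks into one verdict for a single variable."""
    ratios_ok = (abs(r["skew"] / r["se_skew"]) < limit
                 and abs(r["kurt"] / r["se_kurt"]) < limit)
    shapiro_ok = r["shapiro_p"] > alpha   # non-significant test
    no_outliers = r["outliers"] == 0      # boxplot shows no outliers
    checks = (ratios_ok, shapiro_ok, no_outliers)
    if all(checks):
        return "normal"
    if not any(checks):
        return "not normal"
    return "contradicting"

verdicts = {name: explore(r) for name, r in results.items()}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```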

How to deal with outliers?

1-Trimming method:
by deleting the entire case in which the outlier is present.

2-Winsorising method:
by replacing the outlier with:
A-the nearest highest or lowest value of the data, according to the outlier
type (uppermost or lowermost).
Or
B-the mean value of the variable data.

N.B: Each method is allowed if outliers represent only 5% of the data.


Example 5: In the following data, 1 and 80 are outliers.
How can you deal with them?
1, 14, 13, 15, 16, 13, 15, 16, 17, 15, 9, 10, 22, 8, 80
Answer
80 and 1 are the uppermost and lowermost values.
By trimming:
The uppermost value, the lowermost value, or both are deleted
until the Shapiro-Wilk test becomes non-significant, for example after deleting 80:
(1, 14, 13, 15, 16, 13, 15, 16, 17, 15, 9, 10, 22, 8)
By winsorising:
1 is replaced by 8, 80 is replaced by 22, or both replacements are
made until the Shapiro-Wilk test becomes non-significant, for example after replacing 80 by 22:
(1, 14, 13, 15, 16, 13, 15, 16, 17, 15, 9, 10, 22, 8, 22)
Before management of outliers, the boxplot shows one circle-shaped outlier (the lowermost value,
which is the 1st value of the data) below the lower limit and one asterisk-shaped outlier (the
uppermost value, which is the 15th value of the data) above the upper limit.
Boxplot after trimming only the uppermost extreme value (80): no outliers
and a non-significant test of normality.
Boxplot after winsorising the uppermost and lowermost extreme values:
80 is replaced with the nearest allowed maximum value of 22 and
1 is replaced with the nearest allowed minimum value of 8 (no outliers
and a non-significant test of normality).
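Trimming and winsorising as used in Example 5 can be sketched in a few lines. The code assumes Python (not part of the original lecture), and the helper names `trim` and `winsorise` are hypothetical. The outliers are passed in explicitly, since in the lecture they are identified from the SPSS boxplot rather than computed.

```python
# Data from Example 5; 1 and 80 were identified as outliers
data = [1, 14, 13, 15, 16, 13, 15, 16, 17, 15, 9, 10, 22, 8, 80]

def trim(values, outliers):
    """Trimming: delete every case whose value is a listed outlier."""
    return [x for x in values if x not in outliers]

def winsorise(values, outliers):
    """Winsorising (option A): replace each outlier with the nearest
    remaining value, i.e. the lowest or highest non-outlier."""
    kept = [x for x in values if x not in outliers]
    lowest, highest = min(kept), max(kept)
    replaced = []
    for x in values:
        if x in outliers and x < lowest:
            replaced.append(lowest)    # lowermost outlier
        elif x in outliers and x > highest:
            replaced.append(highest)   # uppermost outlier
        else:
            replaced.append(x)
    return replaced

trimmed_upper = trim(data, {80})            # delete only the uppermost value
winsorised_both = winsorise(data, {1, 80})  # 1 -> 8 and 80 -> 22

print(trimmed_upper)
print(winsorised_both)
```

After either correction, the data would be explored again in SPSS to check that no outliers remain and that the Shapiro-Wilk test is non-significant.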
Contradicting results of test of normality and boxplot
1- What can you do in the following situation?
The test of normality is non-significant (p > 0.05) but
outliers are detected in the boxplot.

Solution
-Apply the parametric statistical test to the data with and without the outlier
(trimming method) and compare the results:

A-If the results are not the same, the outliers should be managed before the
statistical analysis.
B-If the results are the same, no correction of the data is needed (use the original data).
Contradicting results of test of normality and boxplot

2- What can you do in the following situation?

The test of normality is significant (p < 0.05)
but
no outliers are detected in the boxplot.

Solution
-In this case, the extreme highest or lowest data value must be managed until the
test of normality becomes non-significant.
Converting SPSS file into word or pdf file

-Click file
-Select export
-Select word or pdf document
-Select browse
-Select a location for saving
-Click save and ok
Assignment 1
Exercise 1: Given the following output of data exploration using SPSS,
discuss these results and make a conclusion.

Tests of Normality

Kolmogorov-Smirnov         Shapiro
Statistic  df   Sig.       Statistic  df
.236       10   .120       .871       10

Statistics
VAR00001
N    Valid                 10
     Missing               0
Skewness                   .494
Std. Error of Skewness     .687
Kurtosis                   -.255
Std. Error of Kurtosis     1.334
Percentiles    25          15.2500
               50          17.0000
               75          23.7500
Exercise 2: the following is a data set of test marks of
fifteen students;
(19,22,30,43,29,18,11,17,24,16,100,45,57,44,38)
1-Explore these data using all available tests of SPSS.
2-Write a report about the results.
