
lecture 2 - Advanced Topics (1)

The document provides an overview of data exploration using SPSS, focusing on types of variables (qualitative vs. quantitative), levels of measurement, and methods for exploring categorical and continuous variables. It discusses summary statistics, including measures of central tendency and dispersion, and introduces bivariate analysis to examine relationships between two variables. Additionally, it covers correlation coefficients and the importance of understanding correlation versus causation.


DATA EXPLORATION USING SPSS
Types of variables

• Qualitative: color of your eyes
  {Green, Brown, Black, Blue, …}

Versus

• Quantitative: number of males and females, age, body weight.
  {60, 64, 65.4, 90, 91.7, …}
Numerical vs. Categorical Vars.
• Definition: A numerical variable taking on a continuum of
values is called continuous and one that only takes on a
discrete set of values is called discrete.

• Definition: A categorical variable is ordinal if the


categories can be logically ordered from smallest to
largest in a sense meaningful for the question at hand (we
need to rule out silly orders like alphabetical); otherwise it
is unordered or nominal.
Types of variables
The quantitative variables may be classified into:

• Discrete: number of students in a class
  {24, 36, 50, 7, …} (no decimal points)

Versus

• Continuous: your body weight.
  {60, 64, 65.4, 90, 91.7, …} (may have decimal points)
Levels of Measurement
Levels of measurement, ordered from lowest to highest:

• Nominal: names and classifications are used to divide data into separate and distinct categories.

• Ordinal: measurements that rank observations into categories with a meaningful order.

• Interval: measurements on a numerical scale in which the value of zero is arbitrary but the difference between values is meaningful.

• Ratio: numerical measurements in which zero is a meaningful value and the difference between values is also meaningful.
Part I: Exploring One variable
Contents
• What is data visualization and what are summary statistics?
• Recall: variable types and measurement scales.
• Exploring a categorical variable (tabulation and graphical presentation).
• Exploring a continuous variable (tabulation and graphical presentation).
• All of the above will be presented using SPSS.


What is data exploration?
• This is to develop a high-level understanding of the data,
learn about the possible values for each characteristic,
and find out how a characteristic varies among individuals
in our sample.
• In short, we want to learn about the distribution of
variables.
• Recall that for a variable, the distribution shows the
possible values, the chance of observing those values,
and how often we expect to see them in a random sample
from the population.
Recall: variable types and measurement
scales.
Exploring a categorical variable (tabulation and graphical presentation)
• data_virusC.sav

• Examples of categorical variables ("categorical" variables are those measured on a nominal or ordinal scale):

• First: Nominal
• Marital status: unknown, single, married, divorced, widowed, separated, marriage contract signed (katb kitab). (B_REC_1)
• Second: Ordinal
• Educational status: unknown, illiterate, can read/write, primary/preparatory, secondary or equivalent, university and above. (B_REC_2)
First: Exploring One Categorical Variable Using SPSS
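The SPSS steps appear only as screenshots in the original slides. As a rough stand-in, the following is a minimal Python sketch (assumed, not part of the lecture) of a one-way frequency table for a made-up marital-status variable; the column name B_REC_1 echoes the lecture's example but the data are invented.

```python
# A minimal sketch (assumed, not the lecture's SPSS output): a one-way
# frequency table for a categorical variable; the marital-status values and
# the column name B_REC_1 are made up for illustration.
import pandas as pd

df = pd.DataFrame({"B_REC_1": ["married", "single", "married", "divorced",
                               "married", "widowed", "single", "married"]})

counts = df["B_REC_1"].value_counts()                          # absolute frequencies
percents = df["B_REC_1"].value_counts(normalize=True) * 100    # relative frequencies (%)

print(pd.DataFrame({"Frequency": counts, "Percent": percents.round(1)}))
```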
Exploring a continuous variable (tabulation and graphical presentation)
• data_virusC.sav

• Examples of continuous variables ("continuous" variables are those measured on an interval or ratio scale):

• Respondent's age in completed years. (A_REC_20)
• For how many years would you say you have smoked on a daily basis? (B_REC_5)
• How many years ago did you find out that you have hepatitis C (virus C)? (C_REC_1)
First: Exploring One Continuous Variable Using SPSS
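Again, the SPSS procedure is shown as screenshots in the slides. A minimal Python sketch (assumed, not from the lecture) of the same idea, using a made-up age variable standing in for A_REC_20:

```python
# A minimal sketch (assumed, not the lecture's SPSS output): numerical and
# graphical exploration of one continuous variable, here a made-up age
# variable standing in for A_REC_20 (age in completed years).
import numpy as np
import matplotlib.pyplot as plt

age = np.array([25, 31, 44, 52, 47, 60, 38, 29, 55, 41, 63, 49])

print("n =", age.size)
print("mean =", round(age.mean(), 1), " median =", np.median(age))

# Histogram: the basic graphical presentation of a continuous variable.
plt.hist(age, bins=5, edgecolor="black")
plt.xlabel("Age in completed years")
plt.ylabel("Frequency")
plt.show()
```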
Shapes of Histogram
Part II: Comparing the distribution of one
variable between sample subgroups
Contents
• Exploring a categorical variable between sample subgroups (tabulation and graphical presentation).

• Exploring a continuous variable between sample subgroups (tabulation and graphical presentation).

• All of the above will be presented using SPSS.


First: Comparing the distribution of one Categorical
Variable between sample subgroups Using SPSS
• Education by Sex
First: Comparing the distribution of one Continuous
Variable between sample subgroups Using SPSS
No. of years as a smoker by sex
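The SPSS output for these comparisons is shown as screenshots in the slides. The following is a minimal Python sketch (assumed, not part of the lecture) of the same two comparisons on made-up data; "sex", "education", and "years_smoking" are hypothetical column names.

```python
# A minimal sketch (assumed, not the lecture's SPSS output) of comparing one
# variable between subgroups; "sex", "education", and "years_smoking" are
# hypothetical column names and the data are made up.
import pandas as pd

df = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "education": ["primary", "secondary", "university", "primary",
                  "secondary", "university", "primary", "secondary"],
    "years_smoking": [10, 0, 15, 2, 20, 0, 8, 1],
})

# Categorical variable by subgroup: a two-way frequency (contingency) table.
print(pd.crosstab(df["education"], df["sex"]))

# Continuous variable by subgroup: summary statistics of years smoked per sex.
print(df.groupby("sex")["years_smoking"].describe())
```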
Part III: Understanding Summary
Statistics
Contents
• Defining the meaning of the main summary measures.

• Understanding the mathematical formulas used to calculate the most important summary measures: the mean and the standard deviation.

• Identifying the appropriate summary measure with respect to the type and the measurement scale of a variable.

• How the relationships between the different summary measures shape the variable's distribution in the sample.
Summary Measures: Definition / Meaning

Central Tendency Measures
• Mean: the average of the observed values.
• Median: the number at the middle of the sorted observations.
• Mode: the most frequent value (the class with the highest frequency).
• Quartiles / Percentiles: generalizations of the median; for example, the first quartile is the value below which 25% of the sorted observations fall, and the 10th percentile the value below which 10% fall.

Dispersion Measures
• Range: the difference between the minimum and maximum values.
• Variance: the sample variance is a common measure of dispersion based on the squared deviations from the mean.
• Standard Deviation: the square root of the variance.
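For reference, a standard way to write the mean, sample variance, and standard deviation referred to above (using the usual n − 1 denominator for the sample variance) is:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,\qquad
s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i - \bar{x}\bigr)^{2} ,\qquad
s = \sqrt{s^{2}}
```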
Other Important Measures
• Inter-Quartile Range: the range is the difference between the maximum observed value and the minimum observed value. The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1). Compared to the range, the IQR is less sensitive to outliers, which usually fall below Q1 or above Q3.

• Coefficient of Variation: to quantify dispersion independently of the units of measurement, we use the coefficient of variation, which is the standard deviation divided by the sample mean (assuming that the mean is a positive number).
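Since the corresponding SPSS output appears only as screenshots in the slides, here is a minimal Python sketch (assumed, not part of the lecture) that computes each of the measures defined above for a small made-up body-weight variable:

```python
# A minimal sketch (assumed, not part of the lecture) computing the summary
# measures defined above for a small made-up body-weight variable (kg).
import numpy as np

x = np.array([60.0, 64.0, 65.4, 90.0, 91.7, 72.3, 68.9, 77.5])

mean = x.mean()
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])   # first and third quartiles
data_range = x.max() - x.min()        # range
variance = x.var(ddof=1)              # sample variance (n - 1 in the denominator)
sd = x.std(ddof=1)                    # standard deviation = sqrt(variance)
iqr = q3 - q1                         # inter-quartile range
cv = sd / mean                        # coefficient of variation (mean must be positive)

print(f"mean={mean:.2f}  median={median:.2f}  range={data_range:.2f}")
print(f"variance={variance:.2f}  sd={sd:.2f}  IQR={iqr:.2f}  CV={cv:.3f}")
```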
Appropriate summary measures by the measurement scale of the variable:

Summary Measures             Nominal   Ordinal   Continuous

Central Tendency Measures
Mean                            -         -          ✓
Median                          -         ✓          ✓
Mode                            ✓         ✓          ✓
Quartiles / Percentiles         -         ✓          ✓

Dispersion Measures
Range                           -         -          ✓
Variance                        -         -          ✓
Standard Deviation              -         -          ✓

Other Important Measures
Inter-Quartile Range            -         -          ✓
Coefficient of Variation        -         -          ✓
Relationship between sample mean and median
• In a roughly symmetric distribution the mean and the median are close; in a right-skewed distribution the mean is pulled above the median, and in a left-skewed distribution it falls below the median.

Boxplots: to visualize the five-number summary, the range, and the IQR, we often use a boxplot (a.k.a. box-and-whisker plot).
Part IV: Relationship between two
variables
Why do we conduct bivariate analysis?
• Bivariate refers to analyzing TWO variables, as the name implies.

• In simple terms, when conducting a bivariate analysis, we are using statistics to explore the relationship between two variables and to quantify it.
The objective of this
Lecture
• In other words, we use statistics to determine the strength
and the significance of the relationship between the two
variables
Examining Relationships Between Two
Variables (Bivariate).
• First we have to draw a scatter diagram.

• The scatter diagram gives an indication of the strength and direction of correlation (a coded sketch follows this list):
  • linear or nonlinear relationship
  • correlation or no correlation
  • strong or weak
  • direction (positive or negative)
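The example scatter diagrams themselves are figures in the slides; as a rough stand-in, here is a minimal Python sketch (assumed, not from the lecture) that draws a scatter diagram for two made-up variables with a positive, roughly linear relationship:

```python
# A minimal sketch (assumed, not from the lecture) of a scatter diagram for
# two made-up variables with a positive, roughly linear relationship.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 100)
y = 0.8 * x + rng.normal(0, 5, 100)   # linear trend plus noise

plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.title(f"Pearson r = {np.corrcoef(x, y)[0, 1]:.2f}")  # numerical check of the visual impression
plt.show()
```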
Examples of Scatter Diagrams of Linear
Relationship
Take care with outliers, as they can produce misleading results
Examples of Scatter Diagrams of
Nonlinear Relationships

• In this course, we focus on linear relationships only.
Testing the independence between two variables (the famous chi-square test, χ²)

Assumptions
• Frequencies represent individual counts.
• Categories are exhaustive and mutually exclusive.

Rationale
• Test of independence between the row and column variables:
• Compare the observed cell counts to the expected cell counts under the assumption of independence.

Validity
• Expected cell counts > 5 (or use Yates' correction).
Conducting Chi-Square Analysis
1) State the null hypothesis:
   H0: the variables are independent (there is no correlation).

2) Determine the expected frequencies.

3) Create a table with the observed and expected frequencies (the O's and E's).

4) Find the degrees of freedom: (c − 1)(r − 1).

5) Find the critical chi-square value in the chi-square distribution table; for example, if we have α = 0.05 and r = 2.
Conducting Chi-Square Analysis

• Calculate χ² = Σ (O − E)² / E.

• If the tabulated (critical) chi-square value is greater than your calculated value χ² = Σ (O − E)² / E, you do not reject your null hypothesis, and vice versa. Failing to reject means that the two variables are independent; in other words, there is no correlation between the two variables.
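As a minimal sketch (assumed; the lecture itself uses SPSS), the same test can be run in Python with SciPy on a made-up 2x2 table of counts:

```python
# A minimal sketch (assumed; the lecture uses SPSS) of the chi-square test of
# independence on a made-up 2x2 table of counts (rows = sex, columns = smoker).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 20],
                     [15, 35]])

# Yates' continuity correction is applied by default for 2x2 tables.
chi2, p, dof, expected = chi2_contingency(observed)

print(f"chi-square = {chi2:.2f}, df = {dof}, p-value = {p:.4f}")
print("expected counts under independence:\n", expected.round(1))
# A small p-value (e.g., < 0.05) leads to rejecting H0: the variables are associated.
```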
General note on Chi Square Statistics
• Requires large samples.
• The chi-square statistic is sensitive to increases in sample size: increasing the sample size increases chi-square even if the association stays the same, which can yield misleading results.
• It ignores the ordering information when the variables are ordinal in nature, so it is less powerful for testing ordinal variables.
Correlation
• A correlation coefficient is a numerical summary of the
type and strength of a relationship between variables.

• Bivariate correlation considers the association (linear relationship) between two variables.
• Partial correlation considers the association between two variables while controlling for other variable(s).
• Multiple correlation considers the association between more than two variables.
First: Bivariate Correlation

• If we want to test the association between two nominal variables, we may use either:

• Symmetric measures, which indicate whether or not there is a correlation, such as:
  • the phi coefficient (if the two variables are dichotomous)
  • the contingency coefficient
  • Cramer's V

• Directional measures, which determine which variable is the dependent one, or in other words indicate the direction of the relationship between the variables, such as:
  • lambda
  • the uncertainty coefficient

• It should be noted that most of these measures are functions of the chi-square statistic, so we should pay considerable attention to the chi-square statistic's assumptions and validity requirements.
• The phi coefficient (if the two variables are dichotomous):

  φ = √(χ² / n), where
  • χ² is the calculated chi-square value
  • n is the sample size (number of observations)

• The contingency coefficient:

  C = √(χ² / (n + χ²)), where
  • χ² is the calculated chi-square value
  • n is the sample size (number of observations)

• Cramer's V:

  V = √(χ² / (n (k − 1))), where
  • χ² is the calculated chi-square value
  • n is the sample size (number of observations)
  • k: to find the value of k in the formula, look at the number of possible values of each variable (the number of rows and columns in the data matrix); the smaller of the two numbers is used as k.
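A minimal Python sketch (assumed, not from the lecture) that computes these three chi-square-based measures directly from the formulas above, for a made-up contingency table:

```python
# A minimal sketch (assumed, not from the lecture) computing phi, the
# contingency coefficient, and Cramer's V directly from the formulas above,
# for a made-up contingency table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 20],
                  [15, 35]])

# correction=False so the chi-square value matches the plain (O - E)^2 / E formula.
chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
k = min(table.shape)                        # smaller of the number of rows and columns

phi = np.sqrt(chi2 / n)                     # phi coefficient (2x2 table)
c = np.sqrt(chi2 / (n + chi2))              # contingency coefficient
cramers_v = np.sqrt(chi2 / (n * (k - 1)))   # Cramer's V

print(f"phi = {phi:.3f}, C = {c:.3f}, V = {cramers_v:.3f}")
```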
• Uncertainty Coefficient
Two Ordinal Variables and Small Samples

• We may use a nonparametric correlation coefficient such as:
  • Gamma (no adjustment for table size or ties)
  • Kendall's tau (adjusted for ties)
    • tau-b (for square tables)
    • tau-c (for rectangular tables)
  • Spearman's rho, which is the simplest and may be calculated without a computer.
• If we want to test the association between two ordinal variables when we have large samples, or between two interval/ratio variables, we may use the most famous parametric correlation coefficient:

  Pearson's correlation coefficient (r)
• If we have a nominal (more specifically, dichotomous) variable and an ordinal variable, we may use the rank-biserial correlation.

• If one variable is nominal (more specifically, dichotomous) and the other is continuous (interval/ratio level), we may use the point-biserial correlation.
Where rxy: Pearson's correlation coefficient, rs: Spearman's rho coefficient, C: contingency coefficient, Φc: Cramer's phi (Cramer's V), rΦ: phi coefficient.
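A minimal sketch (assumed; the lecture obtains these coefficients from SPSS menus) of several of the correlation coefficients named above, computed with SciPy on made-up data:

```python
# A minimal sketch (assumed; the lecture obtains these from SPSS menus) of the
# correlation coefficients named above, computed with SciPy on made-up data.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau, pointbiserialr

x = np.array([25, 31, 44, 52, 47, 60, 38, 29])   # interval/ratio variable
y = np.array([5, 8, 12, 20, 18, 25, 10, 7])      # interval/ratio variable
group = np.array([0, 0, 1, 1, 1, 1, 0, 0])       # dichotomous variable (e.g., sex)

print("Pearson r:      ", pearsonr(x, y))            # two interval/ratio variables
print("Spearman rho:   ", spearmanr(x, y))           # two ordinal variables
print("Kendall tau-b:  ", kendalltau(x, y))          # ordinal, adjusted for ties
print("Point-biserial: ", pointbiserialr(group, y))  # dichotomous vs. continuous
```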
Second: Partial Correlation Coefficient
• A partial correlation explains the relationship between two variables while statistically controlling for the influence of one or more other variables (sometimes called effects analysis or elaboration).

• A partial correlation coefficient takes the form rab.c = +/-x, read as "the partial correlation between variable a and variable b when controlling for c is ...".
Second: Partial Correlation Coefficient
• A first-order partial correlation controls for one other variable; a higher-order partial correlation controls for two or more variables; a zero-order (bivariate) correlation is a correlation between two variables with no variable being controlled.

• A semi-partial correlation partials out a variable from one of the other variables being correlated. A semi-partial correlation coefficient takes the form ra(b.c) = +/-x, read as "the semi-partial correlation coefficient of variables a and b after variable c has been partialled out of variable b is ...".
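A minimal Python sketch (assumed, not from the lecture) of a first-order partial correlation rab.c, computed from the three pairwise Pearson correlations on made-up data:

```python
# A minimal sketch (assumed, not from the lecture) of a first-order partial
# correlation r_ab.c computed from the three pairwise Pearson correlations.
import numpy as np

rng = np.random.default_rng(1)
c = rng.normal(size=200)                 # control variable
a = 0.6 * c + rng.normal(size=200)       # both a and b depend on c
b = 0.5 * c + rng.normal(size=200)

r_ab = np.corrcoef(a, b)[0, 1]
r_ac = np.corrcoef(a, c)[0, 1]
r_bc = np.corrcoef(b, c)[0, 1]

# Partial correlation of a and b controlling for c.
r_ab_c = (r_ab - r_ac * r_bc) / np.sqrt((1 - r_ac**2) * (1 - r_bc**2))
print(f"r_ab = {r_ab:.3f}, r_ab.c = {r_ab_c:.3f}")   # shrinks toward 0 once c is controlled
```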
Third: Multiple Correlation Coefficient
• A multiple correlation is computed when researchers
want to assess the relationship between the variable they
wish to explain, the criterion variable, and two or more
other independent variables working together; and the
procedure yields two types of statistics:
1. A multiple correlation coefficient is just like a correlation
coefficient, except that it tells researchers how two or more
variables working together are related to the criterion variable of
interest.
Third: Multiple Correlation Coefficient
2. A multiple correlation coefficient indicates both the direction and the strength of the relationship between a criterion variable and the other variables.

• It takes the form Ra.bc = +/-x, read as "the multiple correlation of variables b and c with variable a (the criterion variable) is ...".
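A minimal sketch (assumed, not from the lecture): the multiple correlation R of a criterion variable a with predictors b and c can be obtained as the square root of R² from the least-squares regression of a on b and c, here with made-up data:

```python
# A minimal sketch (assumed, not from the lecture): the multiple correlation R
# of a criterion variable a with predictors b and c is the square root of R^2
# from the least-squares regression of a on b and c (made-up data).
import numpy as np

rng = np.random.default_rng(2)
b = rng.normal(size=200)
c = rng.normal(size=200)
a = 1.0 + 0.7 * b + 0.4 * c + rng.normal(size=200)   # criterion variable

X = np.column_stack([np.ones_like(b), b, c])          # design matrix with an intercept
coef, *_ = np.linalg.lstsq(X, a, rcond=None)
a_hat = X @ coef                                      # fitted values

r_squared = 1 - np.sum((a - a_hat) ** 2) / np.sum((a - a.mean()) ** 2)
multiple_r = np.sqrt(r_squared)
print(f"R^2 = {r_squared:.3f}, multiple R = {multiple_r:.3f}")
```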
Correlation and causation
• Knowing that two variables, X and Y are correlated, does
not provide any information on how they are related to
each other. The correlation could be a result of:
1. Common response: both variables X and Y respond to changes in some unobserved variable(s) (often called lurking variables).
2. Confounding: X’s effect on Y is hopelessly mixed up
with another unobserved variable’s effect on Y.
3. X causes Y: the order of events has to be clear. Usually, a valid causal conclusion can only be based on controlled experiments.
• Therefore, we have to identify precisely the effect of x (or the x's) on y while controlling for the effect of other variables (whether observed or unobserved).

• This can be done by developing an appropriate regression model.
