
Data Analysis Training Workshop - Day 2 Presentation

The document outlines a virtual Data Analysis Training Workshop scheduled for June 24-26, 2024, presented by Dr. Reesha Kara, covering topics such as MS Excel, descriptive statistics, and hypothesis testing. It includes detailed sections on variable construction, coding and labeling, measures of central tendency, standard deviation, and various statistical tests. The workshop aims to equip participants with essential data analysis skills and understanding of hypothesis testing methodologies.

Institute of Social and Economic Research, Rhodes University

Data Analysis Training Workshop

24-26 June 2024
Day 2
Virtual workshop

Presented by Dr Reesha Kara


Structure of the day
Section A: Introduction to MS Excel
1. Variable Construction
2. Coding and Labelling
Section B: Descriptive Statistics
1. Measures of Central Tendency
2. Standard Deviation
3. PivotTables
Section C: Introduction to Hypothesis Testing
1. What is a Hypothesis and Hypothesis Testing?
2. Definitions
3. Tests of Significance
4. Statistical Tests used in Hypothesis Testing
5. Understanding the p-value
6. One-Sample T-Test
7. Two-Sample T-Test
8. Chi-Square Test of Independence
Section A: Introduction to MS Excel
1. Variable Construction
2. Coding and Labelling of Variables
1. Variable Construction
1. Variable construction is important when you need to create derived
variables from your existing data to answer a research question

2. Using dataset 1 in the ‘example1.xlsx’ data file:


a. Create a variable that represents the average score of test 1 and test 2 for each
student.
b. Ensure that the variable is labelled correctly.
c. Round off the average to the nearest whole number.
d. Reorder the data in descending order (highest to lowest)
e. Identify the first 5 students with the lowest average score from test 1 and test 2
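For readers who want to check their spreadsheet work outside Excel, steps a to e can be sketched in Python. The student names and scores below are hypothetical placeholders, not the contents of ‘example1.xlsx’, and the variable name avg_test1_test2 is only an illustration of a clear label. Python's built-in round() sends halves to the nearest even number, so the sketch uses a small helper that rounds halves up, which matches Excel's ROUND for positive values.

```python
# Hypothetical scores; the real values are in dataset 1 of the workshop file.
students = [
    {"name": "A", "test1": 62, "test2": 71},
    {"name": "B", "test1": 45, "test2": 50},
    {"name": "C", "test1": 80, "test2": 90},
    {"name": "D", "test1": 55, "test2": 58},
    {"name": "E", "test1": 70, "test2": 66},
    {"name": "F", "test1": 38, "test2": 41},
    {"name": "G", "test1": 90, "test2": 95},
]


def round_half_up(value):
    """Round a non-negative number to the nearest whole number, halves up."""
    return int(value + 0.5)


# Steps a to c: derive a clearly labelled variable and round it.
for student in students:
    raw_average = (student["test1"] + student["test2"]) / 2
    student["avg_test1_test2"] = round_half_up(raw_average)

# Step d: reorder from highest to lowest average.
ranked = sorted(students, key=lambda s: s["avg_test1_test2"], reverse=True)

# Step e: after a descending sort, the five lowest averages are the last five rows.
lowest_five = ranked[-5:]

for student in ranked:
    print(student["name"], student["avg_test1_test2"])
```

Because the data is sorted from highest to lowest, the lowest scores sit at the bottom of the list, which is the same place to look in the sorted Excel sheet.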
2. Coding and Labelling Variables
1. One way is to generate a new variable, as already done under Variable Construction
2. A different way is to define a name:
a. Assign a value to a labelled variable
b. Can be used for mathematical functions
c. Can be used for coding data
2. Coding and Labelling Variables
1. Click on the Formulas tab on the ribbon
2. Select Define Name
3. A pop out box headed ‘New Name’ will appear on the screen
2. Coding and Labelling Variables

Name – variable name
Scope – which sheet or workbook you want the variable to appear in
Comment – details about the variable
Refers to – the value that you would like to assign to the variable
2. Coding and Labelling Variables

Name Manager allows you to edit the variables by:
• renaming variables
• changing the values of variables
• deleting variables
• creating new variables
Exercise: Coding and Labelling Variables
1. Using dataset 1 from the ‘example1.xlsx’ data file:
a. Create a new variable called gender
b. Assign the code 1 to all males
c. Assign the code 2 to all females
d. What is the total number of males in the class?
e. What is the total number of females in the class?
f. Create a new variable called race
g. Assign codes as follows:
• 1 – Black
• 2 – Coloured
• 3 – Indian
• 4 – White
h. What is the racial distribution of the class?
i. Create a variable to represent the average score of each test
j. What was the overall average score for this exam period?
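The coding and counting parts of this exercise (a to h) can also be sketched in Python. The class list below is hypothetical and much shorter than the workshop dataset; only the code values (1 and 2 for gender, 1 to 4 for race) come from the exercise.

```python
from collections import Counter

# Code books taken from the exercise.
GENDER_CODES = {"male": 1, "female": 2}
RACE_CODES = {"Black": 1, "Coloured": 2, "Indian": 3, "White": 4}

# Hypothetical class list of (gender, race) labels.
class_list = [
    ("male", "Black"),
    ("female", "White"),
    ("female", "Black"),
    ("male", "Indian"),
    ("female", "Coloured"),
    ("female", "Black"),
]

# Replace each text label with its numeric code.
gender = [GENDER_CODES[g] for g, _ in class_list]
race = [RACE_CODES[r] for _, r in class_list]

# Count how often each code occurs (the distribution of the coded variable).
gender_counts = Counter(gender)
race_counts = Counter(race)

print("Males:", gender_counts[1], "Females:", gender_counts[2])
print("Race distribution:", dict(sorted(race_counts.items())))
```

Keeping the code book in one place, as the two dictionaries do here, makes it easy to translate the numeric codes back into labels when reporting results.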
Section B: Descriptive Statistics
1. Measures of Central Tendency
2. Standard Deviation
3. PivotTables
1. Measures of central tendency
A. Mean (population mean and sample mean)
B. Median
C. Mode

1. Measures of central tendency are useful in describing data and aid in understanding the distribution of the data
2. They are also referred to as measures of location as they pinpoint the centre of the distribution of the data
A. The Mean
1. The Mean is the most widely used measure of location
2. It is calculated by summing the values and dividing by the number
of values

$$\bar{X} = \frac{\sum X}{n}$$
Example
A household consists of five members with the following ages: 45, 49, 25, 19, 11

What is the average age of household members?

$$\bar{X} = \frac{\sum X}{n} = \frac{45+49+25+19+11}{5} = \frac{149}{5} = 29.8$$
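The same calculation can be checked in a few lines of Python. This is only a sketch of the arithmetic in the household example; the variable names are illustrative.

```python
from statistics import mean

ages = [45, 49, 25, 19, 11]

# Sum the values and divide by the number of values.
total = sum(ages)
n = len(ages)
average_age = total / n

print(total, n, average_age)

# The standard library gives the same answer.
library_average = mean(ages)
```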
A. Characteristics of the Mean
1. Every set of interval-level and ratio-level data has a mean
2. All the values are included in computing the mean
3. A set of data has a unique mean
4. The mean is affected by unusually large or small data values
B. The Median
1. The Median is the midpoint of the values after they have been
ordered from the smallest to the largest
2. There are as many values above the median as below it in the data
array
3. For an even set of values, the median will be the arithmetic average
of the two middle numbers
Examples
Example 1
1. The ages for a sample of five students are: 21, 25, 19, 20, 22
2. Arranging the data in ascending order gives: 19, 20, 21, 22, 25.
3. Thus the median is 21.

Example 2
1. The heights of four basketball players, in inches, are: 76, 73, 80, 75
2. Arranging the data in ascending order gives: 73, 75, 76, 80.
3. Thus the median is 75.5
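Both examples can be reproduced with Python's standard library, which sorts the values and averages the two middle numbers when the count is even. The variable names are illustrative.

```python
from statistics import median

student_ages = [21, 25, 19, 20, 22]   # odd number of values
player_heights = [76, 73, 80, 75]     # even number of values

print(sorted(student_ages))
print(sorted(player_heights))

median_age = median(student_ages)        # middle value of the ordered data
median_height = median(player_heights)   # average of the two middle values
```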
B. Characteristics of the Median
1. There is a unique median for each data set
2. It is not affected by extremely large or small values and is therefore
a valuable measure of central tendency when such values occur
3. It can be computed for ratio-level, interval-level, and ordinal-level
data
C. The Mode
1. The mode is the value of the observation that appears the most
frequently

Example: The exam scores for ten students are:
81, 93, 84, 75, 68, 87, 81, 75, 81, 87
Rearranging the scores in ascending order gives:
68, 75, 75, 81, 81, 81, 84, 87, 87, 93
Because the score of 81 occurs the most often, it is the mode.
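As a quick check of the example, the frequency of each score can be counted in Python. The sketch uses only the ten scores from the slide.

```python
from collections import Counter
from statistics import mode

scores = [81, 93, 84, 75, 68, 87, 81, 75, 81, 87]

# Count how many times each score appears.
frequencies = Counter(scores)
most_common_score, occurrences = frequencies.most_common(1)[0]

# statistics.mode returns the most frequent value directly.
exam_mode = mode(scores)

print(sorted(scores))
print(most_common_score, occurrences)
```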
1. Symmetric distribution
Zero skewness: mode = median = mean

2. Right skewed distribution
Positively skewed: Mean and Median are to the right of the Mode
Mode < Median < Mean

3. Left skewed distribution
Negatively skewed: Mean and Median are to the left of the Mode
Mean < Median < Mode
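The ordering of the three measures can be illustrated with two small made-up datasets, one with a long right tail and one with a long left tail. The numbers are hypothetical and chosen only to make the pattern visible; the ordering is a rule of thumb and real data will not always follow it exactly.

```python
from statistics import mean, median, mode

# Hypothetical data: one large value pulls the mean to the right.
right_skewed = [2, 2, 2, 3, 4, 5, 6, 20]
# Hypothetical data: one small value pulls the mean to the left.
left_skewed = [1, 15, 16, 17, 18, 19, 19, 19]


def location_summary(values):
    """Return (mode, median, mean) for a list of numbers."""
    return mode(values), median(values), mean(values)


right_mode, right_median, right_mean = location_summary(right_skewed)
left_mode, left_median, left_mean = location_summary(left_skewed)

print("Right skewed:", right_mode, right_median, right_mean)
print("Left skewed:", left_mode, left_median, left_mean)
```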
2. Standard Deviation
• Measure of dispersion
• Details the spread of the observed values around the mean
• Used in conjunction with the mean to summarise continuous data
• The standard deviation is influenced by the presence of outliers in a
dataset

Low standard deviation = the data is closely dispersed around the mean
High standard deviation = observations are widely dispersed around the mean, estimates are usually unreliable

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

Where,
s = sample standard deviation
Σ = sum of...
X = each score in the sample
X̄ = sample mean
n = number of scores in sample
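To make the sample standard deviation formula concrete, the sketch below applies it step by step to the five household ages used in the mean example (45, 49, 25, 19, 11) and compares the result with Python's built-in stdev, which also divides by n − 1. Reusing those ages here is a choice made for this illustration, not part of the original slide.

```python
import math
from statistics import stdev

ages = [45, 49, 25, 19, 11]

n = len(ages)
x_bar = sum(ages) / n                                  # sample mean
squared_deviations = [(x - x_bar) ** 2 for x in ages]  # (X - mean)^2 for each value
sample_variance = sum(squared_deviations) / (n - 1)    # divide by n - 1, not n
s = math.sqrt(sample_variance)                         # sample standard deviation

print(round(x_bar, 1), round(sample_variance, 1), round(s, 2))

# statistics.stdev uses the same n - 1 formula.
library_s = stdev(ages)
```

In Excel, the function that matches this n − 1 formula is STDEV.S.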
3. PivotTables
1. Summarise, sort, reorganise, group, count, total or average data
2. Cross-tabulations
3. Make comparisons and identify patterns and trends
4. Flexible table development
a. Transform rows into columns and columns into rows
b. Group data by different fields
3. Steps to build a PivotTable
1. Select the cells you want to include in the PivotTable
2. Select the Insert tab on the ribbon and select PivotTable
3. A pop up box will appear, under Choose the data that you want to
analyze, select Select a table or range
4. In Table/Range, verify the cell range.
5. Under Choose where you want the PivotTable report to be placed,
select New worksheet to place the PivotTable in a new worksheet
or Existing worksheet and then select the location you want the
PivotTable to appear.
6. Select OK
3. Steps to build a PivotTable
7. To add a field to your PivotTable, select the field name checkbox in
the PivotTable Fields pane.
8. To move a field from one area to another, drag the field to the
target area.
9. Selected fields are added to their default areas:
a. Non-numeric fields are added to rows
b. Date and time hierarchies are added to columns
c. Numeric fields are added to values
Examples
Using data from the ‘Example 2.xlsx’ data file:

1. Make a PivotTable showing the average weight of dogs by food type
2. Create a PivotTable showing the average height of each dog by dog
type and brand of food
3. Develop a PivotTable detailing the average weight of dogs by food
brand
4. Create a PivotTable showing the average age and weight of dogs by
food brand
5. Provide the maximum and minimum age of dogs by food type
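What a PivotTable does behind the scenes is group rows by one field and then summarise another field within each group. A minimal Python sketch of that idea is shown below. The rows, column names (dog_type, food_brand, weight_kg, age) and values are all hypothetical stand-ins for the workshop file, and pivot is an illustrative helper, not an Excel or library function.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows; the real data is in the workshop's Example 2 file.
dogs = [
    {"dog_type": "Labrador", "food_brand": "BrandA", "weight_kg": 30.0, "age": 4},
    {"dog_type": "Labrador", "food_brand": "BrandB", "weight_kg": 34.0, "age": 6},
    {"dog_type": "Poodle", "food_brand": "BrandA", "weight_kg": 20.0, "age": 3},
    {"dog_type": "Poodle", "food_brand": "BrandB", "weight_kg": 24.0, "age": 5},
    {"dog_type": "Beagle", "food_brand": "BrandA", "weight_kg": 10.0, "age": 2},
]


def pivot(rows, row_field, value_field, summarise=mean):
    """Group rows by row_field and summarise value_field within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[row_field]].append(row[value_field])
    return {key: summarise(values) for key, values in groups.items()}


average_weight_by_brand = pivot(dogs, "food_brand", "weight_kg")
max_age_by_brand = pivot(dogs, "food_brand", "age", summarise=max)
min_age_by_brand = pivot(dogs, "food_brand", "age", summarise=min)

print(average_weight_by_brand)
print(max_age_by_brand, min_age_by_brand)
```

Changing the summarise argument plays the same role as changing "Summarize Values By" from Average to Max or Min in the PivotTable value field settings.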
Section C: Introduction to Hypothesis Testing
1. What is Hypothesis Testing?
2. Definitions
3. Tests of Significance
4. Statistical Tests used in Hypothesis Testing
5. Understanding p-values
6. One-Sample T-Test
7. Two-Sample T-Test
8. Chi-Square Test of Independence
1. What is a Hypothesis and Hypothesis Testing?
• A Hypothesis is a statement about an expectation or prediction that
will be tested by research
• Tentative statement about the relationship between two or more
variables
• It is developed for the specific purpose of testing and is informed by
the body of existing literature
• Hypothesis testing is a procedure used to determine whether the
hypothesis is a reasonable statement and should not be rejected, or is
unreasonable and should be rejected.
Example of a Hypothesis
A study designed to look at the relationship between sleep deprivation
and exam performance might have a hypothesis that states:

This study is designed to assess the hypothesis that sleep-deprived
people will perform worse on a test than individuals who are not
sleep-deprived
2. Definitions
1. A Null Hypothesis H0 : A statement that will be tested
2. An Alternative Hypothesis H1 : A statement that is accepted if the
sample data provides evidence that the null hypothesis is false
3. A Level of Significance: The probability of rejecting the null
hypothesis when it is actually true
a. Denoted as α (alpha)
b. 0.05 level of significance is the accepted standard in social science research
2. Definitions
4. A Type I Error: Rejecting the null hypothesis when it is actually true.
5. A Type II Error: Failing to reject the null hypothesis when it is
actually false.
6. A Test statistic: A value, determined from sample information, used
to determine whether or not to reject the null hypothesis.
7. A Critical value: (Rejection region) The dividing point between the
region where the null hypothesis is rejected and the region where it
is not rejected.
3. One-Tailed Test of Significance
A test is one-tailed when the alternate hypothesis, H1, states a direction

Example:
• H0 : There is no difference in the average height of males and females
• H1 : Males have a higher average height compared to females OR on average,
males are taller than females
3. Two-Tailed Test of Significance
A test is two-tailed when no direction is specified in the alternate
hypothesis H1 , such as:

Example:
• H0 : The mean amount spent by customers at Pick ‘n Pay in Grahamstown on
any day of the week is equal to R450.00
• H1 : The mean amount spent by customers at Pick ‘n Pay in Grahamstown on
any day of the week is not equal to R450.00 (µ ≠ R450).
4. Statistical Tests that are used in Hypothesis Testing

1. One-Sample T-Test (Student’s T-Test)
2. Two-Sample T-Test
3. Chi-Square Test of Independence
4. Pearson's Correlation Analysis
5. Regression Analysis
5. Understanding p-values
1. Used to support or reject the null hypothesis
2. Smaller p-value, stronger evidence to reject the null
3. Smaller p-value, more significant results
4. P-value compared to the alpha
5. Alpha levels are related to confidence levels

Assumed standard in Social Science research:
Confidence level = 95%
100% – 95% = 5%
Therefore, α = 0.05

Compare the generated p-value to the alpha value:
Small p-value (≤ 0.05): reject the null hypothesis. Strong evidence
that the null hypothesis is invalid.
Large p-value (> 0.05): fail to reject the null hypothesis. Evidence for the
alternate hypothesis is weak.
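The decision rule can be written as a tiny Python function. The function name decide and the example p-values are illustrative only; the 95% confidence level and the resulting alpha of 0.05 come from the slide.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with alpha and return the decision about H0."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


# Alpha follows from the confidence level: 100% - 95% = 5%.
confidence_level = 0.95
alpha = round(1 - confidence_level, 2)

print(alpha)
print(decide(0.03, alpha))   # small p-value
print(decide(0.20, alpha))   # large p-value
```

Note that the wording is "fail to reject" rather than "accept": a large p-value means the sample did not provide enough evidence against the null hypothesis, not that the null hypothesis has been proven.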
6. One-Sample T-Test
1. Also called a Student’s T-Test
2. Allows for the comparison of a sample mean with a hypothesised
value for the population mean
3. Aim of the test is to test whether the means are statistically
different
4. Example …
Examples
1. Historical data shows that the average score on the Politics exam is 67%. Does
this differ from the current year Politics average exam score?
a. Null hypothesis (H0): There is no difference between the historical average score on the Politics exams
and the current average score OR the current average score on the Politics exam is 67%
b. Alternative hypothesis (H1): there is a statistically significant difference in the average exam scores
c. One-tailed or two-tailed test? Two-tailed test

2. Assume that hospital records show that on average babies weigh 3 kg at birth. We
have collected birth weight data during the lockdown period of the covid
pandemic and want to test if babies born during this lockdown period are born
at a lower weight compared to the historical birth weight records
a. Null hypothesis (H0): There is no difference between the average birth weight of babies and the birth
weight of babies born during covid
b. Alternative hypothesis (H1): on average, babies born during the covid pandemic have a lower birth
weight compared to those born before the pandemic
c. One-tailed or two-tailed test? One-tailed test
6. Assumptions of a One-Sample T-Test
1. Data is continuous and of a quantitative scale – interval and ratio
2. The sample should be randomly selected from the population
3. The samples are independent of each other
4. Data should be normally distributed
5. Parametric test
6. Assumes that the data does not include any outliers
7. Standard deviation of the population unknown
8. Sample size below 30
6. Practical example
Using dataset 1 from the ‘example1.xlsx’ data file:

1. Test whether the average test score for Test 3 is equal to 25
a. State the null and alternate hypothesis
i. Null – there is no difference between the average score for test 3 and 25
ii. Alternate – there is a difference between the average score for test 3 and 25 OR the average
score for test 3 is not equal to 25
iii. Is it a one-tailed or two-tailed test? Two-tailed test
b. Conclusion – there is a statistically significant difference between the average score for test 3 and 25

2. Calculate whether the average test score for Test 4 is greater than 30
a. State the null and alternate hypothesis
i. Null – The average test 4 score is not greater than 30 Or there is no difference between test 4
average and 30
ii. Alternate – The average test 4 score is greater than 30
iii. Is it a one-tailed or two-tailed test? One-tailed test
b. Conclusion – the average score for test 4 is statistically higher than 30
6. Interpretation of Results
1. The statistical test will produce various statistics
2. When interpreting the data, look out for the p-value, t-statistic and
critical value – all of these are generated in the statistical test
3. If the:
a. T-statistic is larger than the critical value (T-stat > Critical Value)
b. P-value is smaller than 0.05 (P-value < 0.05)
c. Then you reject the null hypothesis
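The comparison of the t-statistic with the critical value can be sketched by hand in Python. The ten scores below are hypothetical, not the Test 3 column from the workshop file, so the outcome says nothing about the real data. The t-statistic formula, t = (sample mean − hypothesised mean) / (s / √n), is the standard one-sample formula and is not printed on the slides; the critical value 2.262 is the two-tailed value for α = 0.05 with 9 degrees of freedom from a standard t table.

```python
import math
from statistics import mean, stdev

# Hypothetical scores standing in for Test 3.
test3_scores = [28, 31, 27, 30, 29, 32, 26, 30, 31, 29]
hypothesised_mean = 25

n = len(test3_scores)
degrees_of_freedom = n - 1

# Standard error of the mean: sample standard deviation divided by sqrt(n).
standard_error = stdev(test3_scores) / math.sqrt(n)
t_stat = (mean(test3_scores) - hypothesised_mean) / standard_error

# Two-tailed critical value for alpha = 0.05 and 9 degrees of freedom (t table).
critical_value = 2.262

# Two-tailed test: compare the absolute value of t with the critical value.
reject_null = abs(t_stat) > critical_value

print(round(t_stat, 2), critical_value, reject_null)
```

For a one-tailed test with the same 9 degrees of freedom and α = 0.05, the table value would be 1.833, and the sign of the t-statistic has to match the direction stated in the alternate hypothesis.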
7. Two-Sample T-Test
1. Also called the Independent Samples T-Test
2. Used to test whether the unknown means of two independent
samples are different (or the same, equal)
3. Two variables:
a. Variable one: defines the two groups
b. Variable two: measurement of interest
Examples
1. Assume that we have two groups of students. One group has English as a first language and the
second group has English as their second language. Both groups of students take a reading
test. The assumption is that there is a difference in the average test scores across both groups
of students.
a. Null hypothesis (H0): There is no difference between the test scores of the English first language and second language
groups of students
b. Alternative hypothesis (H1): There is a statistically significant difference in the test scores between the English first language
and second language groups of students
c. One-tailed or two-tailed test: Two-tailed test

2. Assume that we have a random sample of males and females from a school of interest, and we
want to test whether there is a difference in the average hours per week spent in the school
library. The assumption is that female students spend more time in the library per week
compared to male students.
a. Null hypothesis (H0): There is no difference in the average hours spent in the library per week by male and female students
b. Alternative hypothesis (H1): On average, female students spend more time in the library per week compared to male
students
c. One-tailed or two-tailed test: One-tailed test
7. Assumptions of a Two-Sample T-Test
1. Data values must be independent
2. Sample selected through simple random sampling
3. Data in each group are normally distributed
4. Parametric test
5. Data values are continuous, ratio or interval scale
6. The variances for the two groups are equal
7. Practical Examples
Using dataset 1 from the ‘example1.xlsx’ data file:

1. Test whether there is a difference in the mean score between Test 1 and Test 2
a. Null – There is no difference between the mean score of test 1 and test 2
b. Alternate – There is a statistically significant difference between the mean score of test 1 and test 2
c. Is it a one-tailed or two-tailed test: Two-tailed test
d. Conclusion – There is a statistically significant difference between the average scores of test 1 and test 2

2. Test whether there is a difference between the mean score of Test 2 and Test 4
a. Null – There is no difference between the mean score of test 2 and test 4
b. Alternate – There is a statistically significant difference between the mean score of test 2 and test 4
c. Is it a one-tailed or two-tailed test: Two-tailed test
d. Conclusion –
7. Interpretation of the Results
1. The statistical test will produce various statistics
2. When interpreting the data, look out for the p-value, t-statistic and
critical value – all of these are generated in the statistical test
3. If the:
a. T-statistic is larger than the critical value (T-stat > Critical Value)
b. P-value is smaller than 0.05 (P-value < 0.05)
c. Then you reject the null hypothesis
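A hand calculation of the two-sample t-statistic, assuming equal variances as listed in the assumptions, might look like the sketch below. Both groups are hypothetical and tiny so the arithmetic is easy to follow. The pooled-variance formula is the standard one for this test and is not printed on the slides; 2.306 is the two-tailed critical value for α = 0.05 with 8 degrees of freedom from a standard t table.

```python
import math
from statistics import mean, variance

# Hypothetical scores for two independent groups.
group_one = [12, 15, 11, 14, 13]
group_two = [16, 18, 17, 15, 19]

n1, n2 = len(group_one), len(group_two)
degrees_of_freedom = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances.
pooled_variance = (
    (n1 - 1) * variance(group_one) + (n2 - 1) * variance(group_two)
) / degrees_of_freedom

standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
t_stat = (mean(group_one) - mean(group_two)) / standard_error

# Two-tailed critical value for alpha = 0.05 and 8 degrees of freedom (t table).
critical_value = 2.306
reject_null = abs(t_stat) > critical_value

print(round(t_stat, 2), critical_value, reject_null)
```

In Excel, the Data Analysis ToolPak option "t-Test: Two-Sample Assuming Equal Variances" corresponds to this pooled version of the test.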
8. Chi-Square Test of Independence
1. Tests whether there is an association between two categorical
variables
2. Each variable should have at least 2 categories
3. Non-parametric test, nominal or ordinal scale
4. Does not assume that the data is normally distributed
5. Makes use of contingency tables to analyse the data
6. It does not show the strength or direction of the association
Examples
Test whether there is an association between gender and monthly
income
a. Null hypothesis (H0): there is no association between gender and monthly income in
the NIDS 2008 dataset
b. Alternative hypothesis (H1): there is an association between gender and monthly
income in the NIDS 2008 dataset
Average monthly income categories
Gender | 0 – 5000 | 5001 – 10 000 | 10 001 – 15 000 | 15 001 – 20 000 | Total
Male   |          |               |                 |                 |
Female |          |               |                 |                 |
Total  |          |               |                 |                 |
Examples
Test whether there is an association between geographic location and
employment type
a. Null hypothesis (H0): there is no association between geographic location and employment type
b. Alternative hypothesis (H1): there is an association between geographic location and
employment type
Employment type
Geographic location | Formal | Informal | Total
Urban               |        |          |
Rural               |        |          |
Farms               |        |          |
Total               |        |          |
8. Assumptions of a Chi-Square Test of Independence
1. Variables must be on the nominal or ordinal scale
2. Categories of variables must be mutually exclusive
3. Sample selected through simple random sampling
4. Data in the contingency table needs to be frequencies or counts
5. Using the observed frequencies in the contingency table, the
expected frequencies are calculated

Expected Frequencies = (row total*column total)/grand total
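The expected frequency formula can be applied to every cell of a contingency table with a short Python sketch. The observed counts below are hypothetical, laid out like the geographic location by employment type table (rows Urban, Rural, Farms; columns Formal, Informal).

```python
# Hypothetical observed counts: rows are Urban, Rural, Farms;
# columns are Formal, Informal.
observed = [
    [40, 20],
    [15, 25],
    [5, 15],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency = (row total * column total) / grand total, cell by cell.
expected = [
    [row_total * column_total / grand_total for column_total in column_totals]
    for row_total in row_totals
]

print(row_totals, column_totals, grand_total)
for row in expected:
    print(row)
```

The expected table always has the same row and column totals as the observed table; it shows what the counts would look like if the two variables were unrelated.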


8. Practical examples
Using dataset 1 from the ‘NIDS 2008_practice data.xlsx’
1. Test whether there is an association between gender and highest level of
education in the NIDS 2008 subsample data
a. Null –
b. Alternative –
c. Conclusion –

Using dataset 2 from the ‘NIDS 2008_practice data.xlsx’


2. Test whether there is an association between highest level of education
and employment status among the NIDS 2008 subsample
a. Null –
b. Alternate –
c. Conclusion –
8. Interpretation of the Results
Steps to follow when conducting a Chi-Square Test of Independence:
1. Calculate the expected frequency using the formula
2. Calculate the p-value
3. Interpret the p-value
4. Draw a conclusion

Interpreting the p-value:
If p-value < 0.05, reject the null hypothesis
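The four steps can be followed end to end in a small Python sketch. The observed counts are hypothetical (a 3 by 2 table such as location by employment type), so the result says nothing about the NIDS data. The chi-square statistic, the sum of (observed − expected)² / expected over all cells, is the standard formula and is not printed on the slides. The p-value shortcut used here, exp(−χ² / 2), is exact only when the table has 2 degrees of freedom; for other table sizes a chi-square table or a statistics package is needed.

```python
import math

# Hypothetical observed counts for a 3 x 2 contingency table.
observed = [
    [40, 20],
    [15, 25],
    [5, 15],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Steps 1 and 2: expected frequency per cell, then the chi-square statistic.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * column_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# The closed form below is valid only for exactly 2 degrees of freedom.
if degrees_of_freedom != 2:
    raise ValueError("this shortcut for the p-value needs 2 degrees of freedom")
p_value = math.exp(-chi_square / 2)

# Steps 3 and 4: interpret the p-value and draw a conclusion.
reject_null = p_value < 0.05

print(round(chi_square, 2), degrees_of_freedom, round(p_value, 5), reject_null)
```

In Excel, CHISQ.TEST takes the observed and expected ranges and returns the p-value directly, for any table size.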
End of day 2, thank you!
