Data Analysis Training Workshop - Day 2 Presentation
• renaming variables
• changing the values of variables
• deleting variables
• creating new variables
Exercise: Coding and Labelling Variables
1. Using dataset 1 from the ‘example1.xlxs’ datafile:
a. Create a new variable called gender
b. Assign the code 1 to all males
c. Assign the code 2 to all females
d. What is the total number of males in the class?
e. What is the total number of females in the class?
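The coding steps in this exercise can be sketched in Python as well. This is only an illustration: the `sex` list below is made-up data, not the contents of dataset 1, and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical class list (NOT the real dataset 1 values).
sex = ["male", "female", "female", "male", "female"]

# Coding scheme from the exercise: 1 = male, 2 = female.
codes = {"male": 1, "female": 2}

# Create the new coded variable called gender.
gender = [codes[s] for s in sex]

# Count how often each code occurs.
counts = Counter(gender)
n_males = counts[1]
n_females = counts[2]
print(f"Males: {n_males}, Females: {n_females}")
```

In Excel, the same counts would typically come from a COUNTIF formula or a PivotTable on the coded column.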
$$\bar{X} = \frac{\sum X}{n}$$
Example
A household consisting of five household members with the following
ages: 45, 49, 25, 19, 11
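As a quick check of this example, the mean can be computed by summing the five ages and dividing by the number of household members. A minimal Python sketch (the variable names are illustrative):

```python
# Ages of the five household members from the example.
ages = [45, 49, 25, 19, 11]

# Mean = sum of the values divided by the number of values.
mean_age = sum(ages) / len(ages)
print(mean_age)  # 149 / 5 = 29.8
```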
Example 2
1. The heights of four basketball players, in inches, are: 76, 73, 80, 75
2. Arranging the data in ascending order gives: 73, 75, 76, 80.
3. Thus the median is 75.5
B. Characteristics of the Median
1. There is a unique median for each data set
2. It is not affected by extremely large or small values and is therefore
a valuable measure of central tendency when such values occur
3. It can be computed for ratio-level, interval-level, and ordinal-level
data
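The basketball example and the second characteristic can be checked with a short Python sketch. The extreme value of 800 is an invented outlier, used only to show that the median does not move while the mean does.

```python
import statistics

# Heights of the four basketball players, in inches.
heights = [76, 73, 80, 75]
print(sorted(heights))             # [73, 75, 76, 80]
median_height = statistics.median(heights)
print(median_height)               # 75.5 (average of 75 and 76)

# Replace the tallest value with an invented extreme value.
with_outlier = [76, 73, 800, 75]
median_outlier = statistics.median(with_outlier)
mean_original = statistics.mean(heights)
mean_outlier = statistics.mean(with_outlier)

# The median is unchanged; the mean is pulled towards the outlier.
print(median_outlier, mean_original, mean_outlier)
```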
C. The Mode
1. The mode is the value of the observation that appears the most
frequently
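A minimal sketch of finding the mode in Python, using made-up shoe sizes as the data (the values are an assumption for illustration only):

```python
import statistics
from collections import Counter

# Hypothetical observations (shoe sizes).
sizes = [7, 8, 8, 9, 8, 10, 7]

# The mode is the most frequently occurring value.
mode_size = statistics.mode(sizes)

# Counter shows how often each value appears.
frequencies = Counter(sizes)
print(mode_size, frequencies[mode_size])  # 8 appears 3 times
```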
Right-skewed (positively skewed) distribution: Mode < Median < Mean
1. Left-skewed distribution
Negatively skewed: the mean and median are to the left of the mode
Mean < Median < Mode
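The ordering for a negatively skewed distribution can be illustrated with a small made-up sample that has a long tail of low values. The data below are invented for this sketch.

```python
import statistics

# Hypothetical left-skewed sample: most values are high, one is very low.
scores = [1, 6, 8, 9, 10, 10, 10]

mean_score = statistics.mean(scores)      # 54 / 7, roughly 7.71
median_score = statistics.median(scores)  # 9
mode_score = statistics.mode(scores)      # 10

# For this sample: Mean < Median < Mode.
print(mean_score < median_score < mode_score)
```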
2. Standard Deviation
• Measure of dispersion
• Details the spread of the observed values around the mean
• Used in conjunction with the mean to summarise continuous data
• The standard deviation is influenced by the presence of outliers in a
dataset
• Low standard deviation = the data is closely dispersed around
the mean
• High standard deviation = observations are widely dispersed
around the mean, estimates are usually unreliable
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
Where,
s = sample standard deviation
Σ = sum of...
X̄ = sample mean
n = number of scores in sample
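The sample standard deviation formula can be worked through step by step in Python. This sketch reuses the household ages from the earlier mean example and cross-checks the manual result against the standard library's `statistics.stdev`.

```python
import math
import statistics

# Household ages from the earlier example.
ages = [45, 49, 25, 19, 11]
n = len(ages)

# Step 1: sample mean.
mean_age = sum(ages) / n

# Step 2: squared deviation of each value from the mean.
squared_deviations = [(x - mean_age) ** 2 for x in ages]

# Step 3: divide the sum by n - 1 and take the square root.
s = math.sqrt(sum(squared_deviations) / (n - 1))

# statistics.stdev uses the same n - 1 formula.
print(round(s, 2), round(statistics.stdev(ages), 2))
```

The result is roughly 16.53 years, which is large relative to the mean of 29.8 and reflects how widely the ages are spread.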
3. PivotTables
1. Summarise, sort, reorganise, group, count, total or average data
2. Cross-tabulations
3. Make comparisons and identify patterns and trends
4. Flexible table development
a. Transform rows into columns and columns into rows
b. Group data by different fields
3. Steps to build a PivotTable
1. Select the cells you want to include in the PivotTable
2. Select the Insert tab on the ribbon, then select PivotTable
3. A dialog box will appear. Under ‘Choose the data that you want to
analyze’, select ‘Select a table or range’
4. In Table/Range, verify the cell range.
5. Under ‘Choose where you want the PivotTable report to be placed’,
select ‘New worksheet’ to place the PivotTable in a new worksheet,
or ‘Existing worksheet’ and then select the location where you want the
PivotTable to appear.
6. Select OK
7. To add a field to your PivotTable, select the field name checkbox in
the PivotTable Fields pane.
8. To move a field from one area to another, drag the field to the
target area.
9. Selected fields are added to their default areas:
a. Non-numeric fields are added to rows
b. Date and time hierarchies are added to columns
c. Numeric fields are added to values
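What a PivotTable does (grouping by non-numeric fields and summarising a numeric field) can be sketched in plain Python. The records and field names below are invented for illustration and do not come from the workshop data files.

```python
from collections import defaultdict

# Hypothetical records: (gender, employment type, monthly income).
records = [
    ("Male", "Formal", 8000),
    ("Female", "Formal", 9500),
    ("Male", "Informal", 3000),
    ("Female", "Informal", 2500),
    ("Female", "Formal", 12000),
]

# Group by the two non-numeric fields (the "rows" and "columns" areas).
counts = defaultdict(int)
totals = defaultdict(float)
for gender, employment, income in records:
    counts[(gender, employment)] += 1
    totals[(gender, employment)] += income

# Summarise the numeric field (the "values" area) as an average.
averages = {key: totals[key] / counts[key] for key in counts}

for (gender, employment), n in sorted(counts.items()):
    print(gender, employment, n, averages[(gender, employment)])
```

Swapping the order of the grouping keys corresponds to transforming rows into columns in the PivotTable.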
Examples
Using data from the ‘Example 2.xlxs’ data file:
Example:
• H0: There is no difference in the average height of males and females
• H1: Males have a higher average height than females (in other words, on average,
males are taller than females)
3. Two-Tailed Test of Significance
A test is two-tailed when no direction is specified in the alternate
hypothesis H1, such as:
Example:
• H0: The mean amount spent by customers at Pick ‘n Pay in Grahamstown on
any day of the week is equal to R450.00 (µ = R450).
• H1: The mean amount spent by customers at Pick ‘n Pay in Grahamstown on
any day of the week is not equal to R450.00 (µ ≠ R450).
4. Statistical Tests that are used in Hypothesis Testing
2. Assume that hospital records show that, on average, babies weigh 3 kg at birth. We
have collected birth weight data during the lockdown period of the COVID-19
pandemic and want to test whether babies born during this lockdown period are born
at a lower weight compared to the historical birth weight records.
a. Null hypothesis (H0): There is no difference between the historical average birth weight and the average birth
weight of babies born during the COVID-19 lockdown
b. Alternative hypothesis (H1): On average, babies born during the COVID-19 lockdown have a lower birth
weight compared to those born before the pandemic
c. One-tailed or two-tailed test? One-tailed test
6. Assumptions of a One-Sample T-Test
1. Data is continuous and measured on a quantitative scale – interval or ratio
2. The sample should be randomly selected from the population
3. The samples are independent of each other
4. Data should be normally distributed
5. Parametric test
6. Assumes that the data does not include any outliers
7. Standard deviation of the population unknown
8. Sample size below 30
6. Practical example
Using dataset 1 from the ‘example1.xlxs’ data file:
2. Test whether the average score for Test 4 is greater than 30
a. State the null and alternate hypotheses
i. Null – The average Test 4 score is not greater than 30 (or: there is no difference between the Test 4
average and 30)
ii. Alternate – The average Test 4 score is greater than 30
iii. Is it a one-tailed or two-tailed test? One-tailed test
b. Conclusion – The average score for Test 4 is statistically significantly higher than 30
6. Interpretation of Results
1. The statistical test will produce various statistics
2. When interpreting the data, look out for the p-value, t-statistic and
critical value – all of these are generated in the statistical test
3. If the:
a. T-statistic is larger than the critical value (T-stat > Critical Value)
b. P-value is smaller than 0.05 (P-value < 0.05)
c. Then you reject the null hypothesis
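The decision rule for a one-sample t-test can be sketched in Python as follows. The scores are invented (they are not the Test 4 data), the function name is illustrative, and the critical value of 1.833 is read from a standard t-table for 9 degrees of freedom, one-tailed, at the 0.05 level. The standard library has no t-distribution, so this sketch compares against the critical value rather than computing a p-value.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t = (sample mean - hypothesised mean) / (s / sqrt(n))."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1)
    return (mean - mu0) / (s / math.sqrt(n))

# Hypothetical test scores for 10 students.
scores = [32, 35, 31, 36, 34, 33, 37, 30, 35, 37]

# H0: mean is not greater than 30; H1: mean is greater than 30.
t_stat = one_sample_t(scores, 30)

# Assumed critical value from a t-table: df = 9, one-tailed, alpha = 0.05.
critical_value = 1.833

reject_null = t_stat > critical_value
print(round(t_stat, 3), reject_null)
```

For these made-up scores the t-statistic is well above the critical value, so the null hypothesis would be rejected.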
7. Two-Sample T-Test
1. Also called the Independent Samples T-Test
2. Used to test whether the unknown population means of two independent
groups are different (or equal)
3. Two variables:
a. Variable one: defines the two groups
b. Variable two: measurement of interest
Examples
1. Assume that we have two groups of students. One group has English as a first language and the
second group has English as their second language. Both groups of students take a reading
test. The assumption is that there is a difference in the average test scores across both groups
of students.
a. Null hypothesis (H0): There is no difference between the test scores of the English first language and second language
groups of students
b. Alternative hypothesis (H1): There is a statistically significant difference in the test scores between the English first language
and second language groups of students
c. One-tailed or two-tailed test: Two-tailed test
2. Assume that we have a random sample of males and females from a school of interest, and we
want to test whether there is a difference in the average hours per week spent in the school
library. The assumption is that female students spend more time in the library per week
compared to male students.
a. Null hypothesis (H0): There is no difference in the average hours spent in the library per week by male and female students
b. Alternative hypothesis (H1): On average, female students spend more time in the library per week compared to male
students
c. One-tailed or two-tailed test: One-tailed test
7. Assumptions of a Two-Sample T-Test
1. Data values must be independent
2. Sample selected through simple random sampling
3. Data in each group are normally distributed
4. Parametric test
5. Data values are continuous, ratio or interval scale
6. The variances for the two groups are equal
7. Practical Examples
Using dataset 1 from the ‘example1.xlxs’ data file:
1. Test whether there is a difference in the mean score between Test 1 and Test 2
a. Null – There is no difference between the mean score of test 1 and test 2
b. Alternate – There is a statistically significant difference between the mean score of test 1 and test 2
c. Is it a one-tailed or two-tailed test: Two-tailed test
d. Conclusion – There is a statistically significant difference between the average scores of test 1 and test 2
2. Test whether there is a difference between the mean score of Test 2 and Test 4
a. Null – There is no difference between the mean score of test 2 and test 4
b. Alternate – There is a statistically significant difference between the mean score of test 2 and test 4
c. Is it a one-tailed or two-tailed test: Two-tailed test
d. Conclusion –
7. Interpretation of the Results
1. The statistical test will produce various statistics
2. When interpreting the data, look out for the p-value, t-statistic and
critical value – all of these are generated in the statistical test
3. If the:
a. T-statistic is larger than the critical value (T-stat > Critical Value)
b. P-value is smaller than 0.05 (P-value < 0.05)
c. Then you reject the null hypothesis
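A two-sample t-test under the equal-variances assumption can be sketched in Python using a pooled variance. The two groups of reading scores are invented, the function name is illustrative, and the critical value of 2.306 is read from a standard t-table for 8 degrees of freedom, two-tailed, at the 0.05 level.

```python
import math
import statistics

def pooled_t(a, b):
    """Two-sample t-statistic assuming equal group variances."""
    n1, n2 = len(a), len(b)
    var1, var2 = statistics.variance(a), statistics.variance(b)
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

# Hypothetical reading test scores.
first_language = [78, 82, 75, 80, 85]
second_language = [70, 74, 68, 72, 76]

t_stat = pooled_t(first_language, second_language)

# Assumed critical value from a t-table: df = 8, two-tailed, alpha = 0.05.
critical_value = 2.306

# Two-tailed test: compare the absolute value of t with the critical value.
reject_null = abs(t_stat) > critical_value
print(round(t_stat, 3), reject_null)
```

Because the test is two-tailed, the sign of the t-statistic does not matter for the decision, only its size.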
8. Chi-Square Test of Independence
1. Tests whether there is an association between two categorical
variables
2. Each variable should have at least 2 categories
3. Non-parametric test, nominal or ordinal scale
4. Does not assume that the data is normally distributed
5. Makes use of contingency tables to analyse the data
6. It does not show the strength or direction of the association
Examples
Test whether there is an association between gender and monthly
income
a. Null hypothesis (H0): there is no association between gender and monthly income in
the NIDS 2008 dataset
b. Alternative hypothesis (H1): there is an association between gender and monthly
income in the NIDS 2008 dataset
Average monthly income categories
Gender | 0 – 5000 | 5001 – 10 000 | 10 001 – 15 000 | 15 001 – 20 000 | Total
Male | | | | |
Female | | | | |
Total | | | | |
Examples
Test whether there is an association between geographic location and
employment type
a. Null hypothesis (H0): there is no association between geographic location and employment type
b. Alternative hypothesis (H1): there is an association between geographic location and
employment type
Employment type
Geographic location | Formal | Informal | Total
Urban | | |
Rural | | |
Farms | | |
Total | | |
8. Assumptions of a Chi-Square Test of Independence
1. Variables must be on the nominal or ordinal scale
2. Categories of variables must be mutually exclusive
3. Sample selected through simple random sampling
4. Data in the contingency table needs to be frequencies or counts
5. Using the observed frequencies in the contingency table, the
expected frequencies are calculated
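How expected frequencies are derived from the observed counts can be sketched in Python. The counts below are invented for a geographic location by employment type table, and the critical value of 5.991 is read from a standard chi-square table for 2 degrees of freedom at the 0.05 level.

```python
# Hypothetical observed counts.
# Rows: Urban, Rural, Farms. Columns: Formal, Informal.
observed = [
    [60, 40],
    [30, 50],
    [10, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = (row total * column total) / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Chi-square statistic = sum of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom = (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumed critical value from a chi-square table: df = 2, alpha = 0.05.
critical_value = 5.991
reject_null = chi_square > critical_value
print(round(chi_square, 3), df, reject_null)
```

For these made-up counts the statistic exceeds the critical value, so the null hypothesis of no association would be rejected.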