
Statistical Concepts

Miguel Rodriguez, a Ph.D. student in Mathematics Education, is researching critical statistical literacy (CSL) in preservice teacher education to assess its benefits and implications on data modeling. The document outlines the use of CODAP for teaching statistics and provides an introduction to key statistical concepts, including descriptive and inferential statistics, measures of central tendency, and the construction of box plots. It emphasizes the importance of collaboration and creating a supportive learning environment while exploring statistical relationships and regression analysis.

Statistics

INTRODUCING CRITICAL STATISTICAL LITERACY


A little bit about me…
Hi, I am Miguel Rodriguez,
a second-year student in the Ph.D. program in Mathematics Education (PRIME).
I am also a student in the MSc in Applied Statistics.
I am currently developing my practicum research project, a necessary step in my Ph.D.
journey.
I am very interested in statistics education at the preservice teacher and undergraduate levels.
I have been a mathematics teacher for almost 10 years, with most of my experience in
secondary education.
I am also passionate about the performing arts, especially theatre and dance.
Title: Introducing critical statistical literacy in preservice teacher preparation,
imagining alternatives
Goals:
To identify possible benefits, impacts, limitations, etc., of CSL in preservice
teacher education.
To inform about participants' attitudes and appreciation of CSL in their future
practice.
To provide a panorama of the implications of CSL on participants' data modeling
process as a manifestation of statistical practice.
RQs:
How does preservice teachers' interest in implementing critical statistical literacy
(CSL) in their future practice vary after accessing CSL?
What uses of statistics and statistical arguments emerge in the data modeling
process of preservice teachers when they get access to CSL?
Some recommendations to take into account

Be willing to work with others.


Be open to sharing your contributions with the group; all opinions, questions,
and comments matter.
Be respectful, kind, and supportive.
Don’t be ashamed to ask when you need clarification. I’m more than happy
to provide support.
Please try not to miss the lessons. Your participation will enrich the
classroom.
Contribute to a safe space for everyone.
See this space as an opportunity to learn something new, and to learn in
community.
Intro: Getting started
with CODAP
Common Online Data Analysis Platform
CODAP (Common Online Data Analysis Platform) is a free, web-based, open-
source software designed for teaching students about dynamic data explorations.
It will be the primary technology tool used in these materials.

To get started with CODAP, please do the following:

1) In a web browser (Google Chrome is preferred), go to CODAP’s website at


http://codap.concord.org

2) Click the Try CODAP button in the top right corner to open CODAP in a new
tab.

3) In the ‘What Would you Like to Do?’ box, select ‘Open Document or Browse
Examples’, select ‘Getting started with CODAP’, then click Open.
4) Complete the five basic CODAP tasks listed on the screen. If you need
assistance with a task, click ‘Show me’. You’re finished with these tasks when all
task checkboxes have been checked.

5) Next you will add a ‘Body Mass Index’ attribute to the case table to learn how to
add an attribute based on a formula.

● Resize the table by dragging the right edge or a lower corner until you can see
all nine attributes.
● Add a new attribute to this table by clicking the grey plus button in the top right
corner of the table. Make sure the table is selected to see this option.
Type ‘BMI’ then press Enter to name the new attribute.

● Click on the BMI attribute heading, then select ‘Edit Formula’. Enter the
formula Mass/Height^2. You can find attribute names, like ‘Mass’ and
‘Height’, under the ---Insert Value--- button.

Then click on Apply.


6) Graph a dotplot of the BMI attribute. Notice there are two upper outlier
mammals with BMIs of 400 or greater (you can hover your mouse over a point in
the graph to see its value).

7) Select those points in the graph by dragging a rectangle around them and look
at the table to see what mammals they are. CODAP’s representations are linked
dynamically, so if you select items in one representation they are automatically
selected in all other representations.
8) Hide Selected Cases by clicking the Eye icon in the Graph Menu (make sure
you have the graph selected for the menu to appear). This will remove those two,
selected outliers from the graph.
9) Rescale the graph by moving the mouse to the x-axis where it changes to a
hand icon, then dragging to the right.

10) There are several methods for saving a CODAP file.

● Save a File to Google Drive by clicking on the menu in the top left corner
of the header bar, selecting ‘Save…’, selecting the Google Drive tab in the
prompt box (second option), then following the Google Drive dialogue.
● Save a File to a shared URL by clicking on the menu in the top left corner
of the header bar, selecting ‘Share…’, selecting ‘Get link to shared view’,
enabling sharing, then copying the displayed URL. To save additional work
done after initially saving a file, select ‘Share… > Update shared view’.
Initial Terminology
INTRODUCING CRITICAL STATISTICAL LITERACY
What is statistics?

“Statistics is the science of learning from data and of measuring, controlling, and
communicating uncertainty.” American Statistical Association (ASA)

“Statistics has three primary components: How best can we collect data? How
should it be analyzed? And what can we infer from the analysis?” (Diez, et al.,
2015)

What questions from current events or from your own life can you think of
that could be answered by collecting and analyzing data?
The Statistical Investigation Cycle

Source: GAISE II Pre-K-12 report (GAISEIIPreK-12_Full.pdf)
Population and Sample

“If a factory produces thousands of electronic components, instead of testing
each item, quality control teams might randomly sample a certain number of items
(e.g., 100 components) and check how many are defective. If they find that 4
out of 100 components are defective, they can estimate the proportion of defective
items in the entire production batch as 4%”.

[Diagram labels: Population, Parameter; Estimators; Sample, Statistic]
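The estimate in the quote can be sketched in R. The counts 4 and 100 come from the example; the simulated batch (10,000 components, 4% defective) and all variable names are assumptions made only for illustration.

```r
# Sample statistic from the factory example: 4 defective out of 100 sampled.
defective <- 4
n <- 100
p_hat <- defective / n   # sample proportion, used to estimate the population proportion
p_hat

# Hypothetical population (assumption): a batch of 10,000 components, 4% defective.
set.seed(1)
batch <- c(rep(1, 400), rep(0, 9600))   # 1 = defective, 0 = not defective
parameter <- mean(batch)                # population proportion (the parameter)
one_sample <- sample(batch, size = n)   # simple random sample, without replacement
statistic <- mean(one_sample)           # sample proportion (the statistic)
c(parameter = parameter, statistic = statistic)
```

Changing the seed draws a different sample, so the statistic varies from sample to sample while the parameter stays fixed.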
What Statistics?

Descriptive Statistics
“Consists of methods for organizing, displaying, and describing data by using tables, graphs, and summary measures”.
Mean, median, mode, SD, etc.

Inferential Statistics
“Consists of methods that use sample results to help make decisions or predictions about a population”.
Hypothesis testing, Type I and Type II errors, p-value, etc.
VARIABLE
“It is a characteristic or measurement that can be determined for each member of a population”.

Categorical
- Ordinal: level of satisfaction
- Nominal: party affiliation

Numerical
- Discrete: number of siblings
- Continuous: height

What kind of variable do you think “phone number” is?
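As a rough sketch, the four variable types can be represented in R as follows. Every value and column name below is hypothetical. Storing the phone number as text is one possible reading of the question, on the grounds that arithmetic on phone numbers has no meaning.

```r
# Hypothetical values (assumptions), one row per person.
people <- data.frame(
  satisfaction = factor(c("low", "high", "medium"),
                        levels = c("low", "medium", "high"),
                        ordered = TRUE),             # categorical, ordinal
  party    = factor(c("A", "B", "A")),               # categorical, nominal
  siblings = c(0L, 2L, 1L),                          # numerical, discrete
  height   = c(160.5, 172.0, 181.3),                 # numerical, continuous (cm)
  phone    = c("555-0101", "555-0102", "555-0103"),  # digits used as labels
  stringsAsFactors = FALSE
)
str(people)

# Ordered categories can be compared; nominal ones cannot.
people$satisfaction[1] < people$satisfaction[2]
```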
Measures of Central Tendency
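Example
Data about students’ height (in cm) from a classroom
Data set: (195, 170, 165, 165, 160)    Sample size (n) = 5

Sample mean: $\bar{x} = (195 + 170 + 165 + 165 + 160)/5 = 171$
Mode: 165
Median: 165 (the middle value of the sorted data 160, 165, 165, 170, 195)

Position of the median in the sorted data:
- n is odd: the value at position (n + 1) ÷ 2
- n is even: the average of the values at positions n ÷ 2 and (n ÷ 2) + 1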
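The three measures for the height data can be checked with a short R sketch. `mean()` and `median()` are built in; `stat_mode` is a hypothetical helper written here because base R has no function for the statistical mode.

```r
heights <- c(195, 170, 165, 165, 160)   # students' heights in cm

sample_mean <- mean(heights)
sample_median <- median(heights)        # handles both odd and even n

# Hypothetical helper: returns the most frequent value(s).
# Note that base R's mode() reports the storage type, not the statistical mode.
stat_mode <- function(v) {
  counts <- table(v)
  modes <- names(counts)[as.vector(counts) == max(counts)]
  as.numeric(modes)
}
sample_mode <- stat_mode(heights)

c(mean = sample_mean, median = sample_median, mode = sample_mode)
```

For an even sample size, for example `median(c(1, 2, 3, 4))`, R averages the two middle values.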
DATA MATRIX
[Figure: a data matrix in which each row is a case and each column is a variable.]
Bar Graph
Fruit:   Apple  Orange  Banana  Kiwifruit  Blueberry  Grapes
People:  35     30      10      25         40         5

-In a bar graph, the length of the bar for each category represents the number of observations in each category (frequency).

-Bars may be vertical or horizontal.

-We use bar graphs when we want to compare categories or show changes over time.

-Frequencies are shown on the Y-axis and the variable being compared is shown on the X-axis.

-Bars can also show the percentage of observations in each category, as is typical in pie charts.
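A bar graph of the fruit data can be sketched in base R. The counts come from the table; the axis labels and variable names are assumptions.

```r
# Frequencies from the fruit table.
fruit_counts <- c(Apple = 35, Orange = 30, Banana = 10,
                  Kiwifruit = 25, Blueberry = 40, Grapes = 5)

# Vertical bars: categories on the x-axis, frequencies on the y-axis.
barplot(fruit_counts, xlab = "Fruit", ylab = "People (frequency)")

# The same data expressed as percentages of all observations.
percentages <- 100 * fruit_counts / sum(fruit_counts)
round(percentages, 1)
```

Passing `horiz = TRUE` to `barplot()` draws horizontal bars instead.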
Box plot

-It uses boxes and lines to depict the distributions of one or more groups of numeric data.

- It is a type of chart that depicts a group of numerical data through their quartiles.

- It displays key summary statistics: median, quartiles, and potential outliers.
Activities proposed for today

- Activity 1

- Activity 2

- Activity 3
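Dispersion
Some measures of dispersion

SD: standard deviation (computed here dividing by n). MAD: mean absolute deviation.

For the data set (195, 170, 165, 165, 160):
Mean ± SD = 171 ± 12.41

Ex.1
Dataset: 3, 5, 6, 8, 11, 14, 17, 200
Mean = 33    SD = 63.27    MAD = 41.75

Ex.2
Dataset: 3, 5, 6, 8, 11, 14, 17, 24
Mean = 11    SD = 6.595    MAD = 5.5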
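The dispersion values can be reproduced with the R sketch below. It assumes the SD divides by n (the population form) and that MAD means the mean absolute deviation about the mean. `pop_sd` and `mean_abs_dev` are hypothetical helper names, written out because R's `sd()` divides by n - 1 and R's `mad()` is the median absolute deviation.

```r
# Population standard deviation: divides by n, unlike sd(), which divides by n - 1.
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

# Mean absolute deviation about the mean (not the same as R's mad()).
mean_abs_dev <- function(v) mean(abs(v - mean(v)))

heights  <- c(195, 170, 165, 165, 160)
with_24  <- c(3, 5, 6, 8, 11, 14, 17, 24)
with_200 <- c(3, 5, 6, 8, 11, 14, 17, 200)

summarise <- function(v) c(mean = mean(v), SD = pop_sd(v), MAD = mean_abs_dev(v))
round(rbind(heights  = summarise(heights),
            with_24  = summarise(with_24),
            with_200 = summarise(with_200)), 3)
```

Under these definitions the data set containing 200 has a mean absolute deviation of 41.75, and replacing 24 with 200 inflates both the SD and the MAD.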
Box plots
A box plot summarizes a data set using five summary statistics while
also plotting unusual observations, called outliers.

Five-number summary: the minimum, the maximum, and the three quartiles (Q1, Q2, Q3) of the data set being studied.

Q2 represents the second quartile, which is equivalent to the 50th percentile (i.e. the median).

Q1 represents the first quartile, which is the 25th percentile, and is the median of the smaller half of the data set.

Q3 represents the third quartile, or 75th percentile, and is the median of the larger half of the data set.

We calculate the variability in the data using the range of the middle 50% of the data:
Q3 - Q1, the interquartile range (IQR, for short).
Box plots

What do you notice?


Box plots
How to Build a Box Plot

Draw an axis (vertical or horizontal) and draw a scale.

Draw a dark line denoting Q2, the median.

Draw a line at Q1 and at Q3. Connect the Q1 and Q3 lines to form a rectangle.

The width of the rectangle corresponds to the IQR, and the middle 50% of the data is in this interval.

The whiskers attempt to capture all of the data remaining outside of the box, except outliers.

Is it possible to identify skew from the box plot?


Example 1
Consider the following data set:

5, 5, 9, 10, 15, 16, 20, 30, 80

Find the 5-number summary and identify how small or large a value would need to be to be considered an outlier. Are there any outliers in this data set?

min = 5    Q1 = 7    Q2 = 15    Q3 = 25    max = 80
IQR = Q3 - Q1 = 25 - 7 = 18
Q1 - 1.5*IQR = -20
Q3 + 1.5*IQR = 52
Example 2
Consider the following data set:

5, 8, 1, 19, 3, 1, 11, 18, 20, 5

Find the 5-number summary and identify how small or large a value would need to be to be considered an outlier. Are there any outliers in this data set?

Sorted data: 1, 1, 3, 5, 5, 8, 11, 18, 19, 20

min = 1    Q1 = 3    Q2 = 6.5    Q3 = 18    max = 20
IQR = Q3 - Q1 = 18 - 3 = 15
Q1 - 1.5*IQR = -19.5
Q3 + 1.5*IQR = 40.5
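Both worked examples can be checked with the R sketch below. It follows the slides' definition of Q1 and Q3 as the medians of the smaller and larger halves, leaving the middle value out of both halves when n is odd. The function names are hypothetical. R's built-in `quantile()` and `fivenum()` use different quartile conventions and can return different values for Q1 and Q3.

```r
# Five-number summary using the "median of each half" rule from the slides.
five_number_summary <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- floor(n / 2)              # size of each half (middle value excluded if n is odd)
  lower <- x[seq_len(half)]
  upper <- x[(n - half + 1):n]
  c(min = x[1], Q1 = median(lower), Q2 = median(x), Q3 = median(upper), max = x[n])
}

# Limits beyond which a value counts as an outlier under the 1.5 * IQR rule.
outlier_limits <- function(x) {
  s <- five_number_summary(x)
  iqr <- s[["Q3"]] - s[["Q1"]]
  c(lower = s[["Q1"]] - 1.5 * iqr, upper = s[["Q3"]] + 1.5 * iqr)
}

# Values falling outside those limits.
find_outliers <- function(x) {
  limits <- outlier_limits(x)
  x[x < limits[["lower"]] | x > limits[["upper"]]]
}

example1 <- c(5, 5, 9, 10, 15, 16, 20, 30, 80)
example2 <- c(5, 8, 1, 19, 3, 1, 11, 18, 20, 5)

five_number_summary(example1); outlier_limits(example1); find_outliers(example1)
five_number_summary(example2); outlier_limits(example2); find_outliers(example2)
```

With this rule, 80 in Example 1 lies above the upper limit of 52, while no value in Example 2 falls outside its limits.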
Rules of thumb for identifying outliers

There are two rules of thumb for identifying outliers:

• More than 1.5* IQR below Q1 or above Q3

• More than 2 standard deviations above or below the mean.

Which is more affected by extreme observations, the mean or median?

Is the standard deviation or IQR more affected by extreme observations?

The median and IQR are called robust estimates because extreme observations
have little effect on their values. The mean and standard deviation are much more
affected by changes in extreme observations.
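A quick R comparison illustrates the point, using the two dispersion data sets from earlier in the slides, which differ only in their largest value (24 versus 200). This sketch uses R's built-in `sd()` (which divides by n - 1) and `IQR()` (which uses R's default quantile rule), so the individual numbers may differ from hand calculations. The comparison between the two rows is what matters.

```r
without_extreme <- c(3, 5, 6, 8, 11, 14, 17, 24)
with_extreme    <- c(3, 5, 6, 8, 11, 14, 17, 200)

describe <- function(v) c(mean = mean(v), median = median(v), sd = sd(v), IQR = IQR(v))

comparison <- rbind(without_extreme = describe(without_extreme),
                    with_extreme    = describe(with_extreme))
round(comparison, 2)
```

The median and IQR columns are identical in both rows, while the mean and standard deviation change substantially.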
Relations Between Variables (bivariate analysis)

A pair of variables are either related in some way (associated) or not (independent). No pair of variables are both associated and independent.

Some examples of associated variables?

Some examples of independent variables?


Relations Between Variables

Educational attainment of householder    Estimated median income
No high school diploma                   36,230
High school, no college                  53,510
Some college                             71,420
Bachelor's degree or higher              123,000

Are these variables associated? How would you describe the association? Who is affecting whom?
Explanatory and response variable

Explanatory variable(s) (independent variable) → might affect → Response variable (dependent variable)

• Association doesn’t imply causation.

• Association is claimed in observational studies (no interference in how data arise).

• Causation is claimed in experimental studies (randomization, control group vs experimental group).
Plotting independent and dependent variables

Some trends can be found when plotting data.

https://isaim2018.cs.ou.edu/papers/ISAIM2018_Deebani_Kachouie.pdf
Simple Linear Regression Model

Lea (1965) discussed the relationship between mean annual temperature and a mortality index for a type
of breast cancer in women. The data taken from certain regions of Great Britain, Norway, and Sweden,
consist of the mean annual temperature (in degrees Fahrenheit), and a mortality index for neoplasms of
the female breast.

What should be the first step in analyzing any possible relationship between mean annual temperature and
mortality index?
Let’s make a scatter plot
What is this plot revealing?

This linear relationship can be expressed in the model:

$y = \beta_0 + \beta_1 x + \varepsilon$

β0 and β1 are the parameters (regression coefficients).
β0 is the intercept.
β1 is the slope: the change in y for a one-unit change in x.
x is the predictor.
y is the response.
ε is the random error term.
Least squares regression (LSE)

Residuals: the difference between the observed response $y_i$ and the fitted value $\hat{y}_i$.

The residuals are expressed:

$e_i = y_i - \hat{y}_i, \quad i = 1, \dots, N$

The best-fitted linear regression line minimizes the sum of squared residuals:

$\sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
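The sum of squared residuals can be sketched in R with the breast cancer data listed later in these slides. It assumes the first vector is the mortality index and the second the mean annual temperature, which is consistent with the fitted line reported in the slides. The function and variable names are hypothetical, and the closed-form estimates are the standard least squares formulas rather than anything shown in the original slides.

```r
mortality   <- c(102.5, 104.5, 100.4, 95.9, 87, 95, 88.6, 89.2,
                 78.9, 84.6, 81.7, 72.2, 65.1, 68.1, 67.3, 52.5)
temperature <- c(51.3, 49.9, 50, 49.2, 48.5, 47.8, 47.3, 45.1,
                 46.3, 42.1, 44.2, 43.5, 42.3, 40.2, 31.8, 34)

# Sum of squared residuals for a candidate line y = b0 + b1 * x.
sse <- function(b0, b1, x, y) {
  fitted <- b0 + b1 * x
  residuals <- y - fitted
  sum(residuals^2)
}

# Standard closed-form least squares estimates of slope and intercept.
b1_hat <- sum((temperature - mean(temperature)) * (mortality - mean(mortality))) /
  sum((temperature - mean(temperature))^2)
b0_hat <- mean(mortality) - b1_hat * mean(temperature)

# Any other line gives a larger sum of squared residuals.
sse_best    <- sse(b0_hat, b1_hat, temperature, mortality)
sse_steeper <- sse(b0_hat, b1_hat + 0.5, temperature, mortality)
sse_shifted <- sse(b0_hat + 5, b1_hat, temperature, mortality)
c(best = sse_best, steeper = sse_steeper, shifted = sse_shifted)
```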
Least squares regression (LSE)

The fitted model is

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

What is the fitted model then?


LSE Model for breast cancer mortality

The fitted regression line for the breast cancer mortality data is:

$\hat{M} = -21.79 + 2.36\,T$

β0 = -21.79
β1 = 2.36

How do you interpret β0 and β1?

What is the average mortality index due to breast cancer at a location that has a mean annual
temperature of 49 °F?

$\hat{M} = -21.79 + 2.36\,(49) = 93.85$
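The fit and the prediction can be reproduced with R's built-in `lm()` and `predict()`. The data are the two vectors listed in the correlation slide, assuming the first is the mortality index and the second the mean annual temperature. The data frame and column names are hypothetical.

```r
cancer <- data.frame(
  mortality   = c(102.5, 104.5, 100.4, 95.9, 87, 95, 88.6, 89.2,
                  78.9, 84.6, 81.7, 72.2, 65.1, 68.1, 67.3, 52.5),
  temperature = c(51.3, 49.9, 50, 49.2, 48.5, 47.8, 47.3, 45.1,
                  46.3, 42.1, 44.2, 43.5, 42.3, 40.2, 31.8, 34)
)

# Scatter plot first, then the least squares line on top of it.
plot(mortality ~ temperature, data = cancer,
     xlab = "Mean annual temperature (F)", ylab = "Mortality index")
fit <- lm(mortality ~ temperature, data = cancer)
abline(fit)

coef(fit)   # intercept and slope; they round to -21.79 and 2.36

# Predicted mortality index at a mean annual temperature of 49 F.
predicted <- predict(fit, newdata = data.frame(temperature = 49))
predicted
```

Because `predict()` uses the unrounded coefficients, its result differs slightly from the 93.85 obtained with the rounded values.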
Pearson (r) Correlation

Besides plotting, the Pearson correlation (r) is an effect size measure that can be used to assess the strength and
direction of the linear relationship between two variables.

r = 0 means there is no linear correlation
r = 1 means there is a perfect positive correlation
r = -1 means there is a perfect negative correlation

$r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i}$

$z_{x_i} = (x_i - \bar{x}) / SD_x$

$z_{y_i} = (y_i - \bar{y}) / SD_y$


What is the correlation between M and T for the breast cancer mortality data?

> x = c(102.5, 104.5, 100.4, 95.9, 87, 95, 88.6, 89.2, 78.9, 84.6, 81.7, 72.2, 65.1, 68.1, 67.3, 52.5)   # mortality index (M)

> y = c(51.3, 49.9, 50, 49.2, 48.5, 47.8, 47.3, 45.1, 46.3, 42.1, 44.2, 43.5, 42.3, 40.2, 31.8, 34)   # mean annual temperature (T)

• Using R's built-in functions in RStudio:

mean(x) = 83.34375
mean(y) = 44.59375
sd(x) = 15.04757
sd(y) = 5.583603
cor(x, y) = 0.8748544

• Creating your own function in RStudio:

> correlation <- function(x, y){
>   n = length(x)                  # number of (x, y) pairs
>   z = (x - mean(x))/sd(x)        # z-scores of x
>   w = (y - mean(y))/sd(y)        # z-scores of y
>   r = (1/(n - 1))*sum(z*w)       # average product of z-scores
>   return(r)}
> correlation(x, y)
0.8748544
