U1 Exploring One-Variable Data
3. Data is often taken from samples. These samples may represent a portion of a
larger group or a limited number of instances of a general phenomenon. We
can use samples to draw conclusions about larger groups & general events.
4. Probability also plays a big role in statistics. For example, any type of data
collection is subject to variation. If the same measurement were repeated,
then the answer would probably change. Statisticians attempt to understand
and control the sources of variation in any situation.
1.2 | Variables
1. A variable is a characteristic that changes from one individual to another
2. Categorical variables: these variables take on values that are category names
or group labels; responses can be separated into different categories
a. Frequency = counts
2. Relative frequency table: gives the proportion of cases falling into each
category
b. Distribution: of a variable tells us what values the variable takes and the
frequency of those values
a. y-axis: frequency
b. x-axis: categories
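A minimal Python sketch of a frequency table and a relative frequency table, in case seeing the arithmetic helps. The survey responses and the function name frequency_table are made up for illustration; they are not from these notes.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, proportion)} for a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, n / total) for cat, n in counts.items()}

# Hypothetical survey responses, made up for illustration.
drinks = ["coffee", "tea", "coffee", "water", "coffee", "tea", "water", "coffee"]
table = frequency_table(drinks)

# Print one row per category: name, frequency, relative frequency.
for category, (count, proportion) in table.items():
    print(f"{category:<8}{count:>3}{proportion:>8.3f}")
```

The proportions always add up to 1, which is why they are the safer choice when comparing samples of different sizes.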
Pie Charts
1. Also only used for categorical variables, and only with proportions
Note: it’s better to use proportions when comparing data sets from different
sample sizes (and kinda in general)
Two-Way/Contingency Tables
1. A frequency table used for organizing a data set with two categorical
variables (e.g. gender & drink choice), with total counts for each variable
3. A table whose row variable has n values and whose column variable has m
values is an n × m table. The row sums and column sums are the marginal totals
a. E.g. looking only at the answers for a certain response group and
analyzing the difference (e.g. how often girls answer coffee vs. how often
girls answer tea)
c. Formula: fix a row (or column) and divide each value in it by the total for
that same row (or column)
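A rough Python sketch of marginal totals and a conditional relative frequency, assuming that conditioning on a row means dividing each cell in that row by the row's total. The counts and the helper names (row_totals, column_totals, conditional_on_row) are hypothetical.

```python
# Hypothetical counts, made up for illustration: rows are gender, columns are drink.
two_way = {
    "girls": {"coffee": 12, "tea": 18},
    "boys": {"coffee": 20, "tea": 10},
}

def row_totals(table):
    """Marginal total for each row."""
    return {row: sum(cells.values()) for row, cells in table.items()}

def column_totals(table):
    """Marginal total for each column."""
    totals = {}
    for cells in table.values():
        for col, n in cells.items():
            totals[col] = totals.get(col, 0) + n
    return totals

def conditional_on_row(table, row):
    """Divide each cell in the chosen row by that row's total."""
    total = sum(table[row].values())
    return {col: n / total for col, n in table[row].items()}

girls = conditional_on_row(two_way, "girls")
print(row_totals(two_way), column_totals(two_way), girls)
```

Here the proportions for girls answer "of the girls, what fraction chose each drink?", which is a different question from "of the coffee drinkers, what fraction are girls?".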
Graph Rules
1. Quantitative graphs must start with an ordered number line that represents
all possible values a variable can take. The starting & ending numbers can be
anything, as long as they cover all the data
Dot Plots
Stemplots
3. This type of graph is useful since it can illustrate a bell-curve, and we can
visually see if there are any outliers
a. E.g. with the bird graph above, we can see that the average number of
birds is in the 10-20 range
4. Split stem & leaf plot: used for big groups of data. Numbers repeat in the stem
section, which organizes data (e.g. there are 30 different values and they all
lie within the 30-40 range)
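A small Python sketch of a basic (unsplit) stemplot, assuming non-negative whole numbers with the tens digit as the stem and the ones digit as the leaf. The bird counts are made up for illustration and are not the data from the original graph.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot as strings, e.g. '1 | 2 5 8'."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)  # stem = tens, leaf = ones
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

birds = [3, 12, 15, 18, 21, 27, 41]  # hypothetical counts
plot = stemplot(birds)
print("\n".join(plot))
```

Keeping the empty stem 3 in the output is what lets you spot the gap before the value 41.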
Back-to-Back Stemplots
2. Discrete
5. IMPORTANT: The stem column is ALWAYS the first digit. So for the leftmost
column you would read the leaves right→left, moving away from the stem
Histograms
2. x-axis: creates number line of all values and bins for them to fall into
3. y-axis: represents frequency of data values that fall into each interval
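A minimal Python sketch of how values get counted into histogram bins. It assumes equal-width bins where a value on a boundary goes into the bin to its right (one common convention, not stated in these notes). The prices and the function name bin_counts are hypothetical.

```python
def bin_counts(values, start, width, n_bins):
    """Count values into n_bins equal-width bins.

    Bin i covers start + i*width up to (but not including) start + (i+1)*width.
    Values outside all bins are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

prices = [10, 11, 11, 12, 13, 13, 14, 14, 14, 16, 17, 21]  # hypothetical
counts = bin_counts(prices, start=10, width=3, n_bins=4)

# x-axis: the bins; y-axis: the frequency in each bin.
for i, c in enumerate(counts):
    low = 10 + 3 * i
    print(f"{low}-{low + 2}: {'#' * c}")
```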
Boxplots
a. Shape
b. Centre
c. Spread (variability)
Shape
1. Skewed right (positive skew): the tail is longer on the right
a. Can peak at the centre, but can also cave-in at the centre
Centre
1. One value that can describe all of the data
Spread/Variability
1. Discuss in simple terms the range the data values fall in and where the
majority of the data falls, plus the variability (don’t forget units!)
2. Gaps: a region of a distribution between two data values where there are no
observed data. Can mean different things.
Statistics
1. Are used to summarize:
a. Centre of data
b. Will move towards tail with skewed data, even when there’s more data on
the other side. A few extreme numbers will impact it greatly.
i. E.g. if the answer is 5, then the median is located at the fifth position
of your data set (in increasing order.) It does NOT mean that the
median value is 5.
ii. If the answer contains a decimal, then it is in-between the two closest
values. E.g. 3.5 means the median is between the 3rd and 4th position.
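A short Python sketch of the "position, not value" idea. The formula itself did not survive in these notes, so this assumes the usual median location formula, position = (n + 1) / 2, which matches the 5 and 3.5 examples above. The data sets are made up.

```python
def median_position(n):
    """Location of the median in a sorted list of n values (1-based)."""
    return (n + 1) / 2

def median(values):
    ordered = sorted(values)
    pos = median_position(len(ordered))
    lower = ordered[int(pos) - 1]  # int() truncates, so 3.5 -> 3rd value
    if pos == int(pos):
        return lower  # whole-number position: the median is that value
    # Decimal position: average the two neighbouring values.
    return (lower + ordered[int(pos)]) / 2

nine_values = [1, 3, 5, 7, 9, 11, 13, 15, 40]  # hypothetical
six_values = [2, 4, 6, 8, 10, 12]              # hypothetical

print(median_position(len(nine_values)), median(nine_values))
print(median_position(len(six_values)), median(six_values))
```

With nine values the position is 5 but the median value is 9, which is exactly the mix-up the example warns about.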
1. STAT→1
3. STAT→CALC→1
5. FreqList is blank
6. Calculate
c. After you get an answer, add up the frequency values until you get to a
“bin” that includes the median location.
d. For this graph, we now know that the median is located in the $13-15 bin.
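The "add up the frequencies" step can be sketched in Python like this. The bin labels and frequencies are made up (the original graph is not reproduced in these notes), and the median location is assumed to be (n + 1) / 2.

```python
def median_bin(bin_labels, frequencies):
    """Return (label of the bin holding the median, median position)."""
    n = sum(frequencies)
    position = (n + 1) / 2
    running = 0
    for label, freq in zip(bin_labels, frequencies):
        running += freq  # cumulative frequency so far
        if running >= position:
            return label, position
    raise ValueError("frequencies must contain at least one observation")

labels = ["$10-12", "$13-15", "$16-18", "$19-21"]  # hypothetical bins
freqs = [4, 5, 2, 1]                               # hypothetical counts

print(median_bin(labels, freqs))
```

One caveat: if the position is a decimal and the two middle values sit in different bins, this sketch reports the bin of the upper one.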
3. Tip: now that you know the position of the median, you can gauge the location
of the mean. In this case, because the graph skews right, the mean will be
higher than the median.
4. For symmetric & bimodal graphs, both the mean and median will fall in the
centre
5. For skewed graphs, the median will still split the data in half (50% of data
below, 50% above)
a. Percentile
2. Percentile: The pth percentile is interpreted as the value that has p% of the
data less than it.
a. E.g. being at the 85th percentile for SAT scores. This means that 85% of
students scored below your score, and about 15% scored the same or better.
(So you sit 85% of the way up the dataset.)
c. The median will always be the 50th percentile, with 50% of data below it.
d. To find a data value’s percentile (and quartile): count how many values
are at or below the value of interest. Then divide by the total number of
values.
e. The Q1 & Q3 are median points for the lower & upper half of a dataset,
respectively
4. The same steps apply on the calculator to find percentiles & quartiles in data.
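A Python sketch of the by-hand percentile and quartile steps. It follows the "count at or below, then divide by the total" rule, and it assumes the median is left out of both halves when n is odd (a common classroom convention; other software may differ). The scores are made up.

```python
from statistics import median

def percentile_of(values, x):
    """Percent of values at or below x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)

def quartiles(values):
    """Return (Q1, median, Q3); needs at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # lower half, median excluded if n is odd
    upper = ordered[(n + 1) // 2 :]  # upper half, median excluded if n is odd
    return median(lower), median(ordered), median(upper)

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # hypothetical
print(percentile_of(scores, 85))
print(quartiles(scores))
```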
a. Range
b. Interquartile Range
c. Standard Deviation
3. IQR: measures the spread of the middle 50% of the data only. Aka the
difference between the third and first quartiles (IQR = Q3 − Q1).
c. Smaller IQR means middle 50% of data is clustered together, not spread
out
4. Standard Deviation (s): the typical distance a data value is from the mean
a. Small standard deviation means most data is very close to the mean, not
spread out
b. Large standard deviation means that most data is far from the mean, thus
more spread out
c. Most data is within one standard deviation of the mean (e.g. if s = 2, most
values lie within 2 above or below the mean)
d. E.g. s = 4 and x̄ = 10. Most data is from 6 to 14. (Add/subtract s from x̄.)
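A quick Python sketch of the standard deviation arithmetic, assuming s is the sample standard deviation (dividing by n − 1, which is what the lowercase s usually means). The data set is made up so that it reproduces the s = 4, x̄ = 10 example.

```python
from math import sqrt

def sample_std(values):
    """Sample standard deviation: sqrt of squared deviations over (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_devs = sum((v - mean) ** 2 for v in values)
    return sqrt(squared_devs / (n - 1))

data = [4, 8, 10, 10, 12, 16]  # hypothetical
x_bar = sum(data) / len(data)
s = sample_std(data)

# "Most data" interval: one standard deviation either side of the mean.
low, high = x_bar - s, x_bar + s
within = [v for v in data if low <= v <= high]
print(x_bar, s, (low, high), within)
```

In this made-up set, 4 of the 6 values land between 6 and 14, which is the sense in which "most" data is within one s of the mean.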
a. The mean and median will be the same in the centre, but graph A will
have a larger spread
★ Important Reminders ★
1. Because the mean is pulled by skew and outliers, s is affected too, since
s is calculated around the mean. This means s can come out bigger than the
typical spread really is, so it's better to use the median and IQR for those
graphs (they only depend on the middle of the data.)
2. Quartile method (official): used if the quartiles (Q1 & Q3) are given
a. Find the upper and lower fences: any value higher than the upper fence or
lower than the lower fence is an outlier
b. Finding fences:
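The fence formulas did not survive in these notes. The sketch below assumes the standard 1.5 × IQR rule (lower fence = Q1 − 1.5 × IQR, upper fence = Q3 + 1.5 × IQR). The quartiles, the data values, and the function names are hypothetical.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) using the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outliers(values, q1, q3):
    """Values strictly below the lower fence or strictly above the upper fence."""
    low, high = fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Hypothetical example: Q1 = 65 and Q3 = 90 are given.
low_fence, high_fence = fences(65, 90)
flagged = outliers([20, 55, 70, 90, 130], 65, 90)
print(low_fence, high_fence, flagged)
```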
Incorporated Example
a. Measures of centre (mean & median): will be moved by the same amount
c. Measures of variability (range, IQR & S): will not be affected at all
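A small Python check of these two statements, assuming they describe adding (or subtracting) the same constant to every data value. The data set and the shift of 5 are made up.

```python
from statistics import mean, median, stdev

def summary(values):
    """Centre and variability measures for a list of numbers."""
    ordered = sorted(values)
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "range": ordered[-1] - ordered[0],
        "s": stdev(ordered),  # sample standard deviation
    }

data = [4, 8, 10, 10, 12, 16]     # hypothetical
shifted = [v + 5 for v in data]   # add 5 to every value

before = summary(data)
after = summary(shifted)
print(before)
print(after)
```

The mean and median both go up by 5, while the range and s come out the same, because sliding every point by the same amount does not change the distances between them.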
a. Step 1 (State): Ask a question that can be answered with sample data.