U1 Exploring One-Variable Data

Uploaded by

jamesqiaolei
Copyright: © All Rights Reserved
🗃️

U1: Exploring One-Variable Data


1.1 Intro | What is Statistics?
1. Put simply, it’s drawing data from a sample or population and drawing
conclusions with it.

2. More specifically: statistics is the science/branch of applied mathematics
behind developing and studying methods of collecting, analyzing &
interpreting data.

3. Data is often taken from samples. These samples may represent a portion of a
larger group or a limited number of instances of a general phenomenon. We
can use samples to draw conclusions about larger groups & general events.

4. Probability also plays a big role in statistics. For example, any type of data
collection is subject to variation. If the same measurement were repeated,
then the answer would probably change. Statisticians attempt to understand
and control the sources of variation in any situation.

1.2 | Variables
1. A variable is a characteristic that changes from one individual to another

a. Individual: can be a person, place, thing…

2. Categorical variables: these variables take on values that are category names
or group labels; responses can be sorted into different categories

a. E.g. age groups, college majors, zip codes (an exception: zip codes are
numbers, but they only label a place and can’t be meaningfully averaged)

b. →Answers to these questions are words



3. Quantitative variables: these variables take on numerical values for a
measured or counted quantity

a. E.g. height, age, # of blue skittles in a packet

b. →Answers can be measured or counted, with a unit afterwards & can be
averaged

1.3 | Categorical Data: Frequency Tables


1. Frequency tables give the number of cases falling into each category, AKA a
list of the different categories & their counts

a. Frequency = counts

2. Relative frequency table: gives the proportion of cases falling into each
category

a. Proportion: part divided by whole. Same as decimal form of a percentage.

b. Puts frequencies relative to the whole

3. Vocab: percentage, relative frequency & rate = a proportion written in different
ways

a. E.g. 50%, 1/2, 0.50

b. Distribution: the distribution of a variable tells us what values the variable
takes and how often it takes each value
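A minimal Python sketch of a frequency table and a relative frequency table. The list of majors is made-up data for illustration, and the variable names are assumptions, not part of the notes.

```python
from collections import Counter

# Hypothetical categorical data: college majors of 8 students (made up).
majors = ["Math", "Art", "Math", "Biology", "Art", "Math", "Biology", "Math"]

# Frequency table: category -> count.
frequency = Counter(majors)

# Relative frequency table: category -> proportion (part divided by whole).
total = sum(frequency.values())
relative_frequency = {category: count / total for category, count in frequency.items()}

# Print category, count and proportion side by side.
for category, count in frequency.items():
    print(f"{category:<8} {count:>2}  {relative_frequency[category]:.2f}")
```

The proportions always add up to 1, which is a quick check that the table was built correctly.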

1.4 | Representing Categorical Variables with Graphs
Bar Graphs (barplots)

1. Only used for categorical variables, showing frequencies or relative frequencies

a. y-axis: frequency

b. x-axis: categories

2. Relative frequency bar graph: represents relative frequencies (proportions) on the y-axis



Mosaic Plots

Pie Charts

1. Also only used for categorical variables, and only with proportions

Note: it’s better to use proportions when comparing data sets from different
sample sizes (and kinda in general)

+Categorical (qual) graphs don’t need their categories in any specific order

Two-Way/Contingency Tables

1. A frequency table used for organizing data, for a data set with two categorical
variables (e.g. gender & drink choice), with totals for every row and column

a. Variable at top, values for each variable below

2. Used for understanding a relationship between categorical variables

3. A table with n rows (the values of one variable) and m columns (the values of
the other variable) is an n × m table. The row sums and column sums are the
marginal totals

4. Conditional distributions: going down one column or across one row to keep that
variable constant (thus conditional). This lets you analyze the answers within
one group

a. E.g. looking only at the answers for a certain response group and
analyzing the difference (e.g. girls who drink coffee vs. girls who drink tea
answer frequency)

b. Can be used to find conditional probabilities

c. Formula: fix one row (or column) and divide each value in it by the total for
that row (or column)

5. Marginal distributions: looking at only the totals for more contextual
information.

a. Formula: divide the row or column totals by the overall total
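A short Python sketch of marginal and conditional distributions for a two-way table. The counts, the row/column labels and all variable names are made up for illustration.

```python
# Hypothetical two-way table (counts are made up): rows = gender, columns = drink.
table = {
    "girl": {"coffee": 12, "tea": 18},
    "boy": {"coffee": 20, "tea": 10},
}

# Marginal totals: sum across each row and down each column.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {}
for cells in table.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

# Marginal distribution of drink: column totals divided by the overall total.
marginal_drink = {col: total / grand_total for col, total in col_totals.items()}

# Conditional distribution of drink given "girl": fix that row, divide by its total.
conditional_girl = {
    col: count / row_totals["girl"] for col, count in table["girl"].items()
}

print(marginal_drink)
print(conditional_girl)
```

Comparing `conditional_girl` with the same calculation for the "boy" row is how you would check for a relationship between the two variables.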

1.5 | Representing a Quantitative Variable with Graphs
1. A lot more graphs to use

2. Two types of quan. variables

a. Discrete quantitative variable: variables can take on a countable number
of values. Can be finite or countably infinite (but realistic).

i. E.g. # of wins, # of skittles in a packet

b. Continuous quantitative variable: variables can take on infinitely many
values that cannot be counted, but measured. Like on a number line, the
possible values are endless; there is always another value in between any two
values, no matter how small the interval

i. E.g. concentration of salt in a water sample, height, weight

ii. Because it’s measured, the decimal can go on forever

Graph Rules

1. Quan graphs must start with an ordered number line that represents all
possible values a variable can take. The starting & ending number can be
whatever, as long as it covers all the data

2. The graphs must also represent the frequency of each value

Dot Plots

1. Best used for discrete variables (whole numbers)

2. Don’t necessarily need a y-axis

Stemplots

1. Best with discrete variables

2. Each leaf represents one data value, so repeated digits are allowed

3. Must have a key with units

4. Constructing a stemplot: take a list of numbers, put the first digit(s) of each
number as the stem and the final digit as the leaf. If the number is 7, then
put 0 as a stem and the 7 as a leaf.

5. This type of graph is useful since it can show the shape of the data (e.g. a
bell curve), and we can visually see if there are any outliers

a. E.g. with the bird graph above, we can see that the typical number of
birds is in the 10-20 range

6. Split stem & leaf plot: used for big groups of data. Each stem is repeated
(e.g. once for leaves 0-4 and once for leaves 5-9), which spreads the data out
(e.g. there are 30 different values and they all lie within the 30-40 range)
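The construction steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name `stemplot` and the bird counts are made up, and the last digit is used as the leaf.

```python
from collections import defaultdict

def stemplot(values):
    """Build stemplot rows for non-negative integers: last digit = leaf, the rest = stem."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 7 -> stem 0, leaf 7
        leaves[stem].append(str(leaf))
    # Include empty stems so gaps in the data stay visible.
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem} | {' '.join(leaves[stem])}")
    return rows

# Hypothetical bird counts (made up for illustration).
counts = [7, 12, 15, 15, 18, 21, 24, 43]
for row in stemplot(counts):
    print(row)
print("Key: 1 | 2 = 12 birds")
```

The empty row for stem 3 makes the gap before 43 visible, which is how a possible outlier shows up on a stemplot.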

Back-to-Back Stemplots

1. Also needs a key

2. Best with discrete variables

3. Used to compare two different sets of data

4. Same perks as a regular stem & leaf plot

5. IMPORTANT: The shared stem column is ALWAYS the first digit. So for the leaves
on the left side, you read right→left (outward from the stem)

Histograms

1. Best for continuous variables

2. x-axis: creates number line of all values and bins for them to fall into

3. y-axis: represents frequency of data values that fall into each interval

4. No spaces between bars; a gap only occurs when there are no values in a bin

5. Doesn’t give specific values, just like bar graphs

6. There can also be relative frequency histograms

Boxplots

Sample Questions on Graph Interpretation


1. Finding the proportion of values beyond a certain point



1.6 | Describing the Distribution of a Quantitative Variable
1. Three key features to examine:

a. Shape

b. Centre

c. Spread (variability)

2. Can also have outliers

Shape
1. Skewed right (positive skew): right tail is longer than the left

a. More data on the left and less data on the right, so the tail stretches to
the right

2. Skewed left (negative skew): left tail is longer than the right

3. Symmetric: left & right halves are roughly mirror images

a. Can peak at the centre, but can also dip at the centre

4. Peaks: unimodal (one peak), bimodal (two peaks) & uniform (no noticeable peak)

a. Peaks: where the most frequent values are

b. Usually stated together with the symmetry/skew descriptor (e.g. unimodal &
symmetric)

Centre
1. One value that can describe a typical data value

a. Usually the mean or median (the value with the most frequency is the mode,
which marks a peak)

Spread/Variability
1. Discuss in simple terms the range the data values fall in and where the
majority of the data falls, plus the variability (don’t forget units!)

2. Compare two different distributions of the same variable

3. Interpretation: A had more spread, B had less variability, so B is probably
more consistent

Outliers & Gaps


1. Outliers: Data points that are unusually large/small relative to the rest of the
data. A lack of outliers is also noticeable/important

2. Gaps: a region of a distribution between two data values where there are no
observed data. Can mean different things.

*Remember that every aspect of a graph is of note



1.7 | Summary Statistics for a Quantitative Variable
1. Quan variables can be represented with a graph that can be analyzed &
summarized with words.

2. However, data can also be analyzed & summarized with numbers.

3. A number that describes or summarizes a set of quan data from a sample is a
statistic.

4. A number that describes or summarizes a set of quan data from a population
is a parameter.

5. Both do the same job; the only difference is sample vs. population

Statistics
1. Are used to summarize:

a. Centre of data

b. Other positions in the data

c. The spread of data

d. Identify outliers in data

2. No single statistic tells us the shape of the data, but statistics can hint at it

Summarizing the Centre of a Data Set


1. Mean: the sum of all data values divided by the number of data values. The
symbol for the sample mean is “x-bar” (x̄).

a. The mean is easily impacted by outliers.

b. Will move towards tail with skewed data, even when there’s more data on
the other side. A few extreme numbers will impact it greatly.



c. Sits in the middle with symmetric data (e.g. bell curve)

2. Median: dead-centre of a dataset.

a. AKA the second quartile.

b. Not impacted by outliers or extreme values; it only represents the midpoint
of a dataset. However, the data must be ordered from smallest to greatest.

c. Median POSITION formula: (n+1)/2

i. E.g. if the answer is 5, then the median is located at the fifth position
of your data set (in increasing order.) It does NOT mean that the
median value is 5.

ii. If the answer contains a decimal, then it is in-between the two closest
values. E.g. 3.5 means the median is between the 3rd and 4th position.
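A small Python sketch of the median position formula (n+1)/2, next to the mean. The function name `median_by_position` and the data are assumptions made for illustration; the built-in `statistics.median` is used only as a cross-check.

```python
from statistics import mean, median

def median_by_position(values):
    """Return (position, median) using the (n + 1) / 2 position formula."""
    ordered = sorted(values)                 # data must be in increasing order
    position = (len(ordered) + 1) / 2        # the POSITION, not the median value
    if position.is_integer():
        return position, ordered[int(position) - 1]   # positions start at 1
    lower = ordered[int(position) - 1]       # e.g. 3.5 -> the 3rd value
    upper = ordered[int(position)]           # ... and the 4th value
    return position, (lower + upper) / 2

# Hypothetical data (made up); 50 acts as an outlier.
data = [3, 9, 4, 7, 5, 6, 50]
position, med = median_by_position(data)
print(position, med, mean(data))
assert med == median(data)
```

In this made-up data set the outlier pulls the mean well above the median, which stays at the middle value.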

🍘 CALC TUTORIAL FOR FINDING THE MEAN & MEDIAN:

1. STAT→1

2. Input list of numbers

3. STAT→CALC→1

4. List: L1, L2… (2nd # to choose)

5. FreqList is blank

6. Calculate

7. Mean is first, scroll down for “Med”

Finding the Mean and Median in a Graph


1. Mean: you cannot, since you don’t have exact numbers.

a. HOWEVER, we can use graph analysis techniques to gauge where it
might be

2. Median: you still can’t determine the exact value.

a. HOWEVER, you can determine its LOCATION using the position formula (n+1)/2

c. After you get an answer, add up the frequency values until you get to a
“bin” that includes the median location.

d. For this graph, we now know that the median is located in the $13-15 bin.

3. Tip: now that you know the position of the median, you can gauge the location
of the mean. In this case, because the graph skews right, the mean will be
higher than the median.

4. For symmetric graphs (including symmetric bimodal ones), both the mean and
median will fall in the centre

5. For skewed graphs, the median still splits the data in half (50% of data below,
50% above), while the mean is pulled towards the tail
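Locating the median bin from a histogram can be sketched in Python by adding up frequencies until the median position is reached. The bins and counts below are made up (they are not the values from the graph in these notes), and the function names are assumptions.

```python
import math

def bin_of(histogram, k):
    """Label of the bin containing the k-th smallest value (k starts at 1)."""
    running = 0
    for label, freq in histogram:
        running += freq          # cumulative frequency so far
        if running >= k:
            return label
    raise ValueError("k is larger than the number of values")

def median_bins(histogram):
    """Return the median position and the bin(s) holding the middle value(s)."""
    n = sum(freq for _, freq in histogram)
    position = (n + 1) / 2
    # For a position like 13.5, check both the 13th and the 14th value.
    return position, bin_of(histogram, math.floor(position)), bin_of(histogram, math.ceil(position))

# Hypothetical histogram as (bin label, frequency) pairs.
bins = [("$10-12", 6), ("$13-15", 11), ("$16-18", 5), ("$19-21", 2), ("$22-24", 1)]
print(median_bins(bins))
```

If the two returned labels differ, the median falls on the boundary between two bins.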



Measuring Other Positions with Numbers
1. Includes

a. Percentile

b. First quartile, second quartile & third quartile

c. Minimum & maximum

2. Percentile: The pth percentile is interpreted as the value that has p% of the
data less than or equal to it.

a. E.g. being at the 85th percentile for SAT scores. This means that 85% of
students scored at or below your score, and that 15% scored better than you.

b. There is another name for being at or above.

c. The median will always be the 50th percentile, with 50% of data below it.

d. To find a data value’s percentile (and quartile): count how many values
are at or below the value of interest. Then divide by the total number of
values (and multiply by 100 to get a percent).



3. Quartiles: names for distinguishing specific percentiles

a. First (Q1): 25th percentile

b. Second (Q2): 50th percentile

c. Third (Q3): 75th percentile

d. There is 50% of data between Q1 & Q3

e. The Q1 & Q3 are median points for the lower & upper half of a dataset,
respectively

f. Altogether, the quartiles divide the dataset into four even blocks of 25%

4. The same steps apply on calc to find percentiles & quartiles in data.

a. Five-number summary: min, Q1, med, Q3, max
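A Python sketch of the percentile count and the five-number summary, treating Q1 and Q3 as medians of the lower and upper halves as described above. The function names and scores are made up; software packages may use slightly different quartile conventions, so results can differ a little from other tools.

```python
from statistics import median

def percentile_of(values, x):
    """Percent of values at or below x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)

def five_number_summary(values):
    """Min, Q1, median, Q3, max, with Q1/Q3 as medians of the lower/upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # values below the median position
    upper = ordered[half + n % 2:]      # skip the middle value when n is odd
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

# Hypothetical test scores (made up).
scores = [55, 60, 62, 70, 71, 75, 80, 84, 90, 98]
print(percentile_of(scores, 84))
print(five_number_summary(scores))
```

Here 8 of the 10 made-up scores are at or below 84, so 84 sits at the 80th percentile of this data set.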

Finding Percentiles & Quartiles in Graphs


1. Still, you can’t find exact values, but you can locate them

2. In the example, Q1 is between the 25th & 26th value (counting up from the
smallest)

a. TIP: Think of quartiles as medians of each half

3. Count down from the top to the 25th & 26th value for Q3



Dot Plot Example

4. Means that 44% of cars had fuel economy at or below 30mpg



Measuring the Variability (Spread) of Data
1. A single number that tells you how much the data varies

a. Range

b. Interquartile Range

c. Standard Deviation

2. Range: the diff between max and min value.

a. Bigger range means data is more spread out

b. Smaller range means data is less spread out

c. Easily influenced by outliers, so may be misleading. Thus it is not used
often to describe spread of data

3. IQR: measure of the spread of middle 50% of data only. Aka the difference
between first and third quartiles, the spread of Q1-Q3.

a. Not influenced by outliers (good)

b. Finding IQR: Q3-Q1

c. Smaller IQR means middle 50% of data is clustered together, not spread
out

d. Larger IQR means the middle 50% is very spread out

4. Standard Deviation (s): the typical distance a data value is from the mean

a. Small standard deviation means most data is very close to the mean, not
spread out

b. Large standard deviation means that most data is far from the mean, thus
more spread out

c. For roughly bell-shaped data, most data is within one standard deviation of
the mean (e.g. s=2 means most data is within plus or minus 2 of the mean)

d. E.g. s=4 and x̄=10. Most data is from 6-14. (Add/subtract s from x̄)

e. Not all! Values far from the mean, such as large outliers, fall outside this
interval

f. How to find: subtract the mean from each value, square each difference, add
the squares, multiply by 1/(n-1), take the square root of everything, i.e.
s = √( Σ(x − x̄)² / (n − 1) )… never have to actually calculate this by hand

g. Mean 🤝 standard deviation (always together)


h. Sx on calc is standard deviation
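The three measures of spread can be sketched in Python. The data and names are made up for illustration; `sample_sd` spells out the usual sample standard deviation (squared deviations from the mean, divided by n − 1), and `statistics.stdev` is used only as a cross-check. Q1 and Q3 are taken as medians of the lower and upper halves, which is one common convention.

```python
from math import sqrt
from statistics import median, stdev

def sample_sd(values):
    """Sample standard deviation: sqrt of summed squared deviations over (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / (n - 1))

# Hypothetical data (made up).
data = [2, 4, 4, 4, 5, 5, 7, 9]
ordered = sorted(data)
half = len(ordered) // 2

data_range = ordered[-1] - ordered[0]            # max - min
q1 = median(ordered[:half])                      # median of the lower half
q3 = median(ordered[half + len(ordered) % 2:])   # median of the upper half
iqr = q3 - q1                                    # spread of the middle 50%

print(data_range, iqr, round(sample_sd(data), 3))
assert abs(sample_sd(data) - stdev(data)) < 1e-9
```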

Comparing Variability in Graphs


1. Can’t calculate exact standard deviation without data values, but can compare
graphs by asking “how far from the mean (aka how big is the S) is the data?”

2. Comparing two symmetric graphs

a. The mean and median are going to be the same in the centre, but graph A will
have a larger spread

★ Important Reminders ★
1. Bc the mean can be affected by skew and outliers, the S can also be
affected since the S is measured from the mean. This means the S can be
bigger as well, thus it’s better to use the median and IQR for those graphs
(they only depend on the middle of the data).

2. Also, if a mean is higher than a median, then the graph is likely skewed right
(and likely skewed left if lower)

3. If the mean and median are similar, then the data is likely roughly symmetrical

4. Five-number Summary: Min, Q1, Med, Q3, Max

Finding Outliers in a Data Set


1. Two diff ways

2. Quartile method (official): used if the quartiles (Q1 & Q3) are given

a. Find the upper and lower fence: being higher than the upper fence or lower
than the lower fence means that value is an outlier

b. Finding fences: lower fence = Q1 − 1.5 × IQR, upper fence = Q3 + 1.5 × IQR


3. Mean & standard deviation method: since most data is within 1 standard
deviation of the mean, then if a value is 2 (or even 3) standard deviations away
from the mean, then it can be considered an outlier

a. Used when quartiles aren’t given/clear

Incorporated Example

4. A negative lower fence means no low outliers when the data can’t be negative
(high outliers are still possible)
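Both outlier methods can be sketched in Python. This assumes the standard 1.5 × IQR rule for the fences and a cutoff of 2 standard deviations for the second method; the function names, the data and the given quartiles are made up for illustration.

```python
from statistics import mean, stdev

def fence_outliers(values, q1, q3):
    """Quartile method: flag values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

def sd_outliers(values, k=2):
    """Mean & SD method: flag values more than k standard deviations from the mean."""
    x_bar, s = mean(values), stdev(values)
    return [v for v in values if abs(v - x_bar) > k * s]

# Hypothetical data (made up), with quartiles given as Q1 = 4 and Q3 = 6.
data = [2, 4, 4, 4, 5, 5, 7, 30]
print(fence_outliers(data, q1=4, q3=6))
print(sd_outliers(data))
```

Note that the outlier itself inflates the mean and standard deviation, so the second method can miss outliers that the fences catch.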

Transforming Data Impacts


1. Adding or subtracting a constant value from all values (e.g. test curve): what
happens to the summary statistics?

a. Measures of centre (mean & median): will be moved by the same amount



b. Measures of position (percentiles & Q1+Q3): will go up or down by the
same amount

c. Measures of variability (range, IQR & S): will not be affected at all

i. Side article on test curve types: https://www.thoughtco.com/grading-on-a-curve-3212063

2. Multiplying every value by a constant (e.g. converting units): every statistic
will be multiplied by the same (positive) constant

3. TL;DR: Addition and subtraction don’t affect range, IQR or S, but
multiplication will
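A quick Python check of these transformation rules on made-up test scores. The helper name `summarize` and the data are assumptions for illustration only.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Centre and spread statistics for a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "sd": stdev(values),
    }

# Hypothetical test scores (made up).
scores = [60, 70, 75, 80, 95]
curved = [x + 5 for x in scores]     # add a constant (e.g. a 5-point curve)
doubled = [x * 2 for x in scores]    # multiply by a constant

original = summarize(scores)
shifted = summarize(curved)
scaled = summarize(doubled)

print(original)
print(shifted)   # centre moves by 5, spread unchanged
print(scaled)    # centre and spread both doubled
```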

The 4-Step Process in Statistics


1. Remember that statistics is used for interpretation, which is then used to
answer questions

2. An easy way to do so is to remember the 4-step process

a. Step 1 (State): Ask a question that can be answered with sample data.

b. Step 2 (Plan): Determine what information is needed.



c. Step 3 (Do): Collect sample data that is representative of the population.

d. Step 4 (Conclude): Summarize, interpret and analyze the sample data.
