U1 Exploring One-Variable Data

Uploaded by

jamesqiaolei


U1: Exploring One-Variable Data

1.1 Intro | What is Statistics?
1. Put simply, it’s collecting data from a sample or population and drawing
conclusions from it.

2. More specifically: statistics is the science/branch of applied mathematics
behind developing and studying methods of collecting, analyzing &
interpreting data.

3. Data is often taken from samples. These samples may represent a portion of a
larger group or a limited number of instances of a general phenomenon. We
can use samples to draw conclusions about larger groups & general events.

4. Probability also plays a big role in statistics. For example, any type of data
collection is subject to variation. If the same measurement were repeated,
then the answer would probably change. Statisticians attempt to understand
and control the sources of variation in any situation.

1.2 | Variables
1. A variable is a characteristic that changes from one individual to another

a. Individual: can be a person, place, thing…

2. Categorical variables: these variables take on values that are category names
or group labels; responses can be separated into diff categories

a. E.g. age groups, college majors, zip codes (an exception: they look like
numbers, but they are labels and can't be meaningfully averaged)

b. →Answers to these questions are words


3. Quantitative variables: these variables take on numerical values for a
measured or counted quantity

a. E.g. height, age, # of blue skittles in a packet

b. →Answers can be measured or counted, with a unit afterwards & can be
averaged

1.3 | Categorical Data: Frequency Tables

1. Frequency tables give the number of cases falling into each category, AKA a
list of diff categories & counts

a. Frequency = counts

2. Relative frequency table: gives the proportion of cases falling into each
category

a. Proportion: part divided by whole. Same as decimal form of a percentage.

b. Puts frequencies relative to the whole

3. Vocab: percentage, relative frequency & rate = a proportion written in different
ways

a. E.g. 50%, 1/2, .50

4. Distribution: the distribution of a variable tells us what values the variable
takes and the frequency of those values
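
A quick sketch of a frequency table and a relative frequency table in Python. The list of majors is made-up data, and the variable names are just for illustration.

```python
from collections import Counter

# Hypothetical responses to "What is your college major?" (made-up data).
majors = ["Bio", "Math", "Bio", "Art", "Math", "Bio", "Art", "Bio"]

# Frequency table: category -> count.
frequency = Counter(majors)

# Relative frequency table: category -> proportion (part divided by whole).
total = sum(frequency.values())
relative_frequency = {category: count / total for category, count in frequency.items()}

# Print one row per category: name, count, proportion.
for category in sorted(frequency):
    print(f"{category:<5} {frequency[category]:>2}  {relative_frequency[category]:.2f}")
```

The proportions always add up to 1, which is a handy check on a relative frequency table.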

1.4 | Representing Categorical Variables with Graphs
Bar Graphs (barplots)

1. Only used for categorical variables, showing frequency or relative frequency

a. y-axis: frequency

b. x-axis: categories

2. Relative bar graph: represents relative frequencies (proportions) on the y-axis


Mosaic Plots

Pie Charts

1. Also only used for categorical variables, and only with proportions

Note: it’s better to use proportions when comparing data sets from different
sample sizes (and kinda in general)

+The categories in qual graphs don’t need to be in any specific order

Two-Way/Contingency Tables

1. A frequency table used for organizing a data set with two categorical
variables (e.g. gender & preferred drink), with total counts for each variable

a. Variable at top, values for each variable below

2. Used for understanding a relationship between categorical variables

3. A table with n rows and m columns (not counting totals) is an n × m
table. The row and column sums are the marginal totals


4. Conditional distributions: going down a certain column or across a certain row
to keep one variable constant (thus conditional). Allows responses to be
compared within one group

a. E.g. looking only at the answers for a certain response group and
analyzing the diff (e.g. girls who drink coffee vs. girls who drink tea
answer frequency)

b. Can be used to find conditional distribution probability

c. Formula: pick a fixed row (or column) and divide each value in it by the
total for that same row (or column)

5. Marginal distributions: looking at only the totals for more contextual
information.

a. Formula: divide the row or column totals by the overall total
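
A small sketch of marginal and conditional distributions, using a made-up 2 × 2 table of gender by preferred drink. All counts and names here are hypothetical.

```python
# Hypothetical two-way table of counts: rows are gender, columns are preferred drink.
table = {
    "girl": {"coffee": 12, "tea": 18},
    "boy": {"coffee": 15, "tea": 5},
}
columns = ["coffee", "tea"]

# Marginal totals: sums across each row and down each column.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
column_totals = {col: sum(table[row][col] for row in table) for col in columns}
grand_total = sum(row_totals.values())

# Marginal distribution of drink: column totals divided by the overall total.
marginal_drink = {col: column_totals[col] / grand_total for col in columns}

# Conditional distribution of drink given "girl":
# fix the "girl" row and divide each cell by that row's total.
conditional_girl = {col: table["girl"][col] / row_totals["girl"] for col in columns}

print(marginal_drink)
print(conditional_girl)
```

Here the marginal distribution says 54% of everyone prefers coffee, while the conditional distribution says only 40% of girls do. Comparing the two is how a relationship between the variables shows up.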

1.5 | Representing a Quantitative Variable with Graphs
1. A lot more graphs to use

2. Two types of quan. variables

a. Discrete quantitative variable: variables can take on a countable number
of values. Can be finite or countably infinite (but realistic).

i. E.g. # of wins, # of skittles in a packet

b. Continuous quantitative variable: variables can take on infinitely many
values that cannot be counted, but measured. Like on a number line,
possible values are endless; there is always another value in-between
any two values, no matter how small the interval

i. E.g. concentration of salt in a water sample, height, weight

ii. Because it’s measured, the decimal can go on forever

Graph Rules

1. Quan graphs must start with an ordered number line that represents all
possible values a variable can take. The starting & ending number can be
whatever, as long as it covers all the data

2. The graphs must also represent the frequency of each value

Dot Plots

1. Best used for discrete variables (whole numbers)

2. Don’t necessarily need a y-axis

Stemplots

1. Best with discrete variables

2. Each leaf represents one data value, so repeated digits show how often a value occurs

3. Must have a key with units


4. Constructing a stemplot: take a list of numbers, put the leading digit(s) of
each number as its stem and the final digit as its leaf. If the number is 7,
then put 0 as a stem and the 7 as a leaf.

5. This type of graph is useful since it shows the shape of the distribution
(e.g. a bell curve), and we can visually see if there are any outliers

a. E.g. with the bird graph above, we can see that the average number of
birds is in the 10-20 range

6. Split stem & leaf plot: used for big groups of data. Each stem is repeated
(e.g. once for leaves 0-4 and once for leaves 5-9), which spreads out data
that would otherwise pile up on a few stems (e.g. there are 30 diff values
and they all lie within the 30-40 range)
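
The construction steps can be sketched in Python. This is a minimal version that assumes non-negative whole numbers, with the tens as stems and the ones digit as leaves; the bird counts are made up and are not the data from the original graph.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows for non-negative whole numbers (stem = tens, leaf = ones)."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # 7 -> stem 0, leaf 7; 23 -> stem 2, leaf 3
        leaves[stem].append(leaf)
    rows = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | {' '.join(str(leaf) for leaf in leaves[stem])}"
        rows.append(row.rstrip())
    return rows

bird_counts = [7, 12, 15, 15, 18, 21, 23, 41]  # hypothetical data
for row in stemplot(bird_counts):
    print(row)
print("Key: 1 | 2 = 12 birds")
```

The empty stem 3 shows up as a gap, and the lone 41 stands out as a possible outlier.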

Back-to-Back Stemplots

1. Also needs a key

2. Discrete

3. Used to compare two diff sets of data


4. Same perks as a regular stem & leaf plot

5. IMPORTANT: The stem column is ALWAYS the leading digit and sits in the middle. So
for the data set on the left, you read the leaves right→left, outward from the stem

Histograms

1. Best for continuous variables

2. x-axis: creates number line of all values and bins for them to fall into

3. y-axis: represents frequency of data values that fall into each interval

4. No spaces—a gap only occurs when there are no values in a bin


5. Doesn’t give specific values, just like bar graphs

6. There can also be relative frequency histograms (proportions on the y-axis)
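
A rough sketch of how values get sorted into histogram bins. The heights, the bin width of 10 and the helper name `histogram_counts` are all assumptions for illustration; each bin includes its left edge but not its right edge.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:  # ignore values outside the number line
            counts[index] += 1
    return counts

heights_cm = [151.2, 158.9, 160.0, 163.4, 167.8, 169.9, 188.0, 189.5]  # hypothetical
counts = histogram_counts(heights_cm, start=150, width=10, n_bins=4)

# Relative frequency histogram: same bins, proportions instead of counts.
relative = [count / len(heights_cm) for count in counts]

print(counts)
print(relative)
```

The 170-180 bin has a count of 0, which is exactly the kind of gap that would appear in the drawn histogram.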

Boxplots

Sample Questions on Graph Interpretation

1. Finding the proportion of values beyond a certain point


1.6 | Describing the Distribution of a Quantitative Variable
1. Three key features to examine:

a. Shape

b. Centre

c. Spread (variability)

2. Can also have outliers

Shape
1. Skewed right (positive skew): right tail is longer than left

a. Most of the data is on the left; the tail of less frequent values stretches to the right

2. Skewed left (negative skew): left tail is longer than right

3. Symmetric: left & right halves are symmetrical

a. Can peak at the centre, but can also cave-in at the centre

4. Peaks: unimodal, bimodal & uniform (no noticeable peak)

a. Peaks: most frequent

b. Usually used with symmetric descriptors

Centre
1. One value that can describe all of the data, i.e. a typical value

a. Usually measured with the mean or median; on a graph, it is often near where the data piles up

Spread/Variability
1. Discuss in simple terms the range the data values fall in and where the
majority of the data falls, plus the variability (don’t forget units!)


2. Compare two diff distributions of the same variable

3. Interpretation: A had more spread, B had less variability, so B's values are more
consistent

Outliers & Gaps

1. Outliers: Data points that are unusually large/small relative to the rest of the
data. A lack of outliers is also noticeable/important

2. Gaps: a region of a distribution between two data values where there are no
observed data. Can mean different things.

*Remember that every aspect of a graph is of note


1.7 | Summary Statistics for a Quantitative Variable
1. Quan variables can be represented with a graph that can be analyzed &
summarized with words.

2. However, data can also be analyzed & summarized with numbers.

3. A number that describes or summarizes a set of quan data from a sample is a
statistic.

4. A number that describes or summarizes a set of quan data from a population
is a parameter.

5. Both do the same job; the only difference is whether the data comes from a
sample or a population

Statistics
1. Are used to summarize:

a. Centre of data

b. Other positions in the data

c. The spread of data

d. Identify outliers in data

2. No single statistic tells us the shape of the data, but statistics can hint at it

Summarizing the Centre of a Data Set

1. Mean: the sum of all data values divided by the number of data values. The symbol
for the mean is the “x-bar” (x̄).

a. The mean is easily impacted by outliers.

b. Will move towards tail with skewed data, even when there’s more data on
the other side. A few extreme numbers will impact it greatly.


c. Sits in the middle with symmetric data (e.g. bell curve)

2. Median: dead-centre of a dataset.

a. AKA the second quartile.

b. Not impacted by outliers or extreme values, since it only represents the midpoint
of a dataset; however, the data must be ordered from smallest to greatest.

c. Median POSITION formula: (n+1)/2

i. E.g. if the answer is 5, then the median is located at the fifth position
of your data set (in increasing order.) It does NOT mean that the
median value is 5.

ii. If the answer contains a decimal, then the median is in-between the two closest
positions. E.g. 3.5 means the median is halfway between the 3rd and 4th values
(average them).
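
The mean and the median position formula can be sketched in Python. The `scores` list is made up, and the function names are just for illustration (Python's built-in `statistics` module has its own `mean` and `median`).

```python
def mean(values):
    """Sum of all data values divided by the number of values."""
    return sum(values) / len(values)

def median(values):
    """Find the median using the position formula (n + 1) / 2."""
    ordered = sorted(values)              # data must be in increasing order
    position = (len(ordered) + 1) / 2     # 1-based position, e.g. 3 or 3.5
    if position.is_integer():
        return ordered[int(position) - 1]
    # Position like 3.5: average the 3rd and 4th values.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

scores = [4, 8, 1, 9, 3, 100]  # hypothetical data with one extreme value
print(mean(scores))    # pulled up by the 100
print(median(scores))  # barely notices the 100
```

With the 100 included, the mean lands far above most of the data while the median stays at 6, which is the "mean is easily impacted by outliers" point in action.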

🍘 CALC TUTORIAL FOR FINDING THE MEAN & MEDIAN:

1. STAT→1

2. Input list of numbers

3. STAT→CALC→1

4. List: L1, L2… (2nd # to choose)

5. FreqList is blank

6. Calculate

7. Mean is first, scroll down for “Med”

Finding the Mean and Median in a Graph

1. Mean: you cannot, since you don’t have exact numbers.

a. HOWEVER, we can use graph analysis techniques to gauge where it
might be


2. Median: you still can’t determine the exact value.

a. HOWEVER, you can determine its LOCATION using the formula

c. After you get an answer, add up the frequency values until you get to a
“bin” that includes the median location.

d. For this graph, we now know that the median is located in the $13-15 bin.

3. Tip: now that you know the position of the median, you can gauge the location
of the mean. In this case, because the graph skews right, the mean will be
higher than the median.

4. For symmetric & bimodal graphs, both the mean and median will fall in the
centre

5. For skewed graphs, the median still splits the data in half (50% of data below,
50% above), while the mean gets pulled towards the tail
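
A sketch of the "add up the frequencies until you reach the median location" trick for a histogram. The bin labels and frequencies are invented, not read from the original graph.

```python
def median_bin(bin_labels, frequencies):
    """Return the label of the bin that contains the median position (n + 1) / 2."""
    n = sum(frequencies)
    position = (n + 1) / 2
    running_total = 0
    for label, frequency in zip(bin_labels, frequencies):
        running_total += frequency  # cumulative count up to the end of this bin
        if running_total >= position:
            return label
    return None  # only reached when there is no data

labels = ["0-9", "10-19", "20-29", "30-39"]  # hypothetical bins
freqs = [3, 8, 5, 3]                         # hypothetical bar heights

print(median_bin(labels, freqs))
```

Here n = 19, so the median is the 10th value. The first bin only gets us to 3 values, the second gets us to 11, so the median sits in the 10-19 bin. One caveat: when the position is something like 10.5 and the 10th and 11th values fall in different bins, the median lies between two bins and this sketch returns the upper one.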


Measuring Other Positions with Numbers
1. Includes

a. Percentile

b. First quartile, second quartile & third quartile

c. Minimum & maximum

2. Percentile: The pth percentile is interpreted as the value that has p% of the
data at or below it.

a. E.g. being at the 85th percentile for SAT scores. This means that 85% of
students scored at or below your score, and that 15% scored better than you.

b. There is another name for being at or above.

c. The median will always be the 50th percentile, with 50% of data below it.

d. To find a data value’s percentile (and quartile): count how many values
are at or below the value of interest. Then divide by the total number of
values.


3. Quartiles: names for distinguishing specific percentiles

a. First (Q1): 25th percentile

b. Second (Q2): 50th percentile

c. Third (Q3): 75th percentile

d. There is 50% of data between Q1 & Q3

e. The Q1 & Q3 are median points for the lower & upper half of a dataset,
respectively

f. Altogether, the quartiles divide the dataset into four blocks, each holding about
25% of the data

4. The same steps apply on calc to find percentiles & quartiles in data.

a. Five-number summary: min, Q1, med, Q3, max
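
A sketch of finding a value's percentile and the five-number summary in Python. It assumes the "quartiles are medians of the lower and upper halves" approach, leaving out the middle value when n is odd; other software may use slightly different quartile rules. The mpg list is made up.

```python
from statistics import median

def percentile_of(value, data):
    """Percent of data values at or below `value`."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), with Q1/Q3 as medians of the two halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + n % 2:]  # skip the middle value when n is odd
    return (ordered[0], median(lower_half), median(ordered), median(upper_half), ordered[-1])

mpg = [22, 25, 27, 30, 30, 31, 33, 35, 38, 41]  # hypothetical fuel economy data
print(percentile_of(30, mpg))    # 5 of the 10 values are at or below 30
print(five_number_summary(mpg))
```

For this list, 30 mpg is at the 50th percentile, and the summary comes out as min 22, Q1 27, median 30.5, Q3 35, max 41.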

Finding Percentiles & Quartiles in Graphs

1. Still, you can’t find exact values, but you can locate them

2. Q1 is between the 25th & 26th value, counting up from the bottom

a. TIP: Think of quartiles as medians

3. Count down from the top to the 25th & 26th value for Q3


Dot Plot Example

4. Means that 44% of cars had fuel economy at or below 30mpg


Measuring the Variability (Spread) of Data
1. A single number that tells you how much the data varies

a. Range

b. Interquartile Range

c. Standard Deviation

2. Range: the diff between max and min value.

a. Bigger range means data is more spread out

b. Smaller range means data is less spread out

c. Easily influenced by outliers, so may be misleading. Thus it is not used
often to describe spread of data

3. IQR: measure of the spread of middle 50% of data only. Aka the difference
between first and third quartiles, the spread of Q1-Q3.

a. Not influenced by outliers (good)

b. Finding IQR: Q3-Q1

c. Smaller IQR means middle 50% of data is clustered together, not spread
out

d. Larger IQR means the middle 50% is very spread out

4. Standard Deviation (s): the typical distance a data value is from the mean

a. Small standard deviation means most data is very close to the mean, not
spread out

b. Large standard deviation means that most data is far from the mean, thus
more spread out

c. For roughly bell-shaped data, most data is within one standard deviation of
the mean (e.g. if s=2, most data is within 2 above or below the mean)

d. E.g. s=4 and x̄=10. Most data is from 6-14. (Add/subtract s from x̄)

e. Not all! Doesn’t include large outliers (good)


f. How to find: subtract the mean from each value, square each difference, add up
the squares, multiply by 1/(n-1), take square root of everything, i.e.
s = √( Σ(x − x̄)² / (n − 1) )… never have to actually calculate this by hand

g. Mean 🤝 standard deviation (always together)

h. Sx on calc is standard deviation
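
A sketch of the range and the sample standard deviation in Python, written out step by step with the n − 1 divisor (the same quantity as Sx on the calc). The data list is made up.

```python
from math import sqrt

def data_range(values):
    """Difference between the max and min value."""
    return max(values) - min(values)

def sample_standard_deviation(values):
    """Typical distance from the mean, using the n - 1 divisor."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / (n - 1))

data = [6, 8, 10, 12, 14]  # hypothetical data, mean is 10
s = sample_standard_deviation(data)
x_bar = sum(data) / len(data)

print(data_range(data))
print(round(s, 3))
print(x_bar - s, x_bar + s)  # the "within one s of the mean" interval
```

For this list the squared deviations add up to 40, so s = √(40 / 4) = √10, a bit over 3. Python's built-in `statistics.stdev` gives the same number.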

Comparing Variability in Graphs

1. Can’t calculate exact standard deviation without data values, but can compare
graphs by asking “how far from the mean (aka how big is the S) is the data?”

2. Comparing two symmetric graphs

a. The mean and median are going to be the same, in the centre, but graph A will
have a larger spread

★ Important Reminders ★
1. Bc the mean can be affected by skew and outliers, s can also be
affected since s is built around the mean. This means s can be
bigger as well, thus it’s better to use the median and IQR for those graphs
(they only depend on the middle of the data.)


2. Also, if a mean is higher than a median, then the graph is probably skewed right
(and vice versa if lower)

3. If the mean and median are similar, then the data is roughly symmetrical

4. Five-number Summary: Min, Q1, Med, Q3, Max

Finding Outliers in a Data Set

1. Two diff ways

2. Quartile method (official): used if the quartiles (Q1 & Q3) are given

a. Find the upper and lower fence—being higher than upper fence and lower
than lower fence means that’s an outlier

b. Finding fences: lower fence = Q1 − 1.5 × IQR, upper fence = Q3 + 1.5 × IQR


3. Mean & standard deviation method: since most data is within 1 standard
deviation of the mean, if a value is more than 2 (or even 3) standard deviations
away from the mean, then it can be considered an outlier

a. Used when quartiles aren’t given/clear
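
Both outlier methods can be sketched in Python. This assumes the common 1.5 × IQR rule for the fences and a cutoff of 2 standard deviations for the second method; the data and the quartile values passed in are made up.

```python
from statistics import mean, stdev

def fences(q1, q3):
    """Lower and upper fences from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers_by_fences(values, q1, q3):
    """Values below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in values if x < lower or x > upper]

def outliers_by_sd(values, k=2):
    """Values more than k standard deviations away from the mean."""
    x_bar, s = mean(values), stdev(values)
    return [x for x in values if abs(x - x_bar) > k * s]

data = [1, 10, 12, 13, 15, 16, 18, 40]  # hypothetical data
print(fences(11, 17))                    # Q1 = 11 and Q3 = 17 for this list
print(outliers_by_fences(data, 11, 17))
print(outliers_by_sd(data))
```

For this list the fences are 2 and 26, so the quartile method flags both 1 and 40, while the 2-standard-deviation method only flags 40. The extreme values inflate s, which makes the second method less sensitive.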

Incorporated Example

4. A negative lower fence means there are no low outliers when the data can't be negative

Transforming Data Impacts

1. Adding or subtracting a constant value from all values (e.g. test curve): what
happens to the summary statistics?

a. Measures of centre (mean & median): will be moved by the same amount


b. Measures of position (percentiles & Q1+Q3): will go up or down by the
same amount

c. Measures of variability (range, IQR & S): will not be affected at all

i. Side article on test curve types: https://www.thoughtco.com/grading-on-a-curve-3212063

2. Multiplying every value by a positive constant (e.g. converting units): every
statistic will be multiplied by the same constant

3. TL;DR: Addition and subtraction don’t affect range, IQR or S, but
multiplication will
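
A quick check of these rules in Python, using made-up test scores, a 5-point curve and a doubling.

```python
from statistics import mean, median, stdev

def summary(values):
    """Return (mean, median, range, s) for a list of values."""
    return mean(values), median(values), max(values) - min(values), stdev(values)

scores = [60, 70, 75, 80, 95]      # hypothetical test scores
curved = [x + 5 for x in scores]   # add a constant (test curve)
scaled = [x * 2 for x in scores]   # multiply by a constant (like a unit change)

print(summary(scores))
print(summary(curved))  # mean & median go up by 5; range and s unchanged
print(summary(scaled))  # mean, median, range and s all double
```

Adding 5 slides the whole distribution over without stretching it, so only the centre moves. Multiplying by 2 stretches it, so the spread changes too.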

The 4-Step Process in Statistics

1. Remember that statistics is used for interpretation, which is then used to
answer questions

2. An easy way to do so is to remember the 4-step process

a. Step 1 (State): Ask a question that can be answered with sample data.

b. Step 2 (Plan): Determine what information is needed.


c. Step 3 (Do): Collect sample data that is representative of the population.

d. Step 4 (Conclude): Summarize, interpret and analyze the sample data.
