Class Notes 1 KEY Math 1114

This document provides an introduction to statistics, emphasizing the importance of critical thinking and understanding data versus information. It covers key concepts such as population, sample, census, and the significance of statistical analysis, including descriptive and inferential statistics. The document also highlights potential pitfalls in data analysis and the importance of context, source, and sampling methods in drawing valid conclusions.

Uploaded by

peterorangi8

KEY

Math 1114 Class Notes Chapter 1. Introduction to Statistics Vassilkov

1.1 STATISTICAL and CRITICAL THINKING

“All information looks like noise until you break the code.”
- Hiro in Stephenson's Snow Crash.

Important Definitions (to know): data vs. information, statistics, population, sample, census, voluntary response sample.
Important Concepts (to understand): context and source of data, sampling method, analyzing data; practical significance
vs. statistical significance; distinguishing between statistical conclusions that are valid and those that are flawed.

The main goal of this section (and this course) is to help you develop a statistical way of thinking, which involves critical
thinking and the ability to make sense of results.

Are you ready? Let’s begin!

“Data” is the plural of “datum”, a single piece of information. Data are collections of observations, such as genders,
measurements, or survey responses. Raw data (also called basic data) are data that have not been processed or displayed in
any meaningful form.

The terms 'data' and 'information' are often used interchangeably; however, the terms have distinct meanings:
– Data are facts, transactions and so on, which have been recorded. They are the raw input materials from which
information is produced.
– Information is data that have been processed in such a way as to be useful to the recipient. In other words, it is
data with a purpose. Information is what we think is important.

Example 1: Here is a raw data set for you. It represents the number of words per sentence from 12 pages randomly
selected from each of three books: “The Bear and the Dragon” by Tom Clancy, “Harry Potter and the
Sorcerer’s Stone” by J. K. Rowling, and “War and Peace” by Leo Tolstoy.

Clancy    Rowling   Tolstoy
15.0      15.7      20.6
 9.8       9.0      28.0
 8.1      16.3      12.0
13.5      14.5      11.5
24.0       9.7      17.4
 9.8       7.4      19.7
33.0      14.0      20.3
 9.4      16.1      17.8
 8.3      13.9      22.1
11.3      12.5      31.4
11.4      17.2      18.3
12.4      11.5      11.7

Question: Look at this data. Is there anything you notice?

Answer: You may have noticed that Tolstoy’s sentences contain more words than the sentences of the other two
authors. (Plus you have probably noticed that if you look at these numbers long enough, you become very sleepy.)
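
One way to turn the raw numbers of Example 1 into information is to summarize each column with its average. The following is a minimal sketch rather than part of the original notes; the variable names are illustrative, and the mean is only one of several summaries that could be used.

```python
from statistics import mean

# Words per sentence, copied from the table in Example 1.
words_per_sentence = {
    "Clancy":  [15.0, 9.8, 8.1, 13.5, 24.0, 9.8, 33.0, 9.4, 8.3, 11.3, 11.4, 12.4],
    "Rowling": [15.7, 9.0, 16.3, 14.5, 9.7, 7.4, 14.0, 16.1, 13.9, 12.5, 17.2, 11.5],
    "Tolstoy": [20.6, 28.0, 12.0, 11.5, 17.4, 19.7, 20.3, 17.8, 22.1, 31.4, 18.3, 11.7],
}

# Summarize each author's column with its mean.
author_means = {author: mean(values) for author, values in words_per_sentence.items()}

for author, avg in author_means.items():
    print(f"{author}: {avg:.2f} words per sentence on average")

# The author whose sampled pages have the largest average sentence length.
longest = max(author_means, key=author_means.get)
print("Longest sentences on average:", longest)
```

A single number per author is much easier to compare than 36 raw values, which is the step from data to information described above.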

Follow this link to find nineteen awesome raw data sets, and see what information you can get from any of them:
https://www.springboard.com/blog/free-public-data-sets-data-science-project/

Collecting data is the very first step in the Pyramid of Understanding:

Statistics is a science that helps to answer two questions:

1) How can we extract meaning from a collection of data? (That is, how do we organize, describe, and summarize when we have ALL the
data?) - DESCRIPTIVE statistics – first half of this course
2) How do we draw conclusions about the whole population when we only have SOME of the data? - INFERENTIAL statistics –
second half of this course

Population - is the entire group of individuals or objects that we want information about. (NOT just the ones we reach!)
Sample - is the part of the population that is actually observed /surveyed. (The ones we reach.)

Example 2: USA Today conducted a poll of 800 divorced people who were asked if they wanted to marry again. It was
reported that “overall, 58% of divorced people say they don’t want to get married again”. Give the
population and the sample in this survey:

Population: All divorced people (in the U.S.)

Sample: The 800 divorced people who were surveyed.

Census - collection of data from every member of a population.


Here is the link to the U.S. Census Bureau (great collection of data!): http://www.census.gov/

All rights reserved. This material may not be reproduced, displayed, modified or distributed without the express prior written
permission of the copyright holder. For permission, contact [email protected].

Statistics is much more than plugging data into a formula. It involves consideration of the following factors.

PREPARE

1. Context of Data (determines the type of statistical analysis that should be used)
• What do the values represent?
• Why were they collected?

The “Boats vs. Manatees” Example:

Pleasure Boats (tens of thousands)   99  99  97  95  90  90  87  90  90
Manatee Fatalities                   92  73  90  97  83  88  81  73  68

Context: The table includes the number of registered pleasure boats in Florida (tens of thousands) and the number
of manatee fatalities from encounters with boats in Florida for each of several recent years.
Determine whether there is a relationship between numbers of boats and numbers of manatee deaths from boats.
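
One possible way to start looking for a relationship is to compute a correlation coefficient for the two rows. The sketch below is an illustration, not the method prescribed by these notes: it uses the values exactly as they appear in the table above, and the helper name `pearson_r` is made up for this example.

```python
from math import sqrt

# Values copied from the "Boats vs. Manatees" table, in the order given.
boats = [99, 99, 97, 95, 90, 90, 87, 90, 90]      # pleasure boats, tens of thousands
manatees = [92, 73, 90, 97, 83, 88, 81, 73, 68]   # manatee fatalities

def pearson_r(xs, ys):
    """Linear correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(boats, manatees)
print(f"r = {r:.2f}")  # a value near +1 or -1 suggests a strong linear relationship
```

A coefficient by itself does not settle the question; context, source, and sampling method still have to be considered.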

2. Source of data
• Is the source objective or biased? Is there something to gain or lose by distorting results?
Bias - a systematic difference between the results obtained by the sample and the actual truth about the whole
population. Example of biased source: A car insurance company advertises that their new customers saved an
average of $350 by switching to this company’s policy.

“Boats vs. Manatees” Example continued: The data in the table are from the Florida Department of Highway Safety
and Motor Vehicles and the Florida Marine Research Institute. The sources certainly appear to be reputable.

3. Sampling Method
• Does the method chosen greatly influence the validity of the conclusion?

“Boats vs. Manatees” Example continued: The data were obtained from official government records known to be
reliable. The sampling method appears to be sound.

• Voluntary response samples (the respondents select themselves) often have bias, because those with a special
interest are more likely to participate. The results are usually flawed and not representative. Do NOT use
voluntary response samples in scientific work!

The following types of polls are common examples of voluntary response samples:
o Internet polls, in which people online can decide whether to respond
o Mail-in polls, in which people can decide whether to reply
o Telephone call-in polls, in which newspaper, radio, or television announcements ask that you voluntarily
call a special number to register your opinion

Example:

The advice columnist Ann Landers once asked her readers, "If you had it to do over
again, would you have children?" A few weeks later her column was headlined "70% OF
PARENTS SAY KIDS NOT WORTH IT." Indeed 70% of the nearly 10,000 parents who
wrote in said they would not have children if they could make the choice again. The
people who responded felt strongly enough to take the trouble to write Ann Landers.
Their letters showed that many of them were angry at their children. These people did
not fairly represent all parents. It is not surprising that a statistically designed opinion
poll on the same issue a few months later found that 91% of parents would have
children again. Ann Landers announced a 70% "No" result when the truth about parents
was close to 90% "Yes."
(You may find more details at
http://userpages.umbc.edu/~nmiller/POLI300/stat353annlanders.pdf)
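
The arithmetic behind this kind of distortion can be sketched with a small calculation. The 91% figure comes from the example above; the two response rates are purely hypothetical assumptions chosen for illustration, not numbers reported for the Ann Landers column.

```python
# Share of all parents who would have children again (from the designed poll above).
share_yes = 0.91
share_no = 1 - share_yes

# ASSUMED response rates (hypothetical): unhappy parents are far more
# motivated to write in than satisfied parents.
response_rate_yes = 0.001
response_rate_no = 0.024

def share_no_among_respondents(share_no, rate_no, share_yes, rate_yes):
    """Expected fraction of 'No' answers among the people who choose to respond."""
    responders_no = share_no * rate_no
    responders_yes = share_yes * rate_yes
    return responders_no / (responders_no + responders_yes)

observed_no = share_no_among_respondents(
    share_no, response_rate_no, share_yes, response_rate_yes
)
print(f"'No' share in the whole population: {share_no:.0%}")
print(f"'No' share among those who write in: {observed_no:.0%}")
```

Under these assumed rates, a small minority in the population becomes a large majority among the self-selected respondents, which is the mechanism the example describes.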


ANALYZE

1. Graph the Data (we’ll get to it in Chapter 2)
2. Explore the Data (we’ll get to it in Chapter 3)
3. Apply Statistical Methods (using technology to obtain results)

Note: A good statistical analysis does not require strong computational skills. It requires using common sense and paying
careful attention to sound statistical methods.

CONCLUDE

Statistical Significance is achieved in a study if the likelihood of an event occurring by chance is 5% or less.

Example: Getting 98 girls in 100 random births is statistically significant because such an extreme outcome is not likely to
result from random chance. Getting 52 girls in 100 births is not statistically significant because that event could easily occur
with random chance.
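
The two birth examples can be checked with a short calculation. This is a sketch under stated assumptions: each birth is treated as independent with a 50% chance of a girl, "such an extreme outcome" is read as "at least that many girls", and the function name `prob_at_least` is illustrative.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials (binomial model)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_98 = prob_at_least(98, 100)   # 98 or more girls in 100 births
p_52 = prob_at_least(52, 100)   # 52 or more girls in 100 births

for label, prob in [("98+ girls", p_98), ("52+ girls", p_52)]:
    verdict = "statistically significant" if prob <= 0.05 else "not statistically significant"
    print(f"{label}: probability by chance = {prob:.3g} -> {verdict}")
```

The first probability falls far below the 5% threshold, while the second is well above it, matching the conclusions in the example.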

Practical Significance
It is possible that some treatment or finding is effective, but common sense might suggest that the treatment or finding
does not make enough of a difference to justify its use or to be practical.

Example: The data may lead to the conclusion that a gasoline additive improves mileage – but if the estimated improvement
is only 0.1 mpg, common sense dictates that using the additive is not worth the time or the money. There is statistical
significance but not practical significance.

CRITICAL THINKING. Analyzing Data: Potential Pitfalls

"There are three kinds of lies: lies, damned lies, and statistics."
- Mark Twain.

Statistics can very often be misleading, which usually happens for one of two reasons:
1. Evil intent on the part of dishonest people.
2. Unintentional errors.


We will look at the following common cases of misleading statistics:

1. Misleading Conclusions
Make statements that are clear to those without an understanding of statistics and its terminology. Avoid making
statements not justified by the statistical analysis.

2. Reported Results
When collecting data from people, it is better to take measurements yourself instead of asking subjects to report the
results.

3. Voluntary Response Sample
Discussed above.

4. Small Samples
Example: Basing a school suspension rate on a sample of only three students. Conclusions should not be based on
samples that are far too small.

Note: As a common rule of thumb, a sample size should be no less than 30. We will discuss this in later chapters.

5. Possibility of Lying
“Everybody lies.” – Doctor House
How many people would answer honestly the following questions?
• “Have you ever used illegal drugs?”
• “Do you favor a constitutional amendment that would outlaw most abortions?”
• “Have you had more than one sexual partner in the past 6 months?”
• “Have you ever driven a motor vehicle while intoxicated?”

6. Order of Questions
Example: Would you say that traffic contributes more or less to air pollution than industry?
- 45% blamed traffic, 27% blamed industry.

Would you say that industry contributes more or less to air pollution than traffic?
- 24% blamed traffic, 57% blamed industry.

7. Loaded Questions
Always pay attention to how the survey is worded. Here is a good article on some ways a survey can be
biased: https://surveytown.com/10-examples-of-biased-survey-questions/

8. Nonresponse and Missing Data: People who refuse to talk to pollsters have a view of the world around them that is
markedly different from that of people who will let poll-takers into their homes. The US Census also suffers from missing data
(for example, on homeless or low-income people).


Example 3: The Hawaii State Senate held hearings when it was considering a law requiring that motorcyclists wear
helmets. Some motorcyclists testified that they had been in crashes in which helmets would not have
been helpful. Which important group was unable to testify?

Answer: People who were killed in motorcycle crashes in which a helmet might have saved their lives could not testify.

9. Precise Numbers: Because a figure is precise, many people incorrectly assume that it is also accurate. Giving a result
to many decimal places sounds scientific, but the result is still only an estimate. Example: Men’s Health Magazine: “48.2% of men
think that women should offer to pay their share on a first date”. It means the same as “about half”.

10. Abuses of Percentages:

• A decrease of a certain percentage followed by an increase of the same percentage (applied to the
reduced value) does not return the number you started from. This happens because the reference value
changed!

Example 4: Decrease $200 by 10%, then increase the result by 10%.

Answer: Decrease $200 by 10%: 0.10 x $200 = $20
$200 – $20 = $180
Increase the result by 10%: 0.10 x $180 = $18
$180 + $18 = $198 – a different number than the original $200!
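
The same calculation can be written as a short function so that other starting values and percentages can be tried. This is a minimal sketch; the function name is illustrative, and exact fractions are used only to avoid floating-point rounding noise.

```python
from fractions import Fraction

def decrease_then_increase(value, percent):
    """Decrease value by percent, then increase the reduced value by the same percent."""
    rate = Fraction(percent, 100)
    reduced = value * (1 - rate)      # reference value is the original amount
    restored = reduced * (1 + rate)   # reference value is now the reduced amount
    return reduced, restored

reduced, restored = decrease_then_increase(200, 10)
print(f"After the decrease: ${float(reduced):.2f}")
print(f"After the increase: ${float(restored):.2f}")
```

In general the final amount is the original multiplied by (1 - p)(1 + p) = 1 - p², where p is the percentage written as a decimal, so it always ends up below the starting value.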

• Pie charts: A good indicator of something being wrong is when the percentages do not sum up to 100%, like
in the pie chart below. Here, people were asked which types of pie are their favorite, but they could name more
than one. The categories are thus not mutually exclusive, and the chart makes no sense.


• Percentages can exceed 100% only when the context is a change or comparison.

Example 5: Which of these statements use percentages correctly, and which use them incorrectly?
1. He was 180% sure that this was the correct answer.
2. The population on the island increased by 250% in 4 years.
3. 150% of ARCC Students had to pay a tuition increase.
4. With this exercise machine you can increase the amount of weight you are able to lift by 200%.
5. The new Honda Civic has 270% more trunk volume than the new Mazda Miata.
6. She lost 130% of body fat.
7. This glass of orange juice has 300% of the daily-recommended dosage of Vitamin C.

Answer: Incorrect use: 1, 3, 6 (in a strictly biological sense, #7 is probably incorrect as well).
Correct use: 2, 4, 5; mathematically, #7 is a correct use of percentages.

11. Correlation vs. Causation: Correlation does not imply causality!!!

Correlation: whenever event A happens, it is highly likely that event B happens.

Causation: event B happens BECAUSE OF event A.

Examples of illogically inferring causation from correlation:

1) B causes A (reverse causation)
Example: The faster windmills are observed to rotate, the more wind is observed.
Wrong conclusion: Therefore windmills, as their name indicates, are machines used to produce wind.

2) Third factor C (hidden third variable or lurking variable) causes both A and B
Example: As ice cream sales increase, the rate of drowning deaths increases sharply.
Wrong conclusion: Therefore, ice cream causes drowning. (Third factor is _________).

3) Coincidence
Example: With a decrease in the number of pirates, there has been an increase in global warming over the
same period. Wrong conclusion: Therefore, global warming is caused by a lack of pirates.

4) A causes B and B causes A
Example: Increased pressure and increased temperature.


1.2 Types of Data

“It is easy to lie with statistics, but it is easier to lie without them.” - Frederick Mosteller

Important Definitions (to know): parameter, statistic.
Important Concepts (to understand): distinguish between quantitative and categorical data; distinguish between discrete
and continuous data; distinguish between nominal, ordinal, interval, and ratio levels of measurement.

Parameter - a numerical measurement describing some characteristic of a Population.

Statistic - a numerical measurement describing some characteristic of a Sample.

Here are some steps you take to be able to tell the difference between a Statistic and a Parameter:

Step 1: Ask yourself: is this obviously a fact about the whole population? Sometimes that’s easy to figure out.
For example, with small populations, you usually have a parameter because the groups are small enough to measure:
10% of US senators voted for a particular measure. There are only 100 US Senators; you can count how every single one
of them voted.

Step 2: Ask yourself: is this obviously a fact about a very large population? If it is, you have a statistic.
For example, 45% of Jacksonville, Florida residents report that they have been to at least one Jaguars game. It’s very
doubtful that anyone polled in excess of a million people for this data. They took a sample, so they have a statistic.

Data can be qualitative or quantitative:

Quantitative (or numerical) data consist of numbers representing counts or measurements. Example: The weights of
supermodels

Qualitative (categorical or attribute) data consist of names or labels (representing categories). Example: The genders
(male/female) of professional athletes

Example 6: Which is an example of quantitative data?
a. Weights of high school students.
b. Genders of actors and actresses.
c. Colors of the rainbow.

Answer: a. Weights of high school students

Quantitative data can be discrete or continuous.

Discrete data - result when the number of possible values is either a finite number or a ‘countable’ number (0, 1, 2, 3, ...).
Example: The number of eggs that a hen lays


Continuous (numerical) data result from infinitely many possible values that correspond to some continuous scale that
covers a range of values without gaps or interruptions. Example: The amount of milk that a cow produces; e.g. 2.343115
gallons per day.
Example 7: Which is NOT an example of continuous data?
a. Temperature on a thermometer.
b. Number of students in an algebra class.
c. Mean weight of 100 flour sacks.
d. Amount of water pumped from a pond per day.

Answer: b. Number of students in an algebra class

Levels of Measurement

• Nominal - categories only (Example: Survey responses yes, no, undecided)
• Ordinal - categories with some order (Example: Course grades A, B, C, D, or F)
• Interval - differences but no natural starting point (Example: Years 1000, 2000, 1776, and 1492)
• Ratio - differences and a natural starting point (Example: Prices of college textbooks)

Tip: try a “ratio” test: If one number is twice the other, is the quantity being measured also twice the other quantity? If yes,
the data are at the ratio level.
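
The “ratio” test can be tried on two small cases. The sketch below is an illustration only: the textbook prices are hypothetical, and the temperature example is an added assumption that does not appear in the notes (Celsius has no natural zero, so it serves as an interval-level contrast).

```python
def celsius_to_kelvin(celsius):
    """Convert to a scale whose zero really means 'no thermal energy'."""
    return celsius + 273.15

# Ratio level: hypothetical textbook prices in dollars (zero means "costs nothing").
price_a, price_b = 40.0, 80.0
price_ratio = price_b / price_a
print(f"${price_b:.0f} vs ${price_a:.0f}: ratio {price_ratio:.1f}, truly twice the cost")

# Interval level: 20 degrees Celsius looks like "twice" 10 degrees Celsius ...
temp_a_c, temp_b_c = 10.0, 20.0
celsius_ratio = temp_b_c / temp_a_c
# ... but on a scale with a natural zero the ratio is nowhere near 2.
kelvin_ratio = celsius_to_kelvin(temp_b_c) / celsius_to_kelvin(temp_a_c)
print(f"Celsius ratio {celsius_ratio:.1f}, Kelvin ratio {kelvin_ratio:.3f}")
```

The prices pass the test and the Celsius readings fail it, even though differences between Celsius readings are still meaningful.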

Example 8: Questions on a survey are scored with integers 1 thru 5 with 1 representing Strongly Disagree and 5
Strongly Agree. This is an example of what kind of measurement?
a. Nominal.
b. Ratio.
c. Ordinal.
d. Interval.

Answer: c. Ordinal

Some other useful definitions:

Big data refers to data sets so large and so complex that their analysis is beyond the capabilities of traditional software
tools. Analysis of big data may require software simultaneously running in parallel on many different computers.

Data science involves applications of statistics, computer science, and software engineering, along with some other
relevant fields (such as sociology or finance).
Data science is becoming more popular every day. Follow this link to learn more about this profession:
https://www.domo.com/blog/8-reasons-why-data-science/

1.3 Collecting Sample Data

Important Concepts (to understand): understand simple random sample; distinguish between observational study and
experiment; between different sampling methods; between sampling and nonsampling errors.

Observational study vs. Experiment: In an observational study, measurements are taken from subjects as they are,
without any attempt to modify them, while in an experiment, measurements are taken from subjects at least some of whom
have been given a treatment in order to assess the effects of that treatment.


DESIGN OF EXPERIMENTS

1. Replication is the repetition of an experiment on more than one individual. Good use of replication requires
sample sizes that are large enough so that we can see effects of treatments.
2. Blinding is a technique in which the subject doesn’t know whether he or she is receiving a treatment or a placebo.
Blinding is a way to get around the placebo effect, which occurs when an untreated subject reports an
improvement in symptoms.

3. Double-Blind
Blinding occurs at two levels:
The subject doesn’t know whether he or she is receiving the treatment or a placebo.
The experimenter does not know whether he or she is administering the treatment or placebo.

4. Randomization is used when subjects are assigned to different groups through a process of random selection.
The logic is to use chance as a way to create two groups that are similar.

5. Simple Random Sample
A sample of n subjects is selected in such a way that every possible sample of the same size n has the same
chance of being chosen. For example, assign a number to every student, write the numbers on cards, mix the
cards in a box, and randomly select 10 cards. Then any student can be selected, and any combination of 10
students is possible.
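
The card-drawing procedure and the idea of randomization can both be imitated in a few lines. This is a sketch under assumptions: the roster of 40 students, the group sizes, and the fixed seed are all hypothetical and chosen only so the example is repeatable.

```python
import random

rng = random.Random(1114)  # fixed seed (arbitrary) so the sketch gives repeatable output

# Hypothetical roster: one "card" per student.
students = [f"student_{i:02d}" for i in range(1, 41)]

# Simple random sample: draw 10 cards without replacement, so every
# possible group of 10 students has the same chance of being chosen.
simple_random_sample = rng.sample(students, 10)
print("Simple random sample:", simple_random_sample)

# Randomization: shuffle all subjects, then split them into two groups,
# letting chance decide who receives the treatment and who receives the placebo.
shuffled = students[:]
rng.shuffle(shuffled)
treatment_group = shuffled[:20]
placebo_group = shuffled[20:]
print("Treatment group size:", len(treatment_group))
print("Placebo group size:", len(placebo_group))
```

Sampling decides who is studied, while randomization decides which group each studied subject lands in; the two steps answer different questions.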

SAMPLING METHODS:

• Random sample: (the best sample!) Each individual member has an equal chance of being selected. A random
sample avoids bias, but usually is expensive.
• Systematic sample: Surveying / Drawing every nth person / item on the list or production line. (The first number
should be selected at random.)
• Cluster sample: Divide the population area into sections (or clusters); randomly select some of those clusters;
choose all members from selected clusters
• Stratified sample: The population is divided into groups that have a characteristic in common (stratum, plural -
strata). For example: age, gender, college major, or income etc. Then a random sample from each group is taken.
• Convenience sample: Use results that are easy to get. Usually, the results will be affected by bias. Try to AVOID
convenience samples in scientific work.
• Multistage Sampling: Collect data by using some combination of the basic sampling methods
• Voluntary response sample: The respondents select themselves. Do NOT use voluntary response samples in
scientific work.
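
Three of the methods in the list above can be sketched side by side. Everything here is hypothetical: the population of 60 people, the `major` and `dorm` attributes, and the function names are assumptions made up for illustration.

```python
import random

rng = random.Random(0)  # fixed seed (arbitrary) for repeatable output

# Hypothetical population: 60 people, each with a major (stratum) and a dorm (cluster).
majors = ["math", "biology", "art"]
population = [
    {"id": i, "major": majors[i % 3], "dorm": f"dorm_{i // 10}"}
    for i in range(60)
]

def systematic_sample(items, step, rng):
    """Pick a random starting point, then take every step-th item."""
    start = rng.randrange(step)
    return items[start::step]

def stratified_sample(items, key, per_stratum, rng):
    """Split items into strata by key, then draw a random sample from each stratum."""
    strata = {}
    for item in items:
        strata.setdefault(item[key], []).append(item)
    picked = []
    for members in strata.values():
        picked.extend(rng.sample(members, per_stratum))
    return picked

def cluster_sample(items, key, n_clusters, rng):
    """Randomly choose whole clusters, then keep every member of the chosen clusters."""
    clusters = sorted({item[key] for item in items})
    chosen = set(rng.sample(clusters, n_clusters))
    return [item for item in items if item[key] in chosen]

systematic = systematic_sample(population, step=10, rng=rng)
stratified = stratified_sample(population, key="major", per_stratum=4, rng=rng)
clustered = cluster_sample(population, key="dorm", n_clusters=2, rng=rng)

print("Systematic:", [p["id"] for p in systematic])
print("Stratified:", sorted(p["id"] for p in stratified))
print("Cluster:", sorted(p["id"] for p in clustered))
```

Note the difference between the last two: a stratified sample takes some members from every group, while a cluster sample takes all members from some groups.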

Example 9: At a security checkpoint to a government facility, every 10th individual was more thoroughly searched
than the others. What type of sampling is this?

Answer: Systematic

TYPES OF OBSERVATIONAL STUDIES

1. Cross-sectional study: Data are observed, measured, and collected at one point in time, not over a period of
time.
2. Retrospective (or case control) study: Data are collected from a past time period by going back in time
(through examination of records, interviews, and so on).
3. Prospective (or longitudinal or cohort) study: Data are collected in the future from groups sharing common
factors (called cohorts).

Sampling error - the difference between a sample result and the true population result; such an error results from
chance sample fluctuations

Nonsampling error - sample data incorrectly collected, recorded, or analyzed (such as by selecting a biased sample,
using a defective instrument, or copying the data incorrectly).
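
Sampling error can be made concrete with a simulated population, where the true value is known. The sketch below rests on assumptions: the population of 10,000 commute times is generated artificially, and the mean of 30 minutes, spread of 8 minutes, sample size of 50, and seed are all arbitrary choices.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed (arbitrary) for repeatable output

# Hypothetical population: commute times in minutes for 10,000 people.
population = [rng.gauss(30, 8) for _ in range(10_000)]
parameter = mean(population)          # describes the whole population

# A random sample of 50 people from that population.
sample = rng.sample(population, 50)
statistic = mean(sample)              # describes only the sample

# Sampling error: the gap that remains even though nothing was done wrong.
sampling_error = statistic - parameter

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
print(f"Sampling error:              {sampling_error:+.2f}")
```

Changing the seed produces a different sample and a different error, which is what "chance sample fluctuations" refers to. A nonsampling error, by contrast, would come from a mistake such as mistyping one of the recorded values.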


Chapter 1 Additional Practice Problems (not in the Textbook)


a) Define the population (as accurately as possible).
b) Define the sample (as accurately as possible).
c) Define the sampling method.
d) Is the sample result biased? If so, who is over-represented, who under-represented? Explain.

1. A researcher conducts a survey about the income of working Americans and asks 540 American travelers at Washington
Dulles International Airport about their yearly income. All travelers surveyed confirmed that they have a job. She concludes that
the average income of a working American is $52,000.
a) Population: All working Americans.
b) Sample: The 540 travelers who were asked.
c) Convenience Sampling.
d) This sample is biased. A sample taken at an airport will not be representative of the whole population. People who
fly usually make more money than the average working American. It is a convenience sample in which better
salaries are overrepresented and lower salaries are underrepresented.

2. A sociologist wanted to investigate the attitude of women living in San Diego toward working mothers of pre-school
children. She took a random sample of 1500 residential addresses and sent students to interview any adult woman
living at the addresses chosen. The students worked between 10:00am and noon as well as from 1:00 pm to 3:00 pm
on weekdays. Of the 734 women interviewed most had a somewhat negative attitude toward working mothers.
a) Population: All women living in San Diego.
b) Sample: The 734 women who were interviewed.
c) Convenience Sampling.
d) Although the houses were picked at random, the times that the student workers collected the data made the
survey biased. At the selected times more “stay home” moms will contribute their opinion. That could explain the
overrepresentation of the “somewhat negative” attitude. “Stay home” moms and other women who do not work
might assign more value to raising children at home and less to career and professional development of women
with children. Clearly, women who are at home during the day are overrepresented in the survey and women who
have a full-time job are underrepresented.

3. A consumer magazine article asks readers to respond to the question, “How do you like to spend your free time?” and
from the responses of 100 persons it concluded that 74% of Americans consider shopping a hobby.
a) Population: All Americans.
b) Sample: The 100 people who replied.
c) Voluntary Response.
d) This is a voluntary response sample in which the respondents selected themselves, which makes it a biased
selection. People who read this consumer magazine are definitely overrepresented in this sample while people who do
not read this magazine are probably not at all represented. Thus 74% is a higher statistic than the true percentage of
Americans who consider shopping a hobby.

4. A crime prevention unit is researching the percentage of telephone customers who have received phone calls of
obscene or intimidating nature in the past. They call 1800 persons chosen at random from different telephone books.
The statistic obtained will be used to assign next year’s budget on stopping criminal phone activities.
a) Population: All telephone customers.
b) Sample: The number of people from the 1800 people selected who were actually reached.
c) Random sample was attempted, but without success. It’s a convenience sample.
d) This sample is biased. People who have their telephone number listed in a telephone directory are overrepresented
while people with unlisted telephone numbers are underrepresented. But people who have experienced such calls in the
past are more likely to have an unlisted telephone number.

5. The administration of a community college would like to know if the college should offer more math classes online.
They attached an additional question to the evaluation form that is posted to the online math classes in the last quarter
of the semester. 86% of the students who took the survey said that an online math class was right for them. Based on
this survey, should the number of math classes offered online be raised?
a) Population: All college students who have to take math.
b) Sample: The online students who took the evaluation survey.
c) Convenience and Voluntary Response.
d) This sample is extremely biased. It only includes online math students, who feel extremely confident about taking
math online. Also, the survey was done in the last part of the semester, when people who were failing had already
dropped the class. 86% is too high, and the percentage of students who really feel that an online math class is
right for them is probably much lower.