Lecture Notes - Data

Introduction to Statistics

Prof. L Prado
OER www.helpyourmath.com

Chapter 1

• Overview
• Nature of data
• Skills needed in statistics

The science of statistics is collecting, organizing, summarizing, and analyzing information to draw conclusions from data or answer questions.

Statistical Methods
• Descriptive Statistics
• Inferential Statistics
   – Estimation
   – Hypothesis Testing

Overview
Statistics:
• Descriptive
   – Collection, organization, summarization, and presentation of data.
• Inferential
   – Drawing conclusions about a population by using samples (draw = infer).
Survey: a tool to collect data from a smaller group that is part of a larger group, in order to learn something about the larger group.
Key goal of statistics: learn about a large group (population) from data from a smaller subgroup (sample).

Overview
Definitions:
• Variable: a characteristic or attribute that varies.
• Data: the values collected for a variable (measurements, observations: gender, answers, …).
• Statistics: a collection of methods to study data.
• Population: the complete collection of all subjects (individuals, scores, measurements, …).
• Sample: a subcollection of members selected from a population.
• Census: a collection of data from every member of the population (e.g., the US Census).

Overview
Example:
• Poll: 1087 adults are asked whether or not they drink alcoholic beverages.
   – Sample: the 1087 adults polled.
   – Population: all US adults (150 million).
• Census: every 10 years, the Census Bureau tries to collect information from every member of the US population.
   – Impossible!
   – Very expensive! (time and money)
• Use sample data to draw conclusions about the whole population: inferential statistics!

Parameter:
• A numerical measurement describing some characteristic of the population.
• Lincoln elected: he received 39.82% of the 1,865,908 votes counted.
• 39.82% is a parameter, because it is based on all the votes counted rather than on a sample.

Statistic:
• A numerical measurement describing some characteristic of the sample.
• Based on a sample of 877 elected executives, 45% would not hire an applicant with a typographical error in the application.
• 45% is a statistic, because it is based on a sample rather than on the whole population.

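The difference between a parameter and a statistic can be sketched in a few lines of Python. Everything numeric here is an assumption for illustration only: the population of 100,000 and its 40% "yes" share are made up, and only the sample size of 877 echoes the example above.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 individuals, 40% of whom answer "yes" (1).
population = [1] * 40_000 + [0] * 60_000

# Parameter: computed from every member of the population.
parameter = sum(population) / len(population)

# Statistic: computed from a random sample only.
sample = random.sample(population, 877)
statistic = sum(sample) / len(sample)

print(f"parameter = {parameter:.4f}")
print(f"statistic = {statistic:.4f}")
```

The parameter is a fixed number, while the statistic changes from sample to sample (try a different seed).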
Types of data
Quantitative data: numbers representing counts or measurements.
   Number of children in a family, weights, heights, ages.
Qualitative data (categorical data): nonnumerical.
   Gender of an athlete, zip code, blood type, states in the U.S., and brands of TV.
Discrete (count) variable vs. continuous (measure) variable:
   Number of people in a household vs. temperatures in May.
Nominal level of measurement: names, labels, or categories with no ordering.
   Yes/No/Undecided responses, colors, gender, jersey numbers of players.
Ordinal level of measurement: the values can be ordered (ranked), but differences between them are meaningless or nonexistent.
   Grades A, B, C, D, F; intensity of pain (none, mild, moderate, severe).
Interval level of measurement: ordered, with meaningful differences, but zero is arbitrary or missing (it does not mean "none").
   Temperature (°F or °C), year, IQ score.
Ratio level of measurement: interval level with a meaningful zero, so ratios make sense.
   Weights, prices (non-negative), number of phone calls received.

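One way to see the gap between the interval and ratio levels is to compare ratios of temperatures and weights. The sketch below is only an illustration: the specific temperatures and weights are made up, and the Kelvin conversion is brought in as an example of a temperature scale that does have a true zero.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius temperature to kelvins (a scale with a true zero)."""
    return c + 273.15

# Interval level: differences are meaningful, ratios are not.
c_low, c_high = 10.0, 20.0
diff_c = c_high - c_low          # a 10-degree difference is meaningful
naive_ratio = c_high / c_low     # 2.0, but 20 °C is not "twice as hot" as 10 °C
true_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

# Ratio level: zero means "none", so ratios are meaningful.
w_light, w_heavy = 50.0, 100.0   # weights in kg
weight_ratio = w_heavy / w_light # 100 kg really is twice 50 kg

print(diff_c, naive_ratio, round(true_ratio, 3), weight_ratio)
```

On the Kelvin scale the two temperatures differ by only a few percent, which is why "twice as hot" is not a valid statement about Celsius readings.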
Summary
• The process of statistics is designed to collect and analyze data to reach conclusions.
• Variables can be classified by their type of data:
   – Qualitative variables: nominal or ordinal.
   – Quantitative variables:
      Discrete (values counted)
      Continuous (values measured)

Basic skills
Samples:
• Representative:
   – “39 of 40 people polled would vote for A.” Sampled in A’s headquarters!
• Not too small:
   – CDF published “among HS students suspended, 67% suspended more than 3 times.” Sample size: 3!
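To see why a sample of size 3 is too small, it may help to list every percentage such a sample can produce. This is a minimal sketch; the variable names are illustrative.

```python
n = 3  # sample size from the example above

# With n individuals, the only possible sample proportions are k/n for k = 0..n.
possible_percentages = [round(100 * k / n) for k in range(n + 1)]

print(possible_percentages)  # [0, 33, 67, 100]
```

A reported "67%" from this sample simply means 2 out of 3 students, so it says very little about the larger group.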

Graphs: In which one does red do better?
[Figure: two bar charts titled "Median Weekly Income (16-24)", each comparing Men and Women. The vertical axis of the left chart runs from $300 to $390; the vertical axis of the right chart runs from $0 to $400.]

Percentage of:
• 6% of 1200 = 6 / 100 * 1200 = 72
Percentage >>> decimal:
• 27.3% = 27.3 / 100 = 0.273
Fraction >>> percentage:
• 3/4 = 0.75 >>> 0.75 * 100% = 75%
Decimal >>> percentage:
• 0.852 >>> 0.852 * 100% = 85.2%
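The four conversions above can be checked with a short Python sketch. The function names are illustrative, and the results are rounded before printing because floating-point arithmetic can leave tiny errors in the last digits.

```python
def percent_of(percent, amount):
    """Return `percent` percent of `amount`."""
    return percent / 100 * amount

def percent_to_decimal(percent):
    """Convert a percentage to a decimal."""
    return percent / 100

def to_percent(value):
    """Convert a fraction or decimal to a percentage."""
    return value * 100

# Round to hide floating-point noise in the last digits.
print(round(percent_of(6, 1200), 6))       # 6% of 1200
print(round(percent_to_decimal(27.3), 6))  # 27.3% as a decimal
print(round(to_percent(3 / 4), 6))         # 3/4 as a percentage
print(round(to_percent(0.852), 6))         # 0.852 as a percentage
```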
Basic skills 2
Calculator:

Statistical Study
Observational study: observe and measure characteristics without trying to modify the individuals.
• Gallup poll, Nielsen Media poll (TV shows).
• Cross-sectional: data are observed and measured at one point in time.
• Retrospective: data are collected from the past (records).
• Prospective: data are collected along the way from groups (smokers/non-smokers).
Experiment: apply a treatment to individuals, then observe and measure its effects.
• Clinical trial for Lipitor.
• Treatment group (Lipitor) and control group (placebo group).
• Control: comparison, single-blinding, double-blinding, placebo, blocks.
• Replication: ability to repeat the experiment.
• Randomization: data need to be collected in an appropriate (random) way, otherwise they are completely useless!

● A completely randomized design is when each experimental unit is assigned to a treatment completely at random
● An example
   – A farmer wants to test the effects of a fertilizer
   – We choose a set of plants to receive the treatment
   – We randomly assign plants to receive different levels of fertilizer
● This has similarities to completely random sampling

● We control as many factors as we can
   – Amount of watering
   – Method of tilling
   – Soil acidity
● Randomization decreases the effects of uncontrolled factors
   – Rainfall
   – Sunlight
   – Temperature

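The fertilizer example can be sketched as a completely randomized design in Python. This is a minimal illustration, not a prescribed procedure: the plant labels, the three fertilizer levels, and the choice to keep group sizes equal are all assumptions.

```python
import random

def completely_randomized_design(units, treatments, seed=None):
    """Randomly assign each unit to one treatment, keeping group sizes balanced."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    # Deal the shuffled units out to the treatments in turn.
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}

plants = [f"plant_{i}" for i in range(1, 13)]  # hypothetical plant labels
levels = ["none", "low", "high"]               # hypothetical fertilizer levels
assignment = completely_randomized_design(plants, levels, seed=42)

for plant in plants:
    print(plant, "->", assignment[plant])
```

Because the shuffle decides which plant gets which level, uncontrolled factors such as sunlight are spread across the treatment groups by chance rather than by the experimenter's choice.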
• A randomized block design is when the experimental units are first grouped, and then, within each group, units are assigned to treatments at random
• The groups are called blocks
• This design will reduce confounding
• This has similarities to stratified sampling
Remark: When two effects cannot be distinguished, this is called confounding

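A randomized block design can be sketched in the same spirit. The version below follows the common textbook form, in which the random assignment to treatments happens separately inside each block; the block names, unit labels, and treatments are hypothetical.

```python
import random

def randomized_block_design(blocks, treatments, seed=None):
    """Within each block, randomly assign units to treatments."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # randomize separately inside every block
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical blocks: plants grouped by the field they grow in.
blocks = {
    "sunny_field": ["s1", "s2", "s3", "s4"],
    "shady_field": ["h1", "h2", "h3", "h4"],
}
treatments = ["fertilizer", "no_fertilizer"]
block_assignment = randomized_block_design(blocks, treatments, seed=7)

for unit, (block, treatment) in sorted(block_assignment.items()):
    print(unit, block, treatment)
```

Each block ends up with both treatments, so a difference between the fields cannot be mistaken for an effect of the fertilizer.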
Summary
• The planning for designed experiments is crucial to the success of the experiment
• A double-blind implementation of experiments reduces changes in behavior caused by knowing who receives the treatment
• There are different good methods for assigning treatments to experimental units
   – Completely random
   – Randomized blocks
   – Matched-pairs (I skipped!)

Sampling Design
Sampling:
• Simple random sample (SRS) of size n: every possible sample of n individuals has the same chance of being chosen.
   – Note that an SRS also gives each individual an equal chance of being chosen (thus avoiding bias in the choice).
• Systematic: select a starting point and then choose every kth member.
• Convenience: use data that are easy to get.
• Stratified: subdivide the population into at least 2 subgroups with a common characteristic (homogeneous) and draw a sample from each (e.g., gender, age, animal species).
• Cluster: divide the population into areas (clusters), select some of the clusters, and sample from those clusters (intact groups representative of the population) (e.g., city blocks, geographic areas).
Sampling error: the difference between a sample result and the true population result; it results from chance sample fluctuations.
Nonsampling error: occurs when data are incorrectly collected, measured, recorded, or analyzed.

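The four probability-based sampling methods above can be compared side by side in Python. This is a rough sketch under stated assumptions: the population of 100 numbered individuals, the two strata, and the ten clusters are invented, and the cluster version takes every member of each selected cluster, which is one common variant.

```python
import random

def simple_random_sample(population, n, rng):
    """Every possible sample of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Pick a random starting point, then take every kth member."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample from every stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(0)
population = list(range(1, 101))  # hypothetical individuals 1..100
strata = {"under_30": list(range(1, 41)), "30_and_over": list(range(41, 101))}
clusters = {f"block_{b}": list(range(b * 10 + 1, b * 10 + 11)) for b in range(10)}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 5, rng)
clustered = cluster_sample(clusters, 2, rng)

print("SRS:       ", sorted(srs))
print("Systematic:", systematic)
print("Stratified:", sorted(stratified))
print("Cluster:   ", clustered)
```

Note the contrast in the last two: stratified sampling takes a few members from every subgroup, while cluster sampling takes many members from only a few groups.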
Summary
• There are other sampling methods that are
particularly useful in certain situations
– Stratified sampling to cover the different strata
– Systematic sampling when the frame is unknown
– Cluster sampling to reduce the time and expense
required
– Multistage sampling for effective large scale samples
• The choice of sampling methods depends on the
structure of the population and the goals of the
analyst

Sources of Error in Sampling
• One type of error, sampling error, occurs because we use only part of the population in our study
   – Samples consist of only part of the total data
   – Samples are usually more realistic to analyze
   – Because there are individuals in the population that are not in our sample, sampling errors are difficult to control
• We will study sampling errors in future chapters

Types of nonsampling error
• Using an incomplete frame
• Individuals who respond have different characteristics than individuals who do not respond
• Interviewer errors
• Misrepresented answers
• Data checks
• Questionnaire design
• Wording of questions
• Order of questions, words, and responses

• Another type of error, nonsampling error, arises from the actual survey process
   – Preference is given to selecting some individuals over others
   – Individual answers are not accurate (for various reasons)
• Nonsampling errors can often be controlled or minimized with a well-designed survey and sampling technique

• The Literary Digest used its polls to predict the winner of presidential elections
• Its previous polls were accurate
• In 1936, the Literary Digest predicted that Alf Landon would defeat Franklin Roosevelt in a landslide
• In the actual election, Roosevelt won in a landslide

• Why was the Literary Digest so far off?
• The 1936 frame was not representative of the total voting population
   – The sampling process was not completely random
   – The frame had too large a proportion of Republicans, who generally favored Landon
   – The frame had too small a proportion of Democrats, who generally favored Roosevelt
• Republicans were overrepresented and Democrats were underrepresented!

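The effect of an unrepresentative frame can be imitated with a small simulation. All numbers below are invented for illustration and are not the 1936 figures: the electorate's 60/40 split, the inclusion chances for the frame, and the poll sizes are assumptions chosen only to make the bias visible.

```python
import random

rng = random.Random(1936)  # fixed seed so the sketch is repeatable

# Hypothetical electorate: 60% favor candidate R, 40% favor candidate L.
voters = ["R"] * 60_000 + ["L"] * 40_000

# Hypothetical biased frame: L supporters are three times as likely to be listed.
frame = [v for v in voters if rng.random() < (0.6 if v == "L" else 0.2)]

def share_for(candidate, sample):
    """Proportion of the sample that favors the given candidate."""
    return sum(v == candidate for v in sample) / len(sample)

biased_poll = rng.sample(frame, 5_000)   # large poll drawn from the biased frame
random_poll = rng.sample(voters, 1_000)  # smaller poll drawn from all voters

true_share = share_for("R", voters)
biased_share = share_for("R", biased_poll)
random_share = share_for("R", random_poll)

print(f"true share for R:        {true_share:.3f}")
print(f"biased-frame poll for R: {biased_share:.3f}")
print(f"random poll for R:       {random_share:.3f}")
```

In this sketch the larger poll is the one that misses badly, because a big sample cannot repair a frame that underrepresents part of the population.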
