Lecture1 Olive's File

Statistics is the science of collecting and analyzing numerical data to infer proportions from representative samples, encompassing descriptive and inferential statistics. It involves understanding key concepts such as population, sample, variable, and different types of data, as well as methods for data collection and analysis. The document also discusses various sampling methods, scales of measurement, and the importance of measuring variability in data.

Uploaded by

algalibrifat900

What is Statistics?

Statistics: The practice or science of collecting and analyzing numerical
data in large quantities, especially for the purpose of inferring proportions
in a whole from those in a representative sample.
Statistics is concerned with exploring, summarizing, and making
inferences about the state of complex systems.
Statistics is neither really a science nor a branch of mathematics. It is
perhaps best considered as a meta-science (or meta-language) for dealing
with data collection, analysis, and interpretation. As such its scope is
enormous and it provides much guiding insight in many branches of
science, business, etc. Critical statistical reasoning can be extremely useful
for making sense of the ever-increasing amount of information becoming
available (e.g. via the web).
Two main areas of statistics:
Descriptive Statistics: collection, presentation, and description of sample data.

Inferential Statistics: making decisions and drawing conclusions about populations.
Example: A recent study examined the math and verbal SAT
scores of high school seniors across the country. Which of
the following statements are descriptive in nature and which
are inferential?
• The mean math SAT score was 492.
• The mean verbal SAT score was 475.
• Students in the Northeast scored higher in math but lower in
verbal.
• 80% of all students taking the exam were headed for
college.
• 32% of the students scored above 610 on the verbal SAT.
• The math SAT scores are higher than they were 10 years
ago.
Basic Terms
Population: A collection, or set, of
individuals or objects or events whose
elements have a characteristic in common.
Two kinds of populations: finite or infinite.

Sample: A representative subset of the population.
Variable: A characteristic about each individual element of a
population or sample.
Data (singular): The value of the variable associated with one
element of a population or sample. This value may be a
number, a word, or a symbol.
Data (plural): The set of values collected for the variable from
each of the elements belonging to the sample.
Experiment: A planned activity whose results yield a set of
data.
Parameter: An unknown numerical value summarizing all the
data of an entire population.
Statistic: A numerical value summarizing the sample data.
Example: A college dean is interested in learning about the average age of
faculty. Identify the basic terms in this situation.

The population is the age of all faculty members at the college.


A sample is any subset of that population. For example, we might select
10 faculty members and determine their age.
The variable is the “age” of each faculty member.
One data value would be the age of a specific faculty member.
The data would be the set of values in the sample.
The experiment would be the method used to select the ages forming the
sample and determining the actual age of each faculty member in the
sample.
The parameter of interest is the “average” age of all faculty at the college.
The statistic is the “average” age for all faculty in the sample.
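
The distinction between a parameter and a statistic in the faculty example can be sketched in a few lines of Python. The ages below are randomly generated and purely hypothetical, as are the population size of 200 and the names `population_ages` and `sample_ages`; only the sample size of 10 comes from the example above.

```python
import random
from statistics import mean

random.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical population: the age of every faculty member at the college.
population_ages = [random.randint(28, 70) for _ in range(200)]

# Parameter: summarizes the whole population (usually unknown in practice).
parameter = mean(population_ages)

# Sample: 10 faculty members selected at random, without replacement.
sample_ages = random.sample(population_ages, 10)

# Statistic: summarizes only the sample and is used to estimate the parameter.
statistic = mean(sample_ages)

print(f"parameter (population mean age): {parameter:.1f}")
print(f"statistic (sample mean age):     {statistic:.1f}")
```

A different sample would generally give a different statistic, while the parameter stays fixed.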
Two kinds of variables:
Qualitative/Attribute/Categorical Variable: A
variable whose values are categories or labels rather than numerical measurements.
Note: Arithmetic operations, such as addition and
averaging, are not meaningful for data resulting from
a qualitative variable.
Quantitative/Numerical Variable: A variable that
quantifies an element of a population.
Note: Arithmetic operations such as addition and
averaging, are meaningful for data resulting from a
quantitative variable.
Example: Identify each of the following examples as attribute
(qualitative) or numerical (quantitative) variables.

1. The residence hall for each student in a statistics class.
(Attribute)
2. The amount of gasoline pumped by the next 10 customers
at the local Unimart. (Numerical)
4. The color of the baseball cap worn by each of 20 students.
(Attribute)
5. The length of time to complete a mathematics homework
assignment. (Numerical)
6. The state in which each truck is registered when stopped
and inspected at a weigh station. (Attribute)
Qualitative and quantitative variables may be further
subdivided:

Variable
    Qualitative
        Nominal
        Ordinal
    Quantitative
        Discrete
        Continuous
Nominal Variable: A qualitative variable that categorizes (or
describes, or names) an element of a population.

Ordinal Variable: A qualitative variable that incorporates an
ordered position, or ranking.

Discrete Variable: A quantitative variable that can assume a
countable number of values. Intuitively, a discrete variable
can assume values corresponding to isolated points along a
line interval. That is, there is a gap between any two values.

Continuous Variable: A quantitative variable that can
assume an uncountable number of values. Intuitively, a
continuous variable can assume any value along a line
interval, including every possible value between any two
values.
Note:
1. In many cases, a discrete and continuous variable may
be distinguished by determining whether the variables
are related to a count or a measurement.
2. Discrete variables are usually associated with counting.
If the variable cannot be further subdivided, it is a clue
that you are probably dealing with a discrete variable.
3. Continuous variables are usually associated with
measurements. The values of continuous variables are only
limited by your ability to measure them.
Example: Identify each of the following as examples of
qualitative or numerical variables:
1. The temperature in Barrow, Alaska at 12:00 pm on any
given day.
2. The make of automobile driven by each faculty member.
3. Whether or not a 6 volt lantern battery is defective.
4. The weight of a lead pencil.
5. The length of time billed for a long distance telephone call.
6. The brand of cereal children eat for breakfast.
7. The type of book taken out of the library by an adult.
Example: Identify each of the following as examples of (1)
nominal, (2) ordinal, (3) discrete, or (4) continuous variables:
1. The length of time until a pain reliever begins to work.
2. The number of chocolate chips in a cookie.
3. The number of colors used in a statistics textbook.
4. The brand of refrigerator in a home.
5. The overall satisfaction rating of a new car.
6. The number of files on a computer’s hard disk.
7. The pH level of the water in a swimming pool.
8. The number of staples in a stapler.
Measuring Variables
• To establish relationships between variables,
researchers must observe the variables and
record their observations. This requires that
the variables be measured.
• The process of measuring a variable requires a
set of categories called a scale of
measurement and a process that classifies
each individual into one category.

A nominal scale is an unordered set of
categories identified only by name.
Nominal measurements only permit you to
determine whether two individuals are the
same or different. Examples…
An ordinal scale is an ordered set of
categories. Ordinal measurements tell you
the direction of difference between two
individuals. Examples…
An interval scale is an ordered series of equal-sized
categories. Interval measurements identify the direction and
magnitude of a difference. The zero point is located
arbitrarily on an interval scale. Interval scales are numeric
scales in which we know not only the order, but also the
exact differences between the values. The classic example of
an interval scale is Celsius temperature because the
difference between each value is the same. For example, the
difference between 60 and 50 degrees is a measurable 10
degrees, as is the difference between 80 and 70
degrees. Clock time (time of day) is another good example of an
interval scale in which the increments are known, consistent, and
measurable.
• Here’s the problem with interval scales: they don’t have a
“true zero.” For example, 0 degrees Celsius does not mean “no
temperature.” Without a true zero, it is impossible to
compute ratios. With interval data, we can add and
subtract, but cannot meaningfully multiply or divide. 10 degrees + 10
degrees = 20 degrees, but 20 degrees is not twice as hot as
10 degrees. Bottom line, interval scales are great, but
we cannot calculate ratios, which brings us to our last
measurement scale.
• A ratio scale is an interval scale where a value of
zero indicates nothing of the variable. Ratio
measurements identify the direction and
magnitude of differences and allow ratio
comparisons of measurements. Ratio scales tell us
about the order, they tell us the exact value
between units, AND they also have an absolute
zero. Good examples of ratio variables include
height and weight.
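
The difference between interval and ratio scales can be checked numerically. The sketch below converts two Celsius readings to Kelvin, a temperature scale with a true zero; Kelvin is not mentioned in these notes and is brought in here only as an illustration. Differences agree on both scales, but the ratio "20 is twice 10" holds only for the Celsius numbers, not for the underlying temperatures.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvins (0 K is a true zero)."""
    return celsius + 273.15

a_c, b_c = 10.0, 20.0  # the two readings from the example above

# Differences are meaningful on an interval scale: both scales agree.
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# Ratios are only meaningful on a ratio scale.
ratio_c = b_c / a_c                                        # 2.0, but not meaningful
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)  # about 1.035

print(f"difference: {diff_c:.1f} C, {diff_k:.1f} K")
print(f"ratio of Celsius numbers: {ratio_c:.3f}")
print(f"ratio of Kelvin values:   {ratio_k:.3f}")
```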
Measure and Variability
• No matter what the response variable: there
will always be variability in the data.
• One of the primary objectives of statistics:
measuring and characterizing variability.
• Controlling (or reducing) variability in a
manufacturing process: statistical process
control.
Example: A supplier fills cans of soda marked 12 ounces.
How much soda does each can really contain?

• It is very unlikely any one can contains exactly 12 ounces of
soda.
• There is variability in any process.
• Some cans contain a little more than 12 ounces, and some
cans contain a little less.
• On the average, there are 12 ounces in each can.
• The supplier hopes there is little variability in the process,
that most cans contain close to 12 ounces of soda.
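
One way to put numbers on the soda example is to summarize a handful of measured fill volumes. The eight values below are invented for illustration, and the sample standard deviation is only one common measure of variability (these notes do not prescribe a specific one).

```python
from statistics import mean, stdev

# Hypothetical fill volumes, in ounces, for eight cans marked 12 ounces.
fills = [12.1, 11.9, 12.0, 12.2, 11.8, 12.0, 12.1, 11.9]

average = mean(fills)                    # center of the process
spread = stdev(fills)                    # sample standard deviation
value_range = max(fills) - min(fills)    # largest minus smallest fill

print(f"average fill: {average:.2f} oz")
print(f"std. dev.:    {spread:.3f} oz")
print(f"range:        {value_range:.1f} oz")
```

Here the average is 12 ounces even though only two cans contain exactly 12 ounces; a supplier would want the standard deviation and range to be small.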
Data Collection
• First problem a statistician faces: how to
obtain the data.
• It is important to obtain good, or
representative, data.
• Inferences are made based on statistics
obtained from the data.
• Inferences can only be as good as the data.
Sources of Data
• Primary data are collected by the
investigator conducting the research.
• Secondary data are collected by
someone other than the user. Common
sources of secondary data are censuses,
organizational records, and data collected
through research done by others.
Process of data collection:

1. Define the objectives of the survey or experiment.
Example: Estimate the average life of an electronic
component.
2. Define the variable and population of interest.
Example: Length of time for anesthesia to wear off after
surgery.
3. Define the data-collection and data-measuring schemes.
This includes sampling procedures, sample size, and the
data-measuring device (questionnaire, scale, ruler, etc.).
4. Determine the appropriate descriptive or inferential
data-analysis techniques.
Methods used to collect data:

Experiment: The investigator controls or modifies the
environment and observes the effect on the variable under
study.

Survey: Data are obtained by sampling some of the
population of interest. The investigator does not modify the
environment.

Census: A 100% survey. Every element of the population is
listed. Seldom used: difficult and time-consuming to
compile, and expensive.
Sampling Frame: A list of the elements belonging to the
population from which the sample will be drawn.

Note: It is important that the sampling frame be representative
of the population.

Sample Design: The process of selecting sample elements
from the sampling frame.

Note: There are many different types of sample designs.
Usually they all fit into two categories: judgment samples and
probability samples.
Judgment Samples: Samples that are selected on the basis of
being “typical.”

Items are selected that are representative of the population.
The validity of the results from a judgment sample reflects
the soundness of the collector’s judgment.

Probability Samples: Samples in which the elements to be
selected are drawn on the basis of probability. Each element
in a population has a certain probability of being selected as
part of the sample.
Random Samples: A sample selected in such a way that
every element in the population has an equal probability of
being chosen. Equivalently, all samples of size n have an
equal chance of being selected. Random samples are obtained
either by sampling with replacement from a finite population
or by sampling without replacement from an infinite
population.

Note:
1. Inherent in the concept of randomness: the next result (or
occurrence) is not predictable.
2. Proper procedure for selecting a random sample: use a
random number generator or a table of random numbers.
Example: An employer is interested in the time it takes each
employee to commute to work each morning. A random
sample of 35 employees will be selected and their commuting
time will be recorded.

There are 2712 employees.
Each employee is numbered: 0001, 0002, 0003, etc. up to
2712.
Using four-digit random numbers, a sample is identified:
1315, 0987, 1125, etc.
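
A random number generator can stand in for the table of random numbers. The sketch below uses Python's standard `random` module; the seed is arbitrary and the variable names are illustrative. Note that `sample` draws without replacement, so no employee is selected twice.

```python
import random

N_EMPLOYEES = 2712   # employees numbered 0001 to 2712
SAMPLE_SIZE = 35

rng = random.Random(2024)  # arbitrary seed, only to make the sketch repeatable

# Draw 35 distinct employee numbers, each employee equally likely to be chosen.
sample_ids = rng.sample(range(1, N_EMPLOYEES + 1), SAMPLE_SIZE)

# Format as four-digit labels, as in the example above.
labels = [f"{i:04d}" for i in sorted(sample_ids)]
print(labels)
```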
Systematic Sample: A sample in which every kth item of the
sampling frame is selected, starting from the first element
which is randomly selected from the first k elements.

Note: The systematic technique is easy to execute. However,
it has some inherent dangers when the sampling frame is
repetitive or cyclical in nature. In these situations the results
may not approximate a simple random sample.
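
A minimal sketch of systematic sampling, assuming the sampling frame is an ordinary Python list. The frame of 100 numbered items, the choice k = 10, and the function name `systematic_sample` are all hypothetical.

```python
import random

def systematic_sample(frame, k, rng):
    """Select every kth item, starting at a random position among the first k."""
    start = rng.randrange(k)   # random index 0 .. k-1
    return frame[start::k]     # that item, then every kth item after it

rng = random.Random(7)            # arbitrary seed
frame = list(range(1, 101))       # hypothetical frame: items numbered 1 to 100
sample = systematic_sample(frame, 10, rng)
print(sample)
```

If the frame repeated with a period of 10 (for example, every tenth item came from the same machine), this sample would contain only one kind of item, which is the danger the note above describes.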

Stratified Random Sample: A sample obtained by
stratifying the sampling frame and then selecting a fixed
number of items from each of the strata by means of a simple
random sampling technique.
Proportional Sample (or Quota Sample): A sample
obtained by stratifying the sampling frame and then selecting
a number of items in proportion to the size of the strata (or by
quota) from each stratum by means of a simple random
sampling technique.
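
Proportional allocation can be sketched as follows. The three strata, their sizes, and the function name `proportional_sample` are hypothetical; the stratum sizes were chosen so the allocation comes out in whole numbers, and in general rounding can make the total differ slightly from the requested sample size.

```python
import random

def proportional_sample(strata, total_n, rng):
    """Draw a simple random sample from each stratum, sized in proportion to it."""
    population_size = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        n_stratum = round(total_n * len(items) / population_size)
        sample[name] = rng.sample(items, n_stratum)
    return sample

# Hypothetical sampling frame of 1000 students split into three strata.
strata = {
    "first-year": list(range(0, 500)),     # 50% of the frame
    "second-year": list(range(500, 800)),  # 30%
    "third-year": list(range(800, 1000)),  # 20%
}

rng = random.Random(11)  # arbitrary seed
sample = proportional_sample(strata, 50, rng)
sizes = {name: len(items) for name, items in sample.items()}
print(sizes)
```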

Cluster Sample: A sample obtained by stratifying the
sampling frame and then selecting some or all of the items
from some of, but not all, the strata.
Biased Sampling Method: A sampling method that produces
data that systematically differ from the sampled
population. An unbiased sampling method is one that is not
biased.

Sampling methods that often result in biased samples:
1. Convenience sample: sample selected from elements of a
population that are easily accessible.
2. Volunteer sample: sample collected from those elements
of the population which chose to contribute the needed
information on their own initiative.
Remember: Responsible use of statistical
methodology is very important. The burden is on the
user to ensure that the appropriate methods are
correctly applied and that accurate conclusions are
drawn and communicated to others.
Frequency Distribution
• Raw data are collected data that have not been organized
numerically.
• When summarizing large masses of raw data, it is often
useful to distribute the data into classes, or categories, and
to determine the number of objects belonging to each class,
called the class frequency. A tabular arrangement of data
by classes together with the corresponding class
frequencies is called a frequency distribution, or
frequency table.
Forming Frequency Distributions

• Determine the largest and smallest numbers in the raw data and thus
find the range (the difference between the largest and smallest
numbers).
• Divide the range into a convenient number of class intervals having the
same size. If this is not feasible, use class intervals of different sizes or
open class intervals. The number of class intervals is usually between 5
and 20, depending on the data. Class intervals are also chosen so that
the class marks (or midpoints) coincide with the actually observed
data.
• Determine the number of observations falling into each class interval;
that is, find the class frequencies. This is best done by using a tally, or
score sheet.
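
The three steps above can be sketched in Python for integer-valued data. The fifteen heights, the lower limit of 60, and the class width of 3 are hypothetical choices made so that the class marks (61, 64, 67, 70, 73) coincide with observable values; `frequency_distribution` is an illustrative name, not a library function.

```python
from collections import Counter

def frequency_distribution(data, lower, width, n_classes):
    """Tally integer data into equal-width classes starting at `lower`."""
    tally = Counter((x - lower) // width for x in data)
    table = []
    for i in range(n_classes):
        class_lower = lower + i * width
        class_upper = class_lower + width - 1   # inclusive limit for integer data
        table.append((class_lower, class_upper, tally.get(i, 0)))
    return table

# Hypothetical raw data: heights in inches.
heights = [61, 64, 64, 65, 66, 67, 67, 67, 68, 68, 69, 70, 70, 71, 73]

data_range = max(heights) - min(heights)   # step 1: largest minus smallest
table = frequency_distribution(heights, lower=60, width=3, n_classes=5)

print("range:", data_range)
for class_lower, class_upper, frequency in table:
    print(f"{class_lower}-{class_upper}: {frequency}")
```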
Forming Frequency Distributions

• The number of observations in a class interval is called the
class frequency. The class interval with the highest frequency
is called the modal class.
• The following is an example of a frequency distribution for
the heights of 100 male students.
Histogram
• A histogram, or frequency histogram, consists of a set of
rectangles having (a) bases on a horizontal axis (the X
axis), with centers at the class marks and lengths equal to
the class interval sizes, and (b) areas proportional to the
class frequencies.
Frequency Polygon
• A frequency polygon is a line graph of the class
frequencies plotted against class marks (midpoints). It can
be obtained by connecting the midpoints of the tops of the
rectangles in the histogram. A smoothed frequency polygon
is called a frequency curve.
Cumulative Frequency

• The relative frequency of a class is the frequency of the
class divided by the total frequency of all classes and is
generally expressed as a percentage.
• The total frequency of all values less than the upper class
boundary of a given class interval is called the cumulative
frequency up to and including that class interval.
• A cumulative frequency polygon is a line graph of the
cumulative class frequencies plotted against the upper bound
of the corresponding class. A smoothed cumulative frequency
polygon is called a cumulative frequency curve or ogive
curve.
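
Relative and cumulative frequencies follow directly from the class frequencies. The five frequencies below are hypothetical (chosen to total 100 so the percentages are easy to read), and the class labels are placeholders.

```python
from itertools import accumulate

classes = ["60-62", "63-65", "66-68", "69-71", "72-74"]   # placeholder labels
frequencies = [5, 18, 42, 27, 8]                           # hypothetical counts

total = sum(frequencies)

# Relative frequency: class frequency divided by total, as a percentage.
relative_pct = [100 * f / total for f in frequencies]

# Cumulative frequency: running total up to and including each class.
cumulative = list(accumulate(frequencies))

for label, f, rel, cum in zip(classes, frequencies, relative_pct, cumulative):
    print(f"{label}: frequency {f:3d}, relative {rel:5.1f}%, cumulative {cum:3d}")
```

Plotting `cumulative` against the upper class boundaries would give the cumulative frequency polygon described above.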
Exercise
• Given the final grades in mathematics of 80 students at State
University, prepare a frequency distribution with percentage and
cumulative frequencies, and draw the histogram, frequency polygon,
frequency curve, cumulative frequency polygon, and ogive
curve.
Which Graph?

• Dot-plot and stem-and-leaf plot:
– More useful for small data sets
– Data values are retained
Dot Plots
To construct a dot plot
1. Draw and label horizontal line
2. Mark regular values
3. Place a dot above each value on the number line
[Figure: dot plot, Sodium in Cereals]
Stem-and-leaf plots
• Summarizes quantitative variables
• Separate each observation into a stem (first part of #) and a
leaf (last digit)
• Write each leaf to the right of its stem; order
leaves if desired
[Figure: stem-and-leaf plot, Sodium in Cereals]
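
The steps above can be sketched as a small function for non-negative integers, taking the last digit as the leaf and everything before it as the stem. The nine data values and the function name `stem_and_leaf` are hypothetical; they are not the cereal data from the figure.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot for non-negative integers."""
    stems = defaultdict(list)
    for value in sorted(values):          # sorting orders the leaves
        stem, leaf = divmod(value, 10)    # stem = leading digits, leaf = last digit
        stems[stem].append(leaf)
    lines = []
    for stem in range(min(stems), max(stems) + 1):   # keep empty stems visible
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

data = [23, 12, 34, 21, 15, 52, 28, 30, 23]   # hypothetical observations
lines = stem_and_leaf(data)
print("\n".join(lines))
```

Every original value can be read back from the plot (for example, stem 2 with leaves 1, 3, 3, 8 gives 21, 23, 23, 28), which is the sense in which data values are retained.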
Boxplot

1. Box goes from Q1 to Q3
2. Line is drawn inside the box at the median
3. Line goes from lower end of box to smallest observation not a
potential outlier and from upper end of box to largest observation
not a potential outlier
4. Potential outliers are shown separately, often with * or +
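
The quantities needed to draw a boxplot can be computed as in the sketch below. These notes do not say how a "potential outlier" is identified for a boxplot; the code assumes the commonly used 1.5 × IQR rule, and quartile conventions also vary (the "inclusive" method of `statistics.quantiles` is used here). The nine data values are hypothetical.

```python
from statistics import quantiles

def boxplot_summary(data):
    """Quartiles, whisker ends, and potential outliers (assumed 1.5*IQR rule)."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),   # ends of the two lines
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

data = [2, 4, 5, 6, 7, 8, 9, 10, 30]   # hypothetical observations
summary = boxplot_summary(data)
print(summary)
```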
Comparing Distributions
Boxplots do not display the shape of the distribution as
clearly as histograms, but are useful for making graphical
comparisons of two or more distributions
Z-Score

An observation from a bell-shaped distribution is a
potential outlier if its z-score < -3 or > +3.
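
A z-score is the number of standard deviations an observation lies from the mean, z = (x - mean) / standard deviation. The sketch below applies the rule above to hypothetical data, using the sample mean and sample standard deviation as an assumption about which estimates are intended.

```python
from statistics import mean, stdev

def z_scores(data):
    """Return (value, z-score) pairs using the sample mean and std. deviation."""
    m = mean(data)
    s = stdev(data)
    return [(x, (x - m) / s) for x in data]

# Hypothetical data: eighteen typical values and one unusually large one.
data = [9, 10, 11] * 6 + [25]

scored = z_scores(data)
flagged = [x for x, z in scored if z < -3 or z > 3]
print("potential outliers:", flagged)
```

With very small samples no observation can reach a z-score beyond 3 in absolute value, so the rule is only informative for reasonably large data sets.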
Ways to chart categorical data
Because the variable is categorical, the data in the
graph can be ordered any way we want (alphabetical,
by increasing value, by year, by personal preference,
etc.).
– Bar diagram: Each category is represented by a bar.
– Pie chart: Show the categorical variable as a pie whose slices
are sized by counts or percents of the whole. Each category is
represented by a slice.
Example: Top 10 causes of death in the United States, 2001

Rank  Causes of death       Counts    Percent of top 10s  Percent of total deaths
1     Heart disease         700,142   37%                 29%
2     Cancer                553,768   29%                 23%
3     Cerebrovascular       163,538   9%                  7%
4     Chronic respiratory   123,013   6%                  5%
5     Accidents             101,537   5%                  4%
6     Diabetes mellitus     71,372    4%                  3%
7     Flu and pneumonia     62,034    3%                  3%
8     Alzheimer’s disease   53,852    3%                  2%
9     Kidney disorders      39,480    2%                  2%
10    Septicemia            32,238    2%                  1%
      All other causes      629,967                       26%

For each individual who died in the United States in 2001, we record what was
the cause of death. The table above is a summary of that information.
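
As a check on how the summary is built, the "Percent of top 10s" column can be recomputed from the counts. The sketch below uses only the ten counts listed in the table; the variable names are illustrative.

```python
# Counts of deaths for the top 10 causes, copied from the table above.
counts = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}

top10_total = sum(counts.values())

# Each cause as a percentage of the top-10 total, rounded to whole percents.
pct_of_top10 = {cause: round(100 * n / top10_total) for cause, n in counts.items()}

for cause, pct in pct_of_top10.items():
    print(f"{cause:<22}{counts[cause]:>9,}{pct:>5}%")
```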
Bar graphs
Each category is represented by one bar. The bar’s height shows
the count (or sometimes the percentage) for that particular
category.
[Figure: bar graph, “Top 10 causes of death in the U.S., 2001”; vertical axis: Counts (x1000), 0 to 800; one bar per cause of death.]
The number of individuals who died of an accident in 2001 is approximately 100,000.
[Figure: two versions of the bar graph “Top 10 causes of death in the U.S., 2001”; vertical axis: Counts (x1000), 0 to 800.]
Bar graph sorted by rank: easy to analyze.
Sorted alphabetically: much less useful.
Multiple Bar Chart

Child poverty before and after government intervention (UNICEF, 1996)

What does this chart tell you?

• The United States has the highest rate of child poverty among
developed nations (22% of children under 18).

The poverty line is defined as 50% of national median income.
Component Bar Diagram
Percentage Component Bar Diagram
