EDA-Lecture 1
Qualitative data
Nominal Data
→ categories with no specific rank or order
Ordinal measurements
→ depict the order of variables and not the difference between each of
the variables.
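Interval Measurements
→ a numerical scale where the order of the variables is known as well
as the difference between them.
Ratio Measurements
→ the order of the variables and the difference between variables are
known, along with information on the value of true zero.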
Population
→ the entire group that you want to draw conclusions about.
Sample
→ a group that you will collect data from.
Statistical inference refers to the process of selecting and using a
sample to draw inferences about the population from which the sample
is drawn.
a. We may wish to draw conclusions about the weights of 12,000
adult students (the population) by examining only 100 students (a
sample) selected from this population.
b. We may wish to draw conclusions about the percentage of
defective bolts produced in a factory during a given 6-day week by
examining 20 bolts each day produced at various times during the
day.
What is the population? All bolts produced in the factory during the week.
What is the sample? The 120 selected bolts (20 bolts on each of 6 days).
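Example (a) can be mimicked in a few lines of Python. This is only a sketch: the weights are synthetic (drawn from a normal distribution with a made-up mean of 70 kg and standard deviation of 10 kg), and the names `population` and `sample` are chosen here for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 12,000 synthetic student weights in kg.
# The mean (70) and standard deviation (10) are assumptions, not real data.
population = [random.gauss(70, 10) for _ in range(12_000)]

# Simple random sample of 100 students, drawn without replacement.
sample = random.sample(population, 100)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f} kg")
print(f"sample mean:     {sample_mean:.2f} kg")
```

The two means will usually be close but not identical, which is why conclusions drawn from a sample always carry some uncertainty.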
• The basic idea behind all statistical methods of data
analysis is to make inferences about a population by
studying a relatively small sample chosen from it
• Consider a machine that makes steel rods for use in optical storage
devices. The specification for the diameter of the rods is 0.45 ± 0.02
cm. During the last hour, the machine has made 1000 rods. The
quality engineer wants to know approximately how many of these
rods meet the specification. He does not have time to measure all
1000 rods.
• So he draws a random sample of 50 rods, measures them, and finds
that 46 of them (92%) meet the diameter specification.
• It is unlikely that the sample of 50 rods represents the population
of 1000 perfectly. The proportion of good rods in the population is
likely to differ somewhat from the sample proportion of 92%.
Questions the engineer might need to answer on the basis of these sample data:
1. The engineer needs to compute a rough estimate of the likely size of the difference
between the sample proportion and the population proportion. How large is a typical
difference for this kind of sample?
2. The quality engineer needs to note in a logbook the percentage of acceptable rods
manufactured in the last hour. Having observed that 92% of the sample rods were
good, he will indicate the percentage of acceptable rods in the population as an
interval of the form 92% ± x%, where x is a number calculated to provide reasonable
certainty that the true population percentage is in the interval. How should x be
calculated?
3. The engineer wants to be fairly certain that the percentage of good rods is at least
90%; otherwise he will shut down the process for recalibration. How certain can he be
that at least 90% of the 1000 rods are good?
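One common way to approach the three questions above is the normal approximation to the sample proportion. The sketch below is an illustration under stated assumptions, not necessarily the method used later in the course: the 95% confidence level is an assumed choice, the finite size of the population (1000 rods) is ignored, and the approximation is rough here because only 4 rods in the sample were defective.

```python
from math import sqrt
from statistics import NormalDist

n = 50             # rods in the sample
good = 46          # rods that met the specification
p_hat = good / n   # sample proportion, 0.92

# Question 1: typical size of the difference between the sample
# proportion and the population proportion (the standard error).
std_error = sqrt(p_hat * (1 - p_hat) / n)

# Question 2: half-width x of an interval p_hat ± x.
# The 95% confidence level is an assumption made for this sketch.
z = NormalDist().inv_cdf(0.975)
x = z * std_error

# Question 3: confidence level at which a one-sided lower bound
# for the population proportion equals 90%.
certainty = NormalDist().cdf((p_hat - 0.90) / std_error)

print(f"standard error: {std_error:.4f}")
print(f"interval: {p_hat:.0%} ± {x:.1%}")
print(f"confidence that at least 90% are good: {certainty:.0%}")
```

Under these assumptions the typical difference is just under 4 percentage points, x is about 7.5, and the engineer can be only about 70% certain that at least 90% of the rods are good.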
Convenience Sample
→ only include people who are easy to reach
Staff   1    2    3    4    5    6    7    8    9    10
Salary  15k  18k  16k  14k  15k  15k  12k  17k  90k  95k
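One thing the salary table can be used to show is how a few large values pull the mean away from the typical value while the median barely moves. The sketch below assumes that is the intended use of the table. The variable names are illustrative and the salaries are entered in thousands.

```python
import statistics

# Salaries of the ten staff members from the table, in thousands.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(f"mean:   {mean_salary}k")
print(f"median: {median_salary}k")
```

The mean comes out at 30.7k, a figure that none of the first eight staff members earns anything close to, while the median is 15.5k.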
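65 55 89 56 35 14 56 55 87 45 92
With an even number of scores, for example the same data without the last score (92):
65 55 89 56 35 14 56 55 87 45
Rearranged in order of magnitude:
14 35 45 55 55 56 56 65 87 89
Only now we have to take the 5th and 6th scores in our data set (55 and 56)
and average them to get a median of 55.5.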
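The procedure above (sort, then take the middle score or average the two middle scores) can be sketched as a small function. The name `sample_median` is made up for this illustration. The result is checked against `statistics.median` from the Python standard library.

```python
import statistics


def sample_median(values):
    """Return the median: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of scores: a single middle score exists.
        return ordered[middle]
    # Even number of scores: average the two middle scores.
    return (ordered[middle - 1] + ordered[middle]) / 2


scores_11 = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
scores_10 = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

median_11 = sample_median(scores_11)  # middle (6th) of 11 sorted scores
median_10 = sample_median(scores_10)  # average of the 5th and 6th sorted scores

print(median_11, median_10)
```

With the eleven scores the median is the 6th sorted score, 56. With the ten scores it is 55.5, matching the calculation above.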
• The sample mode is the most frequently occurring value in
a sample
• Normally, the mode is used for categorical data where we
wish to know which is the most common category
• Mode in continuous data (figure omitted)
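As a small sketch of the sample mode defined above, the snippet below counts how often each value occurs and reports the most frequent one(s). It reuses the salary and score data from earlier in the lecture. The variable names are illustrative.

```python
import statistics
from collections import Counter

# Data reused from earlier in the lecture (salaries in thousands).
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]
scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

# Frequency of each value in the salary sample.
salary_counts = Counter(salaries)

# multimode returns every value that ties for the highest count,
# in the order the values first appear in the data.
salary_modes = statistics.multimode(salaries)
score_modes = statistics.multimode(scores)

print(salary_counts.most_common(1))
print(salary_modes)
print(score_modes)
```

The salary sample has a single mode (15k, which occurs three times), while the score sample has two modes, 55 and 56, each occurring twice. A sample can therefore have more than one mode.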
• Find the modes and the range for the sample below