EDA-Lecture 1

The document discusses key concepts related to data and statistics: 1. Data can come from measurements or experiments and be either qualitative (categorical) or quantitative (numerical). Common types of data include nominal, ordinal, interval, and ratio. 2. A population is the entire group being studied, while a sample is a subset of the population used to make inferences. Random sampling is preferred to avoid bias. 3. Statistics allows scientists to analyze samples and draw conclusions about populations despite random variation between samples. Key ideas include estimating population values from sample data and accounting for sampling error.

Uploaded by

Harambe Gorilla

• The collection and analysis of data are fundamental to science and engineering.

• A major difficulty with scientific data is that they
are subject to random variation, or uncertainty.
That is, when scientific measurements are repeated,
they come out somewhat differently each time.
• The methods of statistics allow scientists and
engineers to design valid experiments and to draw
reliable conclusions from the data they produce.
Data
→ plural of datum, which is originally a Latin noun meaning “something given.”
→ facts or information, usually used to calculate, analyze, or plan something.
Qualitative or categorical data
→ data that take the form of categories or attributes.
Examples: course, year level, race, religion

Quantitative or numerical data
→ data obtained from measurements.
Examples: heights, weights, scores, temperatures


The article “Hysteresis Behavior of CFT Column to H-Beam Connections
with External T-Stiffeners and Penetrated Elements” (C. Kang, K. Shin, et
al., Engineering Structures, 2001:1194–1201)

The torques, in the middle column, are numerical data. The failure locations, in the rightmost column, are categorical data.
Data
• Qualitative data
o Binary Data: two mutually exclusive categories
o Nominal Data: categories with no specific rank or order
o Ordinal Data: categories with specific rank or natural order
• Quantitative data
o Discrete Data: can be counted
o Continuous Data: can be measured
Nominal Measurements
→ are used only for identification or classification purposes.

Ordinal Measurements
→ depict the order of variables and not the difference between each of the variables.
Interval Measurements
→ a numerical scale where the order of the variables is known as well as the difference between these variables.
Ratio Measurements
→ not only produce the order of variables but also make the difference between variables known, along with information on the value of true zero.
Population
→ the entire group that you want to draw conclusions about.
Sample
→ a group that you will collect data from.
Inference from a sample
→ the process of selecting and using a sample to draw inferences about the population from which the sample is drawn.
a. We may wish to draw conclusions about the weights of 12,000
adult students (the population) by examining only 100 students (a
sample) selected from this population.
b. We may wish to draw conclusions about the percentage of
defective bolts produced in a factory during a given 6-day week by
examining 20 bolts each day produced at various times during the
day.
What is the population? All bolts produced in the factory during the 6-day week.
What is the sample? The 120 selected bolts (20 bolts per day for 6 days).
• The basic idea behind all statistical methods of data
analysis is to make inferences about a population by
studying a relatively small sample chosen from it
• Consider a machine that makes steel rods for use in optical storage
devices. The specification for the diameter of the rods is 0.45 ± 0.02
cm. During the last hour, the machine has made 1000 rods. The
quality engineer wants to know approximately how many of these
rods meet the specification. He does not have time to measure all
1000 rods.
• So he draws a random sample of 50 rods, measures them, and finds
that 46 of them (92%) meet the diameter specification.
• It is unlikely that the sample of 50 rods represents the population
of 1000 perfectly. The proportion of good rods in the population is
likely to differ somewhat from the sample proportion of 92%.
Questions the engineer might need to answer on the basis of these sample data:
1. The engineer needs to compute a rough estimate of the likely size of the difference
between the sample proportion and the population proportion. How large is a typical
difference for this kind of sample?
2. The quality engineer needs to note in a logbook the percentage of acceptable rods
manufactured in the last hour. Having observed that 92% of the sample rods were
good, he will indicate the percentage of acceptable rods in the population as an
interval of the form 92% ± x%, where x is a number calculated to provide reasonable
certainty that the true population percentage is in the interval. How should x be
calculated?
3. The engineer wants to be fairly certain that the percentage of good rods is at least
90%; otherwise he will shut down the process for recalibration. How certain can he be
that at least 90% of the 1000 rods are good?
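The first question above can be explored by simulation. The sketch below is only an illustration: it assumes, hypothetically, that exactly 900 of the 1000 rods meet the specification (the example does not give the true number), and the function name, the seed, and the number of repetitions are arbitrary choices. It draws many random samples of 50 rods and reports how far the sample proportions typically fall from the population proportion.

```python
import random
import statistics

def simulate_sample_proportions(population_size=1000, good_in_population=900,
                                sample_size=50, repetitions=2000, seed=1):
    """Draw repeated simple random samples and return each sample proportion."""
    rng = random.Random(seed)
    # 1 marks a rod that meets the specification, 0 marks one that does not.
    # The count of 900 good rods is an assumption made for illustration only.
    population = ([1] * good_in_population
                  + [0] * (population_size - good_in_population))
    proportions = []
    for _ in range(repetitions):
        sample = rng.sample(population, sample_size)  # drawn without replacement
        proportions.append(sum(sample) / sample_size)
    return proportions

proportions = simulate_sample_proportions()
true_proportion = 900 / 1000
typical_difference = statistics.mean(abs(p - true_proportion) for p in proportions)

print(f"smallest sample proportion: {min(proportions):.2f}")
print(f"largest sample proportion:  {max(proportions):.2f}")
print(f"average absolute difference from the population: {typical_difference:.3f}")
```

Changing `good_in_population` or `sample_size` shows how the spread of sample proportions depends on the population and on the amount of data collected.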
• We must first learn more about methods of collecting data and of summarizing clearly the basic
information they contain. These are the topics of sampling and descriptive statistics.
• As mentioned, statistical methods are based
on the idea of analyzing a sample drawn
from a population.
• For this idea to work, the sample must be
chosen in an appropriate way.
• The best sampling methods involve random
sampling
• A physical education professor wants to study the physical
fitness levels of students at her university. There are 20,000
students enrolled at the university, and she wants to draw a
sample of size 100 to take a physical fitness test. She
obtains a list of all 20,000 students, numbered from 1 to
20,000. She uses a computer random number generator to
generate 100 random integers between 1 and 20,000 and
then invites the 100 students corresponding to those
numbers to participate in the study. Is this a simple random
sample?
Yes, this is a simple random sample. The computer random number generator gives
every group of 100 students the same chance of being selected.
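The number-drawing step in this example can be sketched in Python. This is a minimal illustration, not the professor's actual procedure: the function name and the seed are assumptions. `random.sample` draws without repeats, so the result is 100 distinct student numbers.

```python
import random

def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Return sample_size distinct ID numbers chosen from 1..population_size."""
    rng = random.Random(seed)
    # Every subset of the requested size is equally likely to be returned.
    return rng.sample(range(1, population_size + 1), sample_size)

# Seed fixed only so that the illustration is repeatable.
chosen_students = draw_simple_random_sample(20_000, 100, seed=42)
print(sorted(chosen_students)[:10])  # the ten smallest selected ID numbers
```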
• A quality engineer wants to inspect rolls of wallpaper in
order to obtain information on the rate at which flaws in
the printing are occurring. She decides to draw a sample of
50 rolls of wallpaper from a day’s production. Each hour
for 5 hours, she takes the 10 most recently produced rolls
and counts the number of flaws on each. Is this a simple
random sample?
No. Not every subset of 50 rolls of wallpaper is
equally likely to comprise the sample.
• A sample of convenience is a sample that is not drawn by a well-defined
random method.
• The big problem with samples of convenience is
that they may differ systematically in some way
from the population.
• For this reason samples of convenience should not
be used, except in situations where it is not feasible
to draw a random sample.
• Simple random samples always differ from
their populations in some ways, and
occasionally may be substantially different
• Two different samples from the same
population will differ from each other as well
• A quality inspector draws a simple random sample of 40 bolts from
a large shipment and measures the length of each. He finds that 34
of them, or 85%, meet a length specification. He concludes that
exactly 85% of the bolts in the shipment meet the specification. The
inspector’s supervisor concludes that the proportion of good bolts is
likely to be close to, but not exactly equal to, 85%. Which
conclusion is appropriate?
Because of sampling variation, simple random samples don’t reflect the population
perfectly. They are often fairly close, however. It is therefore appropriate to infer
that the proportion of good bolts in the lot is likely to be close to the sample
proportion, which is 85%. It is not likely that the population proportion is exactly equal to
85%, however.
• Continuing the example, another inspector repeats the study with a
different simple random sample of 40 bolts. She finds that 36 of
them, or 90%, are good. The first inspector claims that she must
have done something wrong, since his results showed that 85%, not
90%, of bolts are good. Is he right?
No, he is not right. This is sampling variation at work. Two different
samples from the same population will differ from each other and
from the population.
Biased Sample
→ occurs when one or more parts of the population are favored over others

Convenience Sample
→ only includes people who are easy to reach

Voluntary Response Sample
→ consists of people who have chosen to include themselves

Good Sample
→ is representative of the entire population and gives each member an equal chance of being chosen
• With a simple random sample, the differences between the sample and its
population are due entirely to random variation.
• Since the mathematical theory of random variation is well
understood, we can use mathematical models to study the
relationship between simple random samples and their
populations.
• For a sample not chosen at random, there is generally no
theory available to describe the mechanisms that caused
the sample to differ from its population. Therefore,
nonrandom samples are often difficult to analyze reliably.
• Tangible populations
o populations consisting of actual physical objects
o examples: students at a university, the concrete blocks in a pile, the
bolts in a shipment
• Conceptual populations
o produced by measurements made in the course of a
scientific experiment, rather than by sampling from a
tangible population
o we consider data like these to be a simple random sample
from a conceptual population
• A geologist weighs a rock several times on a sensitive scale.
Each time, the scale gives a slightly different reading.
Under what conditions can these readings be thought of as
a simple random sample? What is the population?
If the physical characteristics of the scale remain the same for each
weighing, so that the measurements are made under identical
conditions, then the readings may be considered to be a simple random
sample. The population is conceptual. It consists of all the readings that
the scale could in principle produce.
• A new chemical process has been designed that is supposed to produce a
higher yield of a certain chemical than does an old process. To study the yield
of this process, we run it 50 times and record the 50 yields. Under what
conditions might it be reasonable to treat this as a simple random sample?
Describe some conditions under which it might not be appropriate to treat this
as a simple random sample.
The population is conceptual and consists of the set of all yields that will result
from this process as many times as it will ever be run. What we have done is to
sample the first 50 yields of the process.
If, and only if, we are confident that the first 50 yields are generated under
identical conditions, and that they do not differ in any systematic way from the
yields of future runs, then we may treat them as a simple random sample.
Identify if the sample is random
• A new chemical process is run 10 times each morning for five consecutive
mornings. A plot of yields in the order they are run does not exhibit any
obvious pattern or trend. If the new process is put into production, it will be
run 10 hours each day, from 7 A.M. until 5 P.M. Is it reasonable to consider
the 50 yields to be a simple random sample? What if the process will always be
run in the morning?
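Since the intention is to run the new process in both the morning and the
afternoon, the population consists of all the yields that would ever be observed,
including both morning and afternoon runs. The sample is drawn only from that
portion of the population that consists of morning runs, and thus it is not a simple
random sample.
If the process will always be run in the morning, the population consists only of
morning runs. Since the yields show no obvious pattern or trend, it may then be
reasonable to treat the 50 yields as a simple random sample.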
• The items in a sample are said to be independent if knowing
the values of some of them does not help to predict the
values of the others.
• With a finite, tangible population the items in a simple
random sample are not strictly independent, because as each
item is drawn, the population changes.

• A rule of thumb is that when sampling from a finite
population, the items may be treated as independent so long as
the sample comprises 5% or less of the population.
Sampling with replacement
• With this method, the population is exactly the same on every
draw and the sampled items are truly independent.
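The difference between the two sampling schemes, and the 5% rule of thumb, can be sketched as follows. The population of 1000 numbered items, the sample size, the seed, and the helper name `may_treat_as_independent` are all assumptions made for this illustration.

```python
import random

def may_treat_as_independent(sample_size, population_size, max_percent=5):
    """Rule of thumb: the sample is at most max_percent of the population."""
    # Integer arithmetic avoids floating-point rounding at the 5% boundary.
    return sample_size * 100 <= max_percent * population_size

rng = random.Random(0)                # seed fixed only for repeatability
population = list(range(1, 1001))     # hypothetical population of 1000 items

# Without replacement: the population shrinks on every draw, no repeats.
without_replacement = rng.sample(population, 50)

# With replacement: the population is the same on every draw, repeats allowed.
with_replacement = rng.choices(population, k=50)

print(len(set(without_replacement)))        # always 50 distinct items
print(len(set(with_replacement)))           # may be fewer than 50
print(may_treat_as_independent(50, 1000))   # 50 is 5% of 1000
print(may_treat_as_independent(100, 1000))  # 100 is 10% of 1000
```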
Conceptual population
• We require that the sample items be produced under identical experimental
conditions. In particular, then, no sample value may influence
the conditions under which the others are produced. Therefore,
the items in a simple random sample from a conceptual population
may be treated as independent.
• We may think of a conceptual population as being infinite, or equivalently, that
the items are sampled with replacement.
• Simple random sampling is not the only valid method of
random sampling. But it is the most fundamental, and we will
focus most of our attention on this method. From now on,
unless otherwise stated, the terms “sample” and “random
sample” will be taken to mean “simple random sample.”
Types of Experiments
One-sample experiment
• There is only one population of interest, and a single sample is
drawn from it.
Multisample study
• There are two or more populations of interest, and a sample is
drawn from each population.
Factorial experiment
• The populations are distinguished from one another by the
varying of one or more factors that may affect the outcome.
Controlled experiment
• Designed to determine the effect of changing one or more
factors on the value of a response.
Observational study
• The experimenters cannot control the values of the factors.
• For example, there have been many studies conducted to
determine the effect of cigarette smoking on the risk of lung
cancer; we cannot control who smokes and who doesn’t.
• Descriptive statistics describe, show, and summarize the basic features of a dataset
found in a given study, presented in a summary that describes
the data sample and its measurements.
• Measure of Central Tendency
• Measure of Dispersion
• Measure of Relative Position
• A measure of central tendency is a single value that attempts to
describe a set of data by identifying the central position within
that set of data.
• The mean, median and mode are all valid measures of central
tendency, but under different conditions, some measures of
central tendency become more appropriate to use than others.
• The sample mean is also called the “arithmetic mean,” or,
more simply, the “average.” It is the sum of the numbers in the
sample, divided by how many there are.
• It can be used with both discrete and continuous data,
although its use is most often with continuous data
• A simple random sample of five men is chosen from a
large population of men, and their heights are measured.
The five heights (in inches) are 65.51, 72.30, 68.31, 67.05,
and 70.68. Find the sample mean.
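The sample mean for this example can be worked out directly from the definition (the sum of the values divided by how many there are). The variable names in this short sketch are illustrative.

```python
# Heights of the five men, in inches, as given in the example.
heights = [65.51, 72.30, 68.31, 67.05, 70.68]

# Sample mean = sum of the values / number of values = 343.85 / 5.
sample_mean = sum(heights) / len(heights)

print(f"{sample_mean:.2f}")  # 68.77
```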
• The mean has one main disadvantage: it is particularly
susceptible to the influence of outliers. These are values that
are unusual compared to the rest of the data set by being
especially small or large in numerical value.
• Determine the mean salary of the staff at the factory below.

Staff    1    2    3    4    5    6    7    8    9    10
Salary   15k  18k  16k  14k  15k  15k  12k  17k  90k  95k

• The mean salary for these ten staff is $30.7k. However,
inspecting the raw data suggests that this mean value might
not be the best way to accurately reflect the typical salary
of a worker, as most workers have salaries in the $12k to
$18k range. The mean is being skewed by the two large
salaries.
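A short check of the salary figures shows how strongly the two large salaries pull the mean. In this sketch the cutoff of 18 (thousand dollars) is taken from the range quoted in the text and is used only for comparison; it is not a recommendation to delete the two large values.

```python
# Salaries of the ten staff, in thousands of dollars, from the table.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

# Mean of all ten salaries: 307 / 10.
mean_all = sum(salaries_k) / len(salaries_k)

# Mean of the eight salaries in the 12k to 18k range: 122 / 8.
# Computed for comparison only, to show the effect of the two large values.
typical_salaries = [s for s in salaries_k if s <= 18]
mean_typical = sum(typical_salaries) / len(typical_salaries)

print(f"mean of all ten salaries: {mean_all:.1f}k")          # 30.7k
print(f"mean of the eight smaller ones: {mean_typical:.2f}k")  # 15.25k
```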
• Outliers are a real problem for data analysts. For this
reason, when people see outliers in their data, they
sometimes try to find a reason, or an excuse, to delete
them.
• An outlier should not be deleted, however, unless there is
reasonable certainty that it results from an error.
• If a population truly contains outliers, but they are deleted
from the sample, the sample will not characterize the
population correctly.
• The median, like the mean, is a measure of center. To compute
the median of a sample, order the values from smallest to
largest. The sample median is the middle number.
• If the sample size is an even number, it is customary to take the
sample median to be the average of the two middle numbers.
• The median is often used as a measure of center for samples
that contain outliers.
• To see why, consider the sample consisting of the values
1, 2, 3, 4, and 20. The mean is 6, and the median is 3.
• It is reasonable to think that the median is more representative
of the sample than the mean is
• Determine the median of the data below.

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange the data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

There are 11 values, so the median is the middle (6th) value, which is 56.
• Determine the median of the data below.

65 55 89 56 35 14 56 55 87 45

We first need to rearrange the data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

There are now 10 values, an even number, so we take the 5th and 6th values
(55 and 56) and average them to get a median of 55.5.
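The two median examples follow the same recipe: sort the values, then take the middle value (odd sample size) or the average of the two middle values (even sample size). A minimal sketch, with an illustrative function name and assuming a non-empty sample:

```python
def sample_median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)      # order from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd sample size: the single middle value.
        return ordered[middle]
    # Even sample size: the average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_sample = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
even_sample = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

print(sample_median(odd_sample))         # 56
print(sample_median(even_sample))        # 55.5
print(sample_median([1, 2, 3, 4, 20]))   # 3
```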
• The sample mode is the most frequently occurring value in
a sample
• Normally, the mode is used for categorical data where we
wish to know which is the most common category.
[Figure: mode in categorical data]
[Figure: mode in continuous data]
• Find the modes and the range for the sample below

There are three modes: 80, 179, and 232


• A sample may have more than one mode, which makes it
unclear which value best represents the central tendency of the
data.
• For continuous data (e.g., weight), it is unlikely that any two
values are exactly equal, so a meaningful mode usually cannot be
identified.
• Another problem with the mode is that it
will not provide us with a very good
measure of central tendency when the
most common mark is far away from the
rest of the data.
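The first two problems with the mode can be seen in a small sketch. All three data sets below are hypothetical and were made up for this illustration, as was the function name.

```python
from collections import Counter

def sample_modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

# Hypothetical categorical data: one clear most common category.
transport = ["bus", "car", "bus", "walk", "car", "bus"]
print(sample_modes(transport))   # ['bus']

# Hypothetical discrete data with two modes.
scores = [2, 2, 5, 5, 9]
print(sample_modes(scores))      # [2, 5]

# Hypothetical continuous data: no repeated value, so every value
# is returned and the mode carries no useful information.
weights = [61.2, 58.7, 70.4, 66.1]
print(sample_modes(weights))     # [61.2, 58.7, 70.4, 66.1]
```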
Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median
