05.1 Data Organization PRESENTATION

This document provides an introduction to statistics, covering key concepts such as data collection, descriptive statistics, statistical inference, and the importance of samples. It explains various methods for presenting data, including frequency distributions, histograms, and stem-and-leaf displays, as well as measures of central tendency and variation. Additionally, it discusses quartiles and percentiles in the context of data analysis.

Uploaded by

ckranock

Section 5 – Introduction to Organization & Description of Data (Statistics)
ENGR 3311 – Engineering Math Methods
Instructor: Michael Weeks, PhD
Spring 2024
Introduction to Statistics
• Statistics includes the collection, processing, analysis, and
interpretation of numerical data.
• The results from the statistical analysis can provide the basis for
making decisions or choosing actions.
• Statistics can also be described as the study of how to make
inferences and decisions in the face of uncertainty and variability.
• Probability theory is devoted to the study of uncertainty and
variability.
• Descriptive Statistics:
• the presentation of data in tables and charts,
• the summarization of data by means of numerical
descriptions and graphs.
• Statistical Inference:
• the task of making generalizations based on the sample
data,
• allows for making inferences beyond the information
contained in the data set.
• Any experiment or investigation involves the collection of
relevant data.
• A thorough evaluation requires an exhaustive set of data to be
collected.
• A unit is defined as a single entity, usually an object or a person,
whose characteristics are of interest, i.e. the source of each
measurement.
• A variable is often used to quantify the characteristics of interest.
• A population of units is the complete collection of units about
which information is sought.
• A statistical population is the set of all measurements that
correspond to each unit in the entire population of units about
which information is sought.
Population and Sample

• It is often impossible or infeasible to obtain a complete set
of data.
• In most situations, we must work with only partial
information.
• A sample from a statistical population is the subset of
measurements that are actually collected in the course of
an investigation.
• The distinction between the data actually collected and the
vast information of all potential observations is a key to
understanding statistics.

• The sample needs both to be representative of the
population and to be large enough to contain sufficient
information to answer the questions about the population
that are crucial to the investigation.
• The selection of a sample from a finite population must be
done impartially and objectively.
• Avoid any bias due to self-selected samples.
• The selection can be carried out using a chance mechanism,
or a random number generator.

• Why use samples?
• The population may be too large to count
• The population may be too dangerous to observe
• The population may be too difficult to measure
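
The chance mechanism mentioned above can be sketched with a pseudo-random number generator. This is a minimal illustration only: the population of 1000 numbered units, the sample size of 25, and the seed are assumptions made for the example and are not part of the course material.

```python
import random

# Hypothetical finite population: units labelled 1..1000 (assumed for illustration).
population = list(range(1, 1001))
SAMPLE_SIZE = 25  # assumed sample size

# A fixed seed makes the draw reproducible; drop it for a fresh draw each run.
rng = random.Random(2024)

# sample() draws without replacement, so every unit has the same chance of
# being selected and no unit can appear twice.
sample = rng.sample(population, SAMPLE_SIZE)

print(sorted(sample))
```

Because the units are chosen by the generator rather than by the investigator or by the units themselves, the selection avoids the self-selection bias described above.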

Presentation of Data
• Data obtained from experiments and investigations are
often voluminous and need to be condensed into a suitable form
for extracting any meaningful information.
• It is typical to group the data and present them in tabular
or graphical form.
• Graphical presentations are often the most effective way of
communicating the information.

Raw data:
71 91 63 99 93 88 95 63 67
76 65 68 82 81 83 61 77 100
87 68 85 60 98 89 80 78 82

Range     Frequency
90-100    6
80-89     9
70-79     4
60-69     8

Figure 1.1: Histogram of Grades (x-axis: Grade Range; y-axis: Frequency)
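
The grouping behind the grade histogram can be reproduced by tallying the raw data into the four grade ranges. The sketch below is one possible way to do it; the variable names are illustrative, and the text bars are only a rough stand-in for the plotted figure.

```python
grades = [71, 91, 63, 99, 93, 88, 95, 63, 67,
          76, 65, 68, 82, 81, 83, 61, 77, 100,
          87, 68, 85, 60, 98, 89, 80, 78, 82]

# Class limits (both ends inclusive), as labelled on the histogram axis.
classes = [(60, 69), (70, 79), (80, 89), (90, 100)]

# Count how many grades fall inside each class.
frequency = {
    f"{low}-{high}": sum(low <= g <= high for g in grades)
    for low, high in classes
}

for label, count in frequency.items():
    # One '#' per observation gives a rough text histogram.
    print(f"{label:>6} {count:2d} {'#' * count}")
```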

Pareto Diagrams
• A diagram that contains both a bar chart and a line graph.
• The bar chart represents the individual values, usually arranged in
descending order.
• The cumulative total, or percentage (%), is represented by
the line graph.
• The purpose of the Pareto diagram is to highlight the most
important among a set of factors.

• The following example illustrates the amount of
improvement that can be made by addressing the first two
major causes of faults on a machine.

Fault Frequency
Power fluctuations 6
Unstable controller 22
Operator error 13
Worn tool 2
Other 5
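
The two ingredients of a Pareto diagram, the bars in descending order and the cumulative percentage line, can be computed directly from the fault table. The following is a small sketch under that reading; it produces the numbers to plot rather than the plot itself, and the variable names are illustrative.

```python
faults = {
    "Power fluctuations": 6,
    "Unstable controller": 22,
    "Operator error": 13,
    "Worn tool": 2,
    "Other": 5,
}

total = sum(faults.values())

# Bars of the Pareto diagram: causes sorted by descending frequency.
ordered = sorted(faults.items(), key=lambda item: item[1], reverse=True)

# Line of the Pareto diagram: running total as a percentage of all faults.
pareto = []
running = 0
for cause, count in ordered:
    running += count
    pareto.append((cause, count, round(100 * running / total, 1)))

for cause, count, cum_pct in pareto:
    print(f"{cause:<20} {count:3d} {cum_pct:6.1f}%")
```

With these counts, the two largest causes (unstable controller and operator error) account for 35 of the 48 faults, or about 73%.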

Dot Diagrams

• Used for identifying variations, or patterns of variations.

3 6 -2 4 7 4 3

• Dot plots are usually used for small data sets. They are
useful for highlighting clusters, gaps, skews in distribution,
and outliers.
38.9 58.0 96.3 122.2 155.6 333.3 3408.0

• Dot diagrams can be generated for multiple samples on the same
scale to help reveal the differences between them.
Example:
• Samples of the copper content in a welding material
produced from one plant are as follows.
0.27 0.35 0.37
• Samples from another plant are as follows.
0.23 0.15 0.25 0.24 0.30 0.33 0.26

Frequency Distributions
• A frequency distribution is a table that divides a set of data
into a suitable number of classes (or categories), showing
also the number of items belonging to each category.
• This grouping often highlights some important features of
the data.
• Once the data have been grouped, each observation has lost
its identity in the sense that its exact value is no longer
known.
• The first step in constructing a frequency distribution
consists of deciding how many classes to use and choosing
the class limits for each class.
• Use between 5 and 15 different classes.
• The different classes should:
• Not overlap
• Accommodate all the data
• Have the same width
Raw data:
245 333 296 304 276 336 289 234 253 292
366 323 309 284 310 338 297 314 305 330
266 391 315 305 290 300 292 311 272 312
315 355 346 337 303 265 278 276 373 271
308 276 364 390 298 290 308 221 274 343

Classes Frequency
206 – 245 3
246 – 285 11
286 – 325 23
326 – 365 9
366 – 405 4

Frequency Distributions

• The observations in each class are counted to
obtain the frequency distribution.
• The class limits are given to as many decimal places as the
original data.
• The ranges in each class can be defined using the endpoint
convention.
• For example, using the right-hand endpoint convention,
the class (205,245] includes all data greater than 205 and
less than or equal to 245.
• The class boundaries are the endpoints of the intervals that
specify each class.
• The class interval is the length of the range for the class. All
classes are typically of equal length.
• The class marks of a frequency distribution are obtained by
averaging successive class limits or boundaries.
Raw data:
245 333 296 304 276 336 289 234 253 292
366 323 309 284 310 338 297 314 305 330
266 391 315 305 290 300 292 311 272 312
315 355 346 337 303 265 278 276 373 271
308 276 364 390 298 290 308 221 274 343

Classes Frequency
(205,245] 3
(245,285] 11
(285,325] 23
(325,365] 9
(365,405] 4
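
The counting step with the right-hand endpoint convention can be sketched as follows. The starting boundary of 205, the class width of 40, and the five classes are taken from the table above; the index formula assumes integer-valued data, as in this example.

```python
data = [245, 333, 296, 304, 276, 336, 289, 234, 253, 292,
        366, 323, 309, 284, 310, 338, 297, 314, 305, 330,
        266, 391, 315, 305, 290, 300, 292, 311, 272, 312,
        315, 355, 346, 337, 303, 265, 278, 276, 373, 271,
        308, 276, 364, 390, 298, 290, 308, 221, 274, 343]

LOWER, WIDTH, NUM_CLASSES = 205, 40, 5

counts = [0] * NUM_CLASSES
for x in data:
    # Class i is (LOWER + i*WIDTH, LOWER + (i+1)*WIDTH]: the right-hand
    # endpoint belongs to the class, the left-hand endpoint does not.
    # For integer data, subtracting 1 before the floor division does this.
    i = (x - LOWER - 1) // WIDTH
    counts[i] += 1

for i, count in enumerate(counts):
    low = LOWER + i * WIDTH
    high = low + WIDTH
    mark = (low + high) / 2  # class mark: average of the class boundaries
    print(f"({low},{high}]  mark={mark:5.1f}  frequency={count}")
```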

Cumulative Distributions
• A cumulative distribution is an alternative form into which data
can be grouped.
• A cumulative “less-than-or-equal-to” distribution shows the total
number of observations that are less than or equal to the given
values.
• In a cumulative “less-than” distribution, each class includes
its left-hand endpoint but not its right-hand endpoint.
• A cumulative “greater-than” distribution is constructed
similarly by adding the frequencies, one by one, starting at the
other end of the frequency distribution.
Classes      Frequency   Cumulative (≤)   Cumulative (≥)
206 – 245    3           3                50
246 – 285    11          14               47
286 – 325    23          37               36
326 – 365    9           46               13
366 – 405    4           50               4

Percentage Distributions
• Distributions can be compared easily if they are each
converted into percentage distributions.
• This is accomplished by dividing each class frequency by the total
frequency (or number of observations) and multiplying by 100.
• The result is the percentage of data that falls into each class of the
distribution.
Classes      Frequency   Frequency (%)   Cumulative (≤)
206 – 245    3           6%              6%
246 – 285    11          22%             28%
286 – 325    23          46%             74%
326 – 365    9           18%             92%
366 – 405    4           8%              100%
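
The cumulative and percentage columns follow mechanically from the class frequencies. A short sketch, using running sums from either end of the frequency distribution (the variable names are illustrative):

```python
from itertools import accumulate

labels = ["206-245", "246-285", "286-325", "326-365", "366-405"]
freq = [3, 11, 23, 9, 4]
n = sum(freq)

# Running total from the first class: "less-than-or-equal-to" counts.
cum_le = list(accumulate(freq))
# Running total from the last class, flipped back into class order.
cum_ge = list(accumulate(reversed(freq)))[::-1]

# Percentage distribution and its cumulative form.
pct = [100 * f / n for f in freq]
cum_pct = [100 * c / n for c in cum_le]

for row in zip(labels, freq, cum_le, cum_ge, pct, cum_pct):
    print("{:<9} {:>3} {:>3} {:>3} {:>5.0f}% {:>5.0f}%".format(*row))
```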

Graphs of Frequency Distributions

• The most common form of graphical representation of a
frequency distribution is the histogram.
• The histogram consists of rectangles with heights equal to
the class frequencies and the bases extending between
successive class boundaries.
Classes Frequency
206 – 245 3
246 – 285 11
286 – 325 23
326 – 365 9
366 – 405 4

• The cumulative distributions are typically presented in the
form of ogives, where the cumulative frequencies are
plotted at the class boundaries.
• The resulting points are connected by means of straight
lines.
• The curve is the steepest over the class with the highest
frequency.

Classes Cumulative
(205,245] 3
(245,285] 14
(285,325] 37
(325,365] 46
(365,405] 50

Stem-and-Leaf Displays
• The previous methods involved the grouping of large sets of
data to present them in a manageable form.
• This entailed some loss of information.
• To avoid the loss of information, the following stem-and-
leaf display can be used to keep track of the last digits of
the readings within each class.
• The stem is the left-hand column which contains the tens
digits.
• Each number to the right of the vertical line is a leaf.
• For example, the first row corresponds to the data 12, 17,
and 15.
Raw data:
29 44 12 53 21 34 39 25 48 23
17 24 27 32 34 15 42 21 28 37

Classes    Frequency
10 – 19    3
20 – 29    8
30 – 39    5
40 – 49    3
50 – 59    1

Stem-and-leaf display (stem | leaves):
10 – 19 | 2 7 5
20 – 29 | 9 1 5 3 4 7 1 8
30 – 39 | 4 9 2 4 7
40 – 49 | 4 8 2
50 – 59 | 3

• The same stem-and-leaf display can be constructed with the
stem column only showing the digit that corresponds to the
tens.
• The list of units to the right of the vertical line, i.e. the leaves,
can also be listed in ascending order.

Stem-and-leaf display (stem | leaves):
1 | 2 5 7
2 | 1 1 3 4 5 7 8 9
3 | 2 4 4 7 9
4 | 2 4 8
5 | 3
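
A stem-and-leaf display with tens-digit stems and ascending leaves can be built by splitting each reading into its tens and units digits. The sketch below assumes two-digit non-negative integers, as in the raw data for this example.

```python
from collections import defaultdict

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

stems = defaultdict(list)
for x in sorted(data):          # sorting first puts the leaves in ascending order
    stem, leaf = divmod(x, 10)  # tens digit is the stem, units digit is the leaf
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Unlike the frequency table, every original reading can be recovered from this display by recombining each stem with its leaves.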
Descriptive Measures

• In addition to graphical representations of the data,
numerical measures can also be used to describe the data.

• The descriptive measures are computed from a sample of
data (raw or ungrouped) of $n$ measurements
$x_1, x_2, \ldots, x_n$.

• The sample mean $\bar{x}$ is the sum of all of the observations in
the data set divided by the sample size $n$:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• The sample median is the center, or location, of a set of
data. If the observations are arranged in ascending or
descending order:
• the median is the middle value if the number of observations
is odd.
• the median is the average of the two middle values if the
number of observations is even.

Descriptive Measures (Example 1)

An engineering group receives e-mail requests for technical
information from sales and service. The daily numbers of
e-mails for six days are:

11 9 17 19 4 15

Find the mean and median.
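
One way to check the answer is with Python's standard `statistics` module. This is a sketch of the arithmetic, not a worked solution taken from the slides.

```python
from statistics import mean, median

emails = [11, 9, 17, 19, 4, 15]

# Mean: sum of the observations (75) divided by the sample size (6).
sample_mean = mean(emails)

# Median: n = 6 is even, so average the two middle ordered values.
# Ordered data: 4 9 11 15 17 19, so the middle values are 11 and 15.
sample_median = median(emails)

print(f"mean = {sample_mean}, median = {sample_median}")
```

The mean works out to 75/6 = 12.5 and the median to (11 + 15)/2 = 13.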

Descriptive Measures

• The sample mean and median each summarize a set of data in
terms of a single number that describes its “middle” or
average.
• Another important measure is the variation of a set of data
in terms of the amount by which the values deviate from
their mean.
• For a set of observations $x_1, x_2, \ldots, x_n$ with a mean $\bar{x}$, the
following are the deviations from the mean:

$x_1 - \bar{x},\; x_2 - \bar{x},\; \ldots,\; x_n - \bar{x}$

• The mean of the deviations might seem usable as a measure of the
variation in the set of data, but the deviations always sum to $0$.

• The most common measure of variation is the average of
the squared deviations from the mean, computed with the divisor
$n - 1$. This is known as the sample variance $s^2$:

$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

• The greater the variance, the more widely the data are spread
about the mean.
• The calculation of variance uses squares and thus weights
outliers more heavily than data very near the mean.
• The standard deviation $s$ of the observations is the square
root of the variance. It is more commonly used than the
variance since it is expressed in the same units as the
observations.

Descriptive Measures (Example 2)

Calculate the variance and standard deviation of the
following data sample.
0.6 1.2 0.9 1.0 0.6 0.8

$s^2 = 0.055$
$s = 0.235$
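
The stated results can be checked step by step. The sketch below computes the deviations from the mean, confirms that they sum to zero (up to floating-point rounding), and divides the sum of squared deviations by $n - 1$; the comparison with `statistics.variance`, which uses the same $n - 1$ divisor, is included only as a cross-check.

```python
import math
from statistics import mean, variance

data = [0.6, 1.2, 0.9, 1.0, 0.6, 0.8]
n = len(data)

x_bar = mean(data)                      # 5.1 / 6 = 0.85
deviations = [x - x_bar for x in data]  # these always sum to (about) zero

# Sample variance: squared deviations summed and divided by n - 1.
s2 = sum(d ** 2 for d in deviations) / (n - 1)
s = math.sqrt(s2)                       # standard deviation, same units as data

print(f"mean = {x_bar:.2f}, s^2 = {s2:.3f}, s = {s:.3f}")
print("library cross-check:", round(variance(data), 3))
```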
Quartiles and Percentiles

• The median divides the set of data into two halves.

• When an ordered data set is divided into quarters, the
resulting division points are called sample quartiles.

• The first quartile, $Q_1$, is the data value that has one-fourth (25%)
of the observations below its value.

• The first quartile is also known as the 25th percentile, or
$P_{0.25}$.

• The median is also known as the 50th percentile.

Inter Quartile Range (IQR) = 3rd Quartile – 1st Quartile = Q3 – Q1

• The sample $(100p)$th percentile is a value such that at least $100p\%$
of the observations are at or below this value, and at least
$100(1 - p)\%$ are at or above this value ($0 < p < 1$).

To calculate the sample $(100p)$th percentile:

• Order the $n$ observations from the smallest to the largest.
• Determine the product $np$.
• If $np$ is not an integer, round it up to the next integer and find the
corresponding ordered value.
• If $np$ is an integer $k$, find the mean of the $k$th and $(k + 1)$th ordered
observations.

Quartiles and Percentiles (Example 3)

Find the 1st quartile, 2nd quartile, 3rd quartile, and the 93rd
percentile for the following ordered data.

221 234 245 253 265 266 271 272 274 276
276 276 278 284 289 290 290 292 292 296
297 298 300 303 304 305 305 308 308 309
310 311 312 314 315 315 323 330 333 336
337 338 343 346 355 364 366 373 390 391

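$Q_1 = 278$ ($np = 12.5$, so the 13th ordered value)
$Q_2 = 304.5$ ($np = 25$, so the mean of the 25th and 26th ordered values)
$Q_3 = 330$ ($np = 37.5$, so the 38th ordered value)
$P_{0.93} = 366$ ($np = 46.5$, so the 47th ordered value)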
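
The percentile rule (round $np$ up when it is not an integer, otherwise average the $k$th and $(k+1)$th ordered values) can be sketched as a small function. The name `sample_percentile` and the choice to pass the percentage as a whole number are assumptions of this sketch; the whole-number argument lets the code test whether $np$ is an integer without floating-point error. Library routines such as `numpy.percentile` interpolate by default and can return slightly different values.

```python
def sample_percentile(data, percent):
    """Sample percentile by the round-up / average rule.

    `percent` is a whole number strictly between 0 and 100 (e.g. 25 for Q1).
    """
    if not 0 < percent < 100:
        raise ValueError("percent must be strictly between 0 and 100")
    ordered = sorted(data)
    n = len(ordered)
    # k is the integer part of np; remainder is zero exactly when np is an integer.
    k, remainder = divmod(n * percent, 100)
    if remainder:
        # np is not an integer: round up to position k + 1 (1-based), index k.
        return ordered[k]
    # np equals the integer k: average the k-th and (k+1)-th ordered values.
    return (ordered[k - 1] + ordered[k]) / 2


data = [221, 234, 245, 253, 265, 266, 271, 272, 274, 276,
        276, 276, 278, 284, 289, 290, 290, 292, 292, 296,
        297, 298, 300, 303, 304, 305, 305, 308, 308, 309,
        310, 311, 312, 314, 315, 315, 323, 330, 333, 336,
        337, 338, 343, 346, 355, 364, 366, 373, 390, 391]

for label, pct in [("Q1", 25), ("Q2", 50), ("Q3", 75), ("P0.93", 93)]:
    print(label, "=", sample_percentile(data, pct))
```

For the third quartile, $np = 0.75 \times 50 = 37.5$ rounds up to 38, and the 38th ordered value is 330. The 37th value, 323, has only 74% of the observations at or below it, so it does not meet the "at least 75%" requirement.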
Boxplots
• The summary information
contained in the quartiles is
highlighted in a graphic display
called a boxplot.
• The center half of the data,
extending from the 1st to the 3rd
quartile, is represented by a
rectangle.
• The median is identified by a bar
within the box.
• A line extends from the 3rd
quartile to the maximum, and
another line extends from the 1st
quartile to the minimum of the
data set.

• A boxplot displays the five-number summary:
• Minimum
• 1st Quartile $Q_1$
• 2nd Quartile $Q_2$
• 3rd Quartile $Q_3$
• Maximum
• Multiple boxplots on the same
display can reveal differences
and similarities among the
various sets of observations.

Descriptive Measures (Grouped Data)

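• The calculation of the descriptive measures, such as sample
mean and standard deviation, can be simplified if the data
are grouped.

$\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{n}$

$s^2 = \frac{\sum_{i=1}^{k} x_i^2 f_i - \left( \sum_{i=1}^{k} x_i f_i \right)^2 / n}{n - 1}$

where $x_i$ is the class mark of the $i$th class, $f_i$ is the
corresponding class frequency, $k$ is the number of classes
in the distribution, and $n = \sum_{i=1}^{k} f_i$ is the total number of observations.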
Classes Frequency
(205,245] 3
(245,285] 11
(285,325] 23
(325,365] 9
(365,405] 4
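
The grouped-data calculation for this table can be sketched as follows. The sketch assumes that the class marks are the midpoints of the class boundaries shown above (225, 265, ..., 385) and that the variance uses the $n - 1$ divisor of the sample variance. Since every observation in a class is replaced by its class mark, the results approximate the values that the raw data would give.

```python
import math

boundaries = [(205, 245), (245, 285), (285, 325), (325, 365), (365, 405)]
freq = [3, 11, 23, 9, 4]

# Class mark = midpoint of each class (assumed: average of its boundaries).
marks = [(low + high) / 2 for low, high in boundaries]

n = sum(freq)                                          # total observations
sum_xf = sum(x * f for x, f in zip(marks, freq))       # sum of x_i * f_i
sum_x2f = sum(x * x * f for x, f in zip(marks, freq))  # sum of x_i^2 * f_i

grouped_mean = sum_xf / n
grouped_var = (sum_x2f - sum_xf ** 2 / n) / (n - 1)
grouped_std = math.sqrt(grouped_var)

print(f"mean = {grouped_mean:.1f}")
print(f"variance = {grouped_var:.1f}")
print(f"standard deviation = {grouped_std:.2f}")
```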
