Probability & Statistics Basics
Descriptive Statistics
Population & Sample
Descriptive measures quartiles
Percentiles & box plots
Lecture 1
● Statistics
● Descriptive Statistics
● Statistical Inference
● Population vs Sample
● Frequency Distributions
● Cumulative Distributions
● Sample Mean
● Sample Median
● Deviations from Mean
● Variance
● Standard Deviation
● Quartiles
● Percentiles
● Box Plots
Statistics
What is statistics?
Statistics is the study and manipulation of data, including ways to gather, review, analyze,
and draw conclusions from data.
The answers that statistical analysis yields can form the basis for better decisions
and choices of action. Statistical reasoning and methods can help you obtain information
efficiently and draw useful conclusions.
Descriptive Statistics
One must decide carefully how far to go in generalizing from a given set of data.
Population vs Sample
Example: all the students in the class form the population. The students who regularly
attend class form a sample drawn from that population.
Frequency Distributions
A frequency distribution is a table that divides a set of data into a suitable number
of classes (categories), showing also the number of items belonging to each class.
Instead of knowing the exact value of each item, we only know that it belongs to a
certain class.
Example
Data
245 333 296 304 276 336 289 234 253 292 366 323 309 284 310 338 297 314 305 330 266 391 315 305 290 300
292 311 272 312 315 355 346 337 303 265 278 276 373 271 308 276 364 390 298 290 308 221 274 343
Class Frequency
(205,245] 3
(245,285] 11
(285,325] 23
(325,365] 9
(365,405] 4
Note that the class limits are given to as many decimal places as the original data. Had the original data been given to one
decimal place, we would have used the class limits 205.1–245.0, 245.1–285.0, …, 365.1–405.0.
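The tally for this example can be reproduced with a short script. What follows is a minimal sketch, assuming the five classes (205,245], (245,285], (285,325], (325,365], (365,405] implied by the example; the helper name `class_of` is made up for illustration.

```python
from collections import Counter

# The 50 observations from the example above.
data = [
    245, 333, 296, 304, 276, 336, 289, 234, 253, 292, 366, 323, 309,
    284, 310, 338, 297, 314, 305, 330, 266, 391, 315, 305, 290, 300,
    292, 311, 272, 312, 315, 355, 346, 337, 303, 265, 278, 276, 373,
    271, 308, 276, 364, 390, 298, 290, 308, 221, 274, 343,
]

# Assumed class boundaries: each class is the half-open interval (lower, upper].
edges = [205, 245, 285, 325, 365, 405]
classes = list(zip(edges, edges[1:]))


def class_of(x):
    """Return the (lower, upper] class that contains x (hypothetical helper)."""
    for lower, upper in classes:
        if lower < x <= upper:
            return (lower, upper)
    raise ValueError(f"{x} falls outside every class")


# Count how many observations land in each class.
freq = Counter(class_of(x) for x in data)

for lower, upper in classes:
    print(f"({lower},{upper}] {freq[(lower, upper)]}")
```

The class counts come out as 3, 11, 23, 9 and 4, which add up to the 50 observations.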
Class Mark and Class Interval
Class Mark: The class marks of a frequency distribution are obtained by averaging
successive class boundaries.
Class Interval: If the classes of a distribution are all of equal length, then
subtracting the lower limit of a class from its upper limit gives the class interval.
Class Interval: 40
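Both quantities are simple arithmetic on the class boundaries. The sketch below assumes the boundaries 205, 245, 285, 325, 365, 405 from the example; the variable names are illustrative only.

```python
# Assumed class boundaries for the example.
boundaries = [205, 245, 285, 325, 365, 405]
pairs = list(zip(boundaries, boundaries[1:]))

# Class mark: the average of each pair of successive boundaries.
class_marks = [(lower + upper) / 2 for lower, upper in pairs]

# Class interval: upper boundary minus lower boundary, collected over all classes.
widths = {upper - lower for lower, upper in pairs}

print(class_marks)  # [225.0, 265.0, 305.0, 345.0, 385.0]
print(widths)       # {40}
```

Because every class has the same length, the set of widths holds the single value 40.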
Cumulative Distribution (less than or equal to variant)
(205,245] 3
(245,285] 14
(285,325] 37
(325,365] 46
(365,405] 50
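The cumulative column is a running total of the class frequencies, so either column can be recovered from the other. A minimal sketch of that relationship, using only the cumulative counts listed above:

```python
from itertools import accumulate

# Cumulative counts from the table above, one per class.
cumulative = [3, 14, 37, 46, 50]

# Differencing successive cumulative counts recovers the class frequencies.
frequencies = [cur - prev for prev, cur in zip([0] + cumulative[:-1], cumulative)]

# Running totals of the class frequencies rebuild the cumulative column.
rebuilt = list(accumulate(frequencies))

print(frequencies)  # [3, 11, 23, 9, 4]
print(rebuilt)      # [3, 14, 37, 46, 50]
```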
Descriptive Measures: Sample Mean
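For n measurements/data points $x_1, x_2, \ldots, x_n$, the sample mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
or
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
If it is desired to eliminate the effect of extreme (very large or very small) values, the
sample median can be used instead. Order the n observations from smallest to largest: the
median is the middle value if n is odd, and the average of the two middle values if n is even.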
Question
A sample of six university students responded to the question “How much time, in
minutes, did you spend on the social network site yesterday?”
100 45 60 130 30 35
Mean: 66.67
Median: 52.5
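These two answers can be checked with Python's standard `statistics` module. This is only a quick sketch; the variable name `minutes` is chosen here for illustration.

```python
import statistics

# Minutes spent on the social network site, as listed in the question.
minutes = [100, 45, 60, 130, 30, 35]

mean = statistics.mean(minutes)      # sum of the values divided by their count
median = statistics.median(minutes)  # average of the two middle ordered values

print(round(mean, 2))  # 66.67
print(median)          # 52.5
```

The two large values, 100 and 130, pull the mean well above the median.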
Descriptive Measures: Deviations from Mean
Data: 1 2 3 4 5 (mean = 3)
Data: -7 -3 3 10 12 (mean = 3)
We observe that the dispersion of a set of data is small if the values are closely
bunched about their mean, and that it is large if the values are scattered widely
about their mean.
Because the deviations sum to zero, we need to remove their signs. Absolute
value and square are two natural choices.
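A small sketch makes this concrete for the two data sets above. The helper `deviations` is a name introduced here for illustration.

```python
def deviations(values):
    """Return each value minus the mean of the values."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]


for values in ([1, 2, 3, 4, 5], [-7, -3, 3, 10, 12]):
    d = deviations(values)
    total = sum(d)                      # deviations cancel out
    abs_total = sum(abs(x) for x in d)  # signs removed by absolute value
    sq_total = sum(x * x for x in d)    # signs removed by squaring
    print(values, total, abs_total, sq_total)
```

Both data sets have mean 3 and deviations that sum to zero. The absolute and squared totals are 6 and 10 for the first set and 32 and 266 for the second, which reflects its wider spread.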
Using squares, the sample variance $s^2$ is the sum of the squared deviations divided by $n - 1$:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
The reason for dividing by $n - 1$ instead of $n$ is that there are only $n - 1$ independent deviations $x_i - \bar{x}$.
Because their sum is always zero, the value of any particular one is always equal to the negative
of the sum of the other $n - 1$ deviations.
If many of the deviations are large in magnitude, either positive or negative, their squares will be
large and $s^2$ will be large. When all the deviations are small, $s^2$ will be small.
Example
The delay times (handling, setting, and positioning the tools) for cutting 6 parts on
an engine lathe are 0.6, 1.2, 0.9, 1.0, 0.6, and 0.8 minutes. Calculate $s^2$.
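One way to work this example is sketched below: the sum of squared deviations is divided by n − 1 by hand, then compared with `statistics.variance`, which uses the same divisor. Variable names are illustrative.

```python
import math
import statistics

delays = [0.6, 1.2, 0.9, 1.0, 0.6, 0.8]  # delay times in minutes

n = len(delays)
mean = sum(delays) / n                               # about 0.85
s2 = sum((x - mean) ** 2 for x in delays) / (n - 1)  # about 0.055

# statistics.variance also divides by n - 1, so the two values should agree.
assert math.isclose(s2, statistics.variance(delays))

print(round(mean, 2), round(s2, 3))
```

The squared deviations sum to 0.275, so $s^2 = 0.275 / 5 = 0.055$ (minute)$^2$.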
Descriptive Measures: Standard Deviation
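Notice that the units of $s^2$ are not those of the original observations.
In the previous question the data are delay times in minutes, but $s^2$ has the unit
(minute)$^2$. Taking the positive square root of the variance gives the sample standard
deviation, $s = \sqrt{s^2}$.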
The standard deviation is by far the most generally useful measure of variation. Its
advantage over the variance is that it is expressed in the same units as the
observations.
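For the delay-time data, the standard deviation is the square root of the sample variance. A brief sketch using the standard library:

```python
import math
import statistics

delays = [0.6, 1.2, 0.9, 1.0, 0.6, 0.8]  # delay times in minutes

s2 = statistics.variance(delays)  # sample variance, in (minute)^2
s = math.sqrt(s2)                 # sample standard deviation, in minutes

# statistics.stdev computes the same quantity directly.
assert math.isclose(s, statistics.stdev(delays))

print(round(s2, 3), round(s, 3))
```

Here s is roughly 0.235 minutes, a value that can be compared directly with the delay times themselves.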
Descriptive Measures: Quartiles
In addition to the median, which divides a set of data into halves, we can consider
other division points.
When an ordered data set is divided into quarters, the resulting division points are
called sample quartiles.
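The first quartile, Q1, is a value that has one-fourth, or 25%, of the observations
below its value. The first quartile is also the sample 25th percentile $P_{0.25}$.
Likewise, the second quartile, Q2, is the median (the 50th percentile), and the third
quartile, Q3, is the 75th percentile.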
Descriptive Measures: Percentile
The sample 100p-th percentile is a value such that at least 100p% of the
observations are at or below this value, and at least 100(1 − p)% are at or above
this value.
Question
Given the data
136 143 147 151 158 160 161 163 165 167 173 174 181 181 185 188 190 205
check that 158 satisfies the definition of the 25th percentile (p = 0.25).
n = 18
Number of observations at or below 158 = 5 (at least 18 × 0.25 = 4.5 required by the definition)
Number of observations at or above 158 = 14 (at least 18 × 0.75 = 13.5 required by the definition)
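The two counting conditions in the definition translate directly into code. The function name below is hypothetical, and p is passed as a fraction. For values of p where n × p is not exactly representable as a float, the comparison could be affected by rounding.

```python
data = [136, 143, 147, 151, 158, 160, 161, 163, 165,
        167, 173, 174, 181, 181, 185, 188, 190, 205]


def satisfies_percentile_definition(value, data, p):
    """Check both 'at least' conditions for the sample 100p-th percentile."""
    n = len(data)
    at_or_below = sum(1 for x in data if x <= value)
    at_or_above = sum(1 for x in data if x >= value)
    return at_or_below >= n * p and at_or_above >= n * (1 - p)


# 158: 5 values at or below (>= 4.5) and 14 at or above (>= 13.5).
print(satisfies_percentile_definition(158, data, 0.25))  # True
# 151: only 4 values are at or below it, fewer than the 4.5 required.
print(satisfies_percentile_definition(151, data, 0.25))  # False
```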
Question
Given the data
136 143 147 151 158 160 161 163 165 167 173 174 181 181 185 188 190 205
Obtain the quartiles and the 10th percentile.
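n = 18
First: 18 × 0.25 = 4.5 is not an integer, so we round up to 5 and take the 5th ordered value
Q1 = 158
Second: 18 × 0.5 = 9 is an integer, so we average the 9th and 10th ordered values
Q2 = (165 + 167)/2 = 166
Third: 18 × 0.75 = 13.5 is not an integer, so we round up to 14 and take the 14th ordered value
Q3 = 181
10th percentile: 18 × 0.10 = 1.8 is not an integer, so we round up to 2 and take the 2nd ordered value
$P_{0.10}$ = 143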
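The calculation in this question can be sketched as a small function. It assumes the rule that the worked values suggest: compute n × p; if it is an integer k, average the k-th and (k+1)-th ordered values, otherwise round up and take that ordered value. The name `sample_percentile` and the choice to pass the percentile as a whole number (25 rather than 0.25, to keep the arithmetic exact) are assumptions made for this illustration. Library routines such as NumPy's percentile functions use other interpolation conventions by default and may return slightly different values.

```python
def sample_percentile(data, percent):
    """Sample percentile by the round-up / average rule (illustrative helper).

    `percent` is a whole number strictly between 0 and 100, e.g. 25 for Q1.
    """
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < percent < 100:
        raise ValueError("percent must be strictly between 0 and 100")
    ordered = sorted(data)
    n = len(ordered)
    k, remainder = divmod(n * percent, 100)  # n*p in exact integer arithmetic
    if remainder:
        # n*p is not an integer: round up to k + 1, which is 0-based index k.
        return ordered[k]
    # n*p is an integer k: average the k-th and (k+1)-th ordered values.
    return (ordered[k - 1] + ordered[k]) / 2


data = [136, 143, 147, 151, 158, 160, 161, 163, 165,
        167, 173, 174, 181, 181, 185, 188, 190, 205]

print(sample_percentile(data, 25))  # 158
print(sample_percentile(data, 50))  # 166.0
print(sample_percentile(data, 75))  # 181
print(sample_percentile(data, 10))  # 143
```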
Descriptive Measures: Range & Interquartile Range
The minimum and maximum observations also convey information concerning the
amount of variability present in a set of data. Together, they describe the interval
containing all of the observed values, and the length of that interval is the range:
range = maximum − minimum
The amount of variation in the middle half of the data is described by the
interquartile range:
interquartile range = Q3 − Q1
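As a quick illustration, both measures can be computed for the 18-value data set from the percentile questions. The sketch takes Q1 = 158 and Q3 = 181 as given rather than recomputing them.

```python
data = [136, 143, 147, 151, 158, 160, 161, 163, 165,
        167, 173, 174, 181, 181, 185, 188, 190, 205]

# Quartiles taken as given for this data set (assumed inputs, not recomputed here).
q1, q3 = 158, 181

data_range = max(data) - min(data)  # 205 - 136
iqr = q3 - q1                       # 181 - 158

print(data_range)  # 69
print(iqr)         # 23
```

The middle half of the data spans 23 units, about a third of the full range of 69.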