Descriptive Statistics

Some internet sources that may be useful for various aspects of statistics:
https://www.amstat.org/
http://www.stats.gla.ac.uk/steps/glossary/index.html

Some Terminologies
Statistics: Science of learning from data. It deals with collection, organization, analysis, and
interpretation of data.
Data: A list of observations for a variable (may be numerical or non-numerical).

Ex. (a) Observe infant birth weights in a hospital: 7 lbs., 6 lbs 2 oz, 9 lbs etc.
(b) Observe the gender variable among all students in this class: male/female.

The process of statistics:


1. Identify the research objective.
2. Collect data.
3. Organize, summarize, and analyze data.
4. Draw conclusion/inference.
Population: The entire group of individuals that is of our interest.
Sample: A subset of the population being studied.
Ex. To investigate: “what percentage of students at ISI is female?” randomly select 50 students
near ISI canteen.
Population: all ISI students.
Sample: those 50 students selected for the study.
Parameter: Numerical characteristic of a population or a model (typically unknown).
Ex. In the ex above, the true proportion of female students at ISI.
Statistic: Numerical summary of data (typically estimates an unknown parameter).
Ex. Proportion of female students selected in the sample (sample proportion).
Variable: Examples are gender, infant birth weight, amount of rainfall in a day, etc.
Define the following terms in the study/ poll described below:
To determine average parking time of patrons who park at the mall, a statistician monitors and
records parking times of 105 patrons on July 6, 2012.

a) Variable: Parking times of patrons who park at the mall
b) Population: All patrons who parked at the mall.
c) Sample: 105 patrons who parked at the mall on July 6, 2012

Describing and summarizing datasets


Any dataset is a list of observations for a variable (numerical or non-numerical). A variable can be
of two types:
1. Qualitative/Categorical variable: Mathematical operations are NOT meaningful for this type
of variable. A categorical variable can be of two types:
(a) Nominal variable: Ordering is NOT meaningful for this type of variable.
Examples: (i) color: “red”, “blue”, “green”, etc (ii) nationality: “American”, “Canadian”,
“Australian” etc, (iii) Gender: “male”, “female”, (iv) zip codes, (v) race, (vi) religion.
(b) Ordinal variable: Ordering is meaningful for this type.
Examples: (i) letter grades: “A”, “B”, “C”, “D”, and “F”, (ii) “completely agree”,
“mostly agree”, “mostly disagree”, “completely disagree” when measuring opinion, (iii)
educational level.
2. Quantitative/Numerical variable: Mathematical operations are meaningful for this type of
variable. A numerical variable can be of two types:
(a) Discrete variable: Number of objects (cars, books, pens, players, committee members,
coins, students, cellular phones, etc.)
(b) Continuous variable: Height, weight, distance, volume, temperature, cholesterol level,
age, etc.
Data can be summarized: Graphically and Numerically.

Organization of categorical data


Frequency of a category: Number of occurrences of that category in the data.
Relative Frequency of a category = Frequency of a category/sum of all frequencies.
Frequency distribution or table: Lists categories of data and the corresponding frequencies of each
category.
Relative frequency distribution or table: Lists categories of data and the corresponding relative
frequencies of each category.

Ex. Organize the following data by constructing a frequency distribution and relative frequency
distribution for the colors of M & Ms:
Br Y G O Bl R R Bl O G Y Br G Y Br O Bl R G O R Br Y G Br Y O R O R Y Br R Br Br Y Y
R Br R Br Br Y Y Br
Table 1: Frequency or relative frequency table

Category (color) Tally Frequency Relative frequency


Brown ||||| ||||| || 12 12/45
Yellow ||||| ||||| 10 10/45
Red ||||| |||| 9 9/45
Orange ||||| | 6 6/45
Blue ||| 3 3/45
Green ||||| 5 5/45

Note that total frequency = 45 and total relative frequency = 1.
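The tally above can be checked mechanically. The notes use R for all plots; purely as a cross-check of the counts, here is a short Python sketch that rebuilds the same frequency and relative frequency table from the raw color list:

```python
from collections import Counter

# Raw M & M color observations from the example
# (Br = Brown, Y = Yellow, R = Red, O = Orange, Bl = Blue, G = Green)
data = ("Br Y G O Bl R R Bl O G Y Br G Y Br O Bl R G O R Br Y G Br Y O R O R Y Br "
        "R Br Br Y Y R Br R Br Br Y Y Br").split()

n = len(data)                                   # total frequency = 45
freq = Counter(data)                            # frequency of each category
rel_freq = {c: f / n for c, f in freq.items()}  # relative frequency of each category
# freq: Br 12, Y 10, R 9, O 6, Bl 3, G 5; relative frequencies sum to 1
```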

Visual representation of categorical data


1. Bar graph: A graph constructed by labeling each category of data on the x-axis or the y-axis
and frequency or relative frequency on the other axis.
• Bars are rectangles separate from one another and of equal width.
• Bars represent categories and their heights represent frequencies or relative frequencies.
Ex. Using the same data for M & Ms construct a bar graph of the frequency distribution for M & M
colors:

R code for Bar graph of M&M data


freq = c(12, 10, 9, 6, 3, 5)
Gems = matrix(freq, ncol = 1, byrow = TRUE)
rownames(Gems) = c("Brown", "Yellow", "Red", "Orange", "Blue", "Green")
colnames(Gems) = c("Frequency")
Gems = as.table(Gems)        # reading as data table
Gems = as.data.frame(Gems)   # reading as data frame

barplot(Gems$Freq, main = "bar graph of M & M colors", xlab = "Colors",
        ylab = "Frequency", names.arg = Gems$Var1, col = rainbow(length(freq)))

2. Pareto graph: A bar graph in which the bars are drawn in decreasing order of frequency or
relative frequency.

R code for Pareto graph of M&M data


Gems_sorted = Gems[order(Gems[, 3], decreasing = TRUE), ]   # sorting data frame by frequency
barplot(Gems_sorted$Freq, main = "pareto graph of M & M colors", xlab = "Colors",
        ylab = "Frequency", names.arg = Gems_sorted$Var1, col = rainbow(length(freq)))

3. Pie chart: A pie chart is a circle divided into sectors. The sectors represent the categories, and
the area of a sector is proportional to the relative frequency of the category.
Ex. Following data represents the marital status of US residents (in millions) 18 years of age or
older in 2006. Calculate the relative frequencies in percentage.

Marital status Frequency Relative frequency


Never married 125 125/500
Married 290 290/500
Widowed 30 30/500
Divorced 55 55/500

Note that total frequency = 500.

R code for pie chart of marital status data

freq = c(125, 290, 30, 55)
MaritalStatus = matrix(freq, ncol = 1, byrow = TRUE)
rownames(MaritalStatus) = c("Never Married", "Married", "Widowed", "Divorced")
colnames(MaritalStatus) = c("Frequency")
MaritalStatus = as.table(MaritalStatus)
MaritalStatus = as.data.frame(MaritalStatus)
rel.freq = round(freq / sum(freq) * 100, digits = 1)
lbls = paste(MaritalStatus$Var1, " (", rel.freq, "%)", sep = "")
pie(freq, labels = lbls, col = rainbow(length(freq)))

Organization and visual representation of discrete data
For a discrete variable, if the possible values are relatively few then consider each value as a
category.

Ex. The following discrete data represent the number of cars in a household based on a random
sample of 50 households. Construct a frequency and relative frequency distribution.
3 0 1 2 1 1 1 2 0 2 4 2 2 2 1 2 2 0 2 4 1 1 3 2 4 1 2 1 2 2 3 3 2 1 2 2 0 3 2 2 2 3 2 1 2 2 1 1 3 5
Table 2: Frequency table

Category (No. of Cars) Frequency


0 4
1 13
2 22
3 7
4 3
5 1

R codes for barplot of car data

d = c(3,0,1,2,1,1,1,2,0,2,4,2,2,2,1,2,2,0,2,4,1,1,3,2,4,1,2,1,2,2,3,3,2,1,
      2,2,0,3,2,2,2,3,2,1,2,2,1,1,3,5)
freq = c(sum(d == 0), sum(d == 1), sum(d == 2), sum(d == 3), sum(d == 4), sum(d == 5))
cars = matrix(freq, ncol = 1, byrow = TRUE)
rownames(cars) = c("no car", "1 car", "2 cars", "3 cars", "4 cars", "5 cars")
colnames(cars) = c("Frequency")
cars = as.table(cars)
cars = as.data.frame(cars)   # reading as data frame

barplot(cars$Freq, main = "bar graph of number of cars", xlab = "number of cars",
        ylab = "number of households", names.arg = cars$Var1, col = rainbow(length(freq)))

Visual representation of numerical data


1. Histogram: A graph representing the frequency distribution of numerical data. A histogram
consists of rectangles of equal width that touch one another. In a histogram, classes are marked on
the horizontal axis and frequencies are represented by heights on the vertical axis.
Ex. (discrete data) Using the car data given above, construct a histogram.

For continuous data, or for discrete data with a relatively large number of possible values, first
construct a grouped frequency distribution to draw a histogram.

Steps to construct a grouped frequency distribution and draw a histogram


1. Find the range of the data, where range = (highest data value − lowest data value).
2. Select the number of classes (k) and create classes/bins/intervals of equal width.
3. The common class length d is called the class width or bin width. Choose the class width
and the number of classes such that their product is slightly larger than the range.
4. Pick a starting point (lower limit) smaller than the lowest data value and add the class width
to get the lower limit of the next class. Ensure the upper limit of the first class and the lower limit
of the second class do not overlap (classes are disjoint) and that the classes cover the entire range
of the data.

Ex. The following data represent integer scores for a statistics exam. Construct a histogram for the
given data.
60 47 82 95 88 72 67 66 68 98 90 77 86 58 64 95 74 72 88 74 77 39 90 63 68 97 70 64 70 70 58 78
89 44 55 85 82 83 72 77 72 86 50 94 92 80 91 75 76 78

Steps:
1. Range of the data = highest data value − lowest data value = 98 − 39 = 59.
2. Select the number of classes and the class width (for the first table): number of classes = 7 and
class width = 10. (7 × 10 = 70 works because 70 is larger than the range 59.)
3. Pick a starting point smaller than the lowest data value 39 (here 30, which is the 1st lower limit).
4. Add the class width 10 to 30 to get the lower limit 40 of the 2nd class.
Important: For a grouped frequency distribution, classes must be disjoint and must cover the entire
range.

R code for drawing a histogram of exam score data


score = c(60,47,82,95,88,72,67,66,68,98,90,77,86,58,64,95,74,72,88,74,77,39,90,
          63,68,97,70,64,70,70,58,78,89,44,55,85,82,83,72,77,72,86,50,94,92,80,91,75,76,78)

hist(score, nclass = 7, main = "Histogram of scores with class width = 10", xlab = "scores",
     ylab = "frequency", col = "blue")

• Let's draw a histogram of the same data with 13 classes and class width 5.

Remark: The shape of a histogram (distribution) for a dataset changes as the class width d or the
number of classes k changes. If n is the sample size, then one way to find k is to use Sturges'
formula: $k = \lceil \log_2 n \rceil + 1$ (from the binomial, assuming normality). For the exam
score data n = 50, and hence k = 7 using Sturges' formula.
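The Sturges computation is a one-liner; as a quick check (the notes otherwise use R), a Python sketch:

```python
import math

def sturges(n):
    """Number of classes k = ceil(log2(n)) + 1 (Sturges' formula)."""
    return math.ceil(math.log2(n)) + 1

k = sturges(50)   # exam score data: n = 50 gives k = 7
```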

2. Stem and Leaf plot: This is a sorting or graphing technique sometimes used in computer
applications when the datasets are small.
In this plot, each data value is split into a “stem” and a “leaf” where the “leaf” is usually the
last/rightmost digit of the number and the other digits to the left of the “leaf” form the “stem”.
Sorting the data first helps in drawing this plot.
Ex. Construct a stem and leaf plot of the following data (ages of people).
12 20 23 32 35 38 38 39 41 43 43 50 51 52 53 53 55 58 59 59 85

Stem Leaf
1 2
2 0 3
3 2 5 8 8 9
4 1 3 3
5 0 1 2 3 3 5 8 9 9
6
7
8 5

Remark: The shape of the stem and leaf plot, if rotated 90 degrees anticlockwise, resembles the
shape of a histogram. Usually there is no need to sort the leaves, although computer packages
typically do.

R code for stem and leaf plot of age data


age = c(12,20,23,32,35,38,38,39,41,43,43,50,51,52,53,53,55,58,59,59,85)
stem(age, scale = 2)

Measures of Central Tendency
Measures of central tendency are numerical values that locate, in some sense, the center of the data.
1. Sample Mean (x̄): Let x1, x2, ..., xn be a set of sample values from a population. The sample
mean is the arithmetic mean of the values:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$

Similarly, one may consider the geometric mean $\bar{x}_g = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$ for nonnegative values x1, ..., xn,
or the harmonic mean $\bar{x}_h = n\left(\sum_{i=1}^{n} \frac{1}{x_i}\right)^{-1}$ as an alternative measure of central tendency. However,
the arithmetic mean is typically referred to as the sample mean.
Note: E[X] is referred to as the population mean corresponding to a random variable X, and the
sample mean x̄ gives an approximate value of E[X] when x1, ..., xn are realizations of X.
2. Sample Median (x̃): The value that lies in the middle of a dataset after arranging the data
in ascending or descending order.
For odd sample size n: median x̃ = the ((n + 1)/2)-th value in the sorted data.
For even sample size n: median x̃ = the average of the (n/2)-th and (n/2 + 1)-th values in the
sorted data.

Remark 1: The population median m corresponding to a random variable X is such that P(X ≤
m) ≥ 0.5 and P(X ≥ m) ≥ 0.5. If the CDF of X, denoted by F, is continuous, then m is the
solution of F(m) = 0.5. If x1, ..., xn are realizations of X, then x̃ is an approximate value of m.

Remark 2: The R codes for computing the sample mean and median are mean(x) and median(x)
respectively.

3. Sample Mode (x̂): The value(s) of the data that occurs most frequently in the dataset.


Ex. Find the mode(s) of the following data sets.
(a) {0, 1, 2, 3, 3, −3, 3, 6}, mode = 3.
(b) {−1, 1, 1.5, 1.5, 3, 4, 4.5, 4.5, 6}, mode = 1.5 and 4.5.
(c) {−1, 1, 2, 3, 4}, mode = no mode or each data point is a mode.
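A mode helper is easy to write (note that base R's own mode() reports the storage type, not the statistical mode). A Python sketch reproducing examples (a) and (b):

```python
from collections import Counter

def sample_modes(data):
    """Return all values attaining the maximum frequency.
    If every value occurs equally often, every point is returned,
    matching case (c) above ('no mode or each data point is a mode')."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

modes_a = sample_modes([0, 1, 2, 3, 3, -3, 3, 6])             # -> [3]
modes_b = sample_modes([-1, 1, 1.5, 1.5, 3, 4, 4.5, 4.5, 6])  # -> [1.5, 4.5]
```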

Remark:
• Relation between mean, median, and skewness of a distribution: for a right-skewed (positively
skewed) distribution, typically mean > median; for a left-skewed distribution, typically mean <
median; for a symmetric distribution, mean ≈ median.
Other measures of location: percentiles, quartiles and quantiles


Sample Percentiles: After arranging a dataset in ascending order, percentiles divide the data
into 100 equal parts.
For k = 1, 2, . . . , 100, k-th percentile Pk of a dataset is a value such that at least k% of the
observations are less than or equal to Pk and at least (100 − k)% of the observations are greater
than or equal to Pk .

Sample Quartiles: After arranging a dataset in ascending order, 3 quartiles: first quartile Q1 ,
second quartile Q2 , and third quartile Q3 divide the data into 4 equal parts. Clearly, Q2 = median.

Thus: Q1 = P25, Q2 = P50, and Q3 = P75. There are alternative ways to find sample quartiles. For
example, first order the dataset to find the median, which divides the dataset into 2 halves. Then Q1
is the median of the left half (with smaller values) and Q3 is the median of the right half (with larger values).
Quantiles: Percentiles and quartiles are special cases of quantiles. Population quantiles are points
partitioning the range of a probability distribution into disjoint intervals with equal probabilities,
and sample quantiles are points partitioning the observations in a sample in the same way.

For 0 < α < 1, the α-th population quantile qα of a random variable X is a point satisfying:

P (X ≤ qα ) ≥ α and P (X ≥ qα ) ≥ 1 − α.

If X is a continuous random variable, then qα is a solution of F (qα ) = P (X ≤ qα ) = α.

Probability density of a normal distribution, with population quartiles shown. The area below the
red curve is the same in the intervals (−∞, Q1 ), (Q1 , Q2 ), (Q2 , Q3 ), and (Q3 , ∞).

An α-th sample quantile q̂α is a point such that at least α × 100% of the sorted data (arranged in
ascending order) is less than or equal to q̂α and at least (1 − α) × 100% of the sorted data is greater
than or equal to q̂α.
Remark: From the definition, note that quantiles may not be unique.

An algorithm to compute α-th sample quantile:


1. Arrange the data in increasing order and compute r = nα
2. If r is not an integer: Round it up to the nearest integer greater than r, i.e., dre = ceiling of
r. The rounded integer dre is the position (index) of a sample quantile qbα in the ordered list
of data points.
3. If r is an integer: The sample quantile qbα is the average of data values in positions r and
r + 1.
Ex. Find (a) 75th percentile, (b) 2nd quartile, and (c) 0.4-th quantile of rainfall (inches) data in
Boston during the month of April for 15 years:
9.6, 2.5, 3.9, 4.1, 5.9, 1.1, 2.7, 4, 4.7, 6.1, 1.8, 3.4, 4, 5.2, 6.2

Ordered data: 1.1, 1.8, 2.5, 2.7, 3.4, 3.9, 4.0, 4.0, 4.1, 4.7, 5.2, 5.9, 6.1, 6.2, 9.6.
(a) r = 15 ∗ 0.75 = 11.25. Thus, dre = 12. Therefore, 75th percentile = 5.9.
(b) r = 15 ∗ 0.5 = 7.5. Thus, dre = 8. Therefore, 2nd quartile = 4.
(c) r = 15 ∗ 0.4 = 6. Thus, 0.4-th quantile is the average of 3.9 and 4, which is 3.95.
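The three answers above can be reproduced by coding the algorithm directly. The notes use R; as a cross-check, a Python sketch mirroring the ceiling rule:

```python
import math

def quantile_by_algorithm(data, alpha):
    """alpha-th sample quantile by the algorithm described above."""
    x = sorted(data)
    n = len(x)
    r = n * alpha
    if abs(r - round(r)) > 1e-9:      # r not an integer: take position ceil(r)
        return x[math.ceil(r) - 1]    # -1 because Python lists are 0-indexed
    r = int(round(r))                 # r integer: average positions r and r + 1
    return (x[r - 1] + x[r]) / 2

rain = [9.6, 2.5, 3.9, 4.1, 5.9, 1.1, 2.7, 4, 4.7, 6.1, 1.8, 3.4, 4, 5.2, 6.2]
q75 = quantile_by_algorithm(rain, 0.75)  # -> 5.9
q50 = quantile_by_algorithm(rain, 0.50)  # -> 4 (2nd quartile)
q40 = quantile_by_algorithm(rain, 0.40)  # -> 3.95 (average of 3.9 and 4)
```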

R code for finding sample quantiles of rainfall data


rain = c(9.6,2.5,3.9,4.1,5.9,1.1,2.7,4,4.7,6.1,1.8,3.4,4,5.2,6.2)
quantile(rain, probs = c(0.75, 0.5, 0.4))

Note that the R output is: q̂0.75 = 5.55, q̂0.5 = 4, and q̂0.4 = 3.96, which differ from the values
obtained by the above algorithm. There are, in fact, 9 different algorithms available in R to compute
sample quantiles. The default is type=7.
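R's default rule (type=7, one of the Hyndman-Fan types) interpolates linearly at position h = (n − 1)α + 1 of the sorted data, which explains the R output above. A Python sketch of that rule:

```python
import math

def quantile_type7(data, p):
    """R's default sample quantile (type = 7): linear interpolation
    at 0-indexed position h = (n - 1) * p of the sorted data."""
    x = sorted(data)
    h = (len(x) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(x) - 1)          # guard for p = 1
    return x[lo] + (h - lo) * (x[hi] - x[lo])

rain = [9.6, 2.5, 3.9, 4.1, 5.9, 1.1, 2.7, 4, 4.7, 6.1, 1.8, 3.4, 4, 5.2, 6.2]
# quantile(rain, c(0.75, 0.5, 0.4)) in R gives 5.55, 4, 3.96, matching this rule
```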
Trimmed or Truncated Mean calculates the sample mean after discarding given parts of a sample
at the high and low end (typically discarding an equal amount of both). For some 0 < α < 0.5,
α × 100% trimmed mean is the mean computed by excluding α × 100% largest and α × 100%
smallest values from the sample and taking mean of remaining (1 − 2α) × 100% of sample values.
For example, 10% trimmed mean of the dataset

x = {2, 4, −1, −2, 0, 9, 20, 7, 1, 5}

is x0.1 = 3.375 (R code: mean(x, trim=0.1)) which is obtained by discarding −2 and 20 from the
sample and computing mean of remaining 8 observations. Note that sample mean and median are
4.5 and 3 respectively. The trimmed mean is more robust against outliers than the usual mean, and
the median can be regarded as a fully trimmed mean, which is the most robust. Alternatively, the
trimmed mean can also be defined by omitting a fixed number of sample values: if x1, ..., xn are the
ordered (smallest to largest) sample values, the g-times trimmed mean is (x_{g+1} + ... + x_{n−g})/(n − 2g)
for some positive integer g.
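The 10% trimmed mean of the example dataset can be checked by hand or in code. A Python sketch of the proportion-based definition (assuming, as in the example, that n·α is a whole number):

```python
def trimmed_mean(data, alpha):
    """alpha*100% trimmed mean: drop the alpha*100% smallest and largest
    values, then average the remaining (1 - 2*alpha)*100% of the sample.
    Assumes n * alpha is (close to) an integer, as in the example."""
    x = sorted(data)
    g = int(round(len(x) * alpha))     # number of values trimmed from each end
    kept = x[g:len(x) - g]
    return sum(kept) / len(kept)

x = [2, 4, -1, -2, 0, 9, 20, 7, 1, 5]
tm = trimmed_mean(x, 0.1)   # drops -2 and 20, mean of remaining 8 values -> 3.375
```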

Measures of dispersion (range, variance, standard deviation, IQR)


Measures of dispersion are numerical values that describe the spread or variability in the data.
Suppose we observe n data points x1 , x2 , . . . , xn .
1. Range (R) = max{x1, ..., xn} − min{x1, ..., xn} = the difference between the largest and the
smallest data value. R is neither very informative nor robust.
2. Sample variance (S²): measures the average squared distance of the observations from their mean
x̄. That is,

$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$ [Why divide by (n − 1) instead of n?]

Sample standard deviation $S = \sqrt{S^2}$, which has the same unit of measurement as the xi's. The
greater the values of S² or S, the greater the spread of the data, and vice versa.
Note: R code for sample variance is var(x), and for sample standard deviation is sd(x) where
x = c(x1 , . . . , xn ).
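As a numerical check of the (n − 1) convention, here is a short Python sketch (Python's statistics.variance also divides by n − 1, so the two should agree):

```python
import statistics

def sample_variance(data):
    """S^2 = sum((x_i - xbar)^2) / (n - 1)."""
    n = len(data)
    xbar = sum(data) / n
    return sum((xi - xbar) ** 2 for xi in data) / (n - 1)

x = [1, 2, 3, 4, 5]
s2 = sample_variance(x)               # -> 2.5
assert s2 == statistics.variance(x)   # matches the library's (n - 1) convention
```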
3. Mean Absolute Deviation: measures the average absolute distance of the observations from a
center of the data, say mx. That is,

$M = \frac{1}{n}\sum_{i=1}^{n} |x_i - m_x|,$ where mx = mean (x̄), median (x̃), or mode (x̂).

When mx is either the mean or the median, M is sometimes referred to as MAD.


4. Interquartile Range (IQR) = Q3 − Q1 where Q3 and Q1 are 3rd and 1st sample quartiles.
IQR represents the spread of the data between P25 and P75 , i.e., the middle 50% of the data. High
value of IQR indicates the data is more spread out and vice versa. Note that (Q3 − Q1 )/2 is
known as Quartile Deviation or Semi Interquartile Range.

• Based on the IQR, outliers of a dataset may be defined. Let us define the fences:

Lower fence = Q1 − 1.5 × IQR,
Upper fence = Q3 + 1.5 × IQR.

• Outliers: Data points outside the lower or the upper fences can be defined as outliers of the
dataset.
• If outliers don’t affect a statistic substantially, it is considered resistant or robust.
• Among the measure of central tendencies discussed, which statistic is least affected by outliers
and which statistic is most affected by outliers?
• Among the measure of dispersion discussed, which statistic is robust and which statistic is not
robust?

Five number summary and boxplot


Five number summary: Min value, Q1 , median (Q2 ), Q3 , and max value are called the five
number summary.
Boxplot: Graphical representation of the five number summary, the upper and lower fences, and
outliers (if they exist).
Ex. The following data are interest rates charged by ten credit card companies.
6.5%, 12%, 14.4%, 14.4%, 14.3%, 13%, 13.3%, 13.9%, 9.9%, 14.5%

Sorted data: 6.5, 9.9, 12.0, 13.0, 13.3, 13.9, 14.3, 14.4, 14.4, 14.5
Steps:
1. Interquartile range (IQR) = Q3 − Q1 = 14.4 − 12 = 2.4.
2. Lower fence = Q1 − 1.5 ∗ IQR = 12 − 3.6 = 8.4.
3. Upper fence = Q3 + 1.5 ∗ IQR = 14.4 + 3.6 = 18.

4. Smallest value larger than the lower fence = 9.9.
5. Largest value smaller than the upper fence = 14.5.
6. Data value less than the lower fence: 6.5. Data value greater than the upper fence: NONE.
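Steps 1-6 can be scripted. The notes use R; as a cross-check, a Python sketch using the median-of-halves rule for the quartiles described earlier, applied to the interest-rate data:

```python
def median(x):
    """Median of an already-sorted list."""
    n = len(x)
    mid = n // 2
    return x[mid] if n % 2 else (x[mid - 1] + x[mid]) / 2

def five_number_fences(data):
    """Q1/Q3 as medians of the lower/upper halves of the sorted data
    (excluding the overall median when n is odd), plus 1.5*IQR fences."""
    x = sorted(data)
    n = len(x)
    q1 = median(x[: n // 2])
    q3 = median(x[(n + 1) // 2 :])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in x if v < lower or v > upper]
    return q1, q3, iqr, lower, upper, outliers

interest = [6.5, 12, 14.4, 14.4, 14.3, 13, 13.3, 13.9, 9.9, 14.5]
q1, q3, iqr, lower, upper, outliers = five_number_fences(interest)
# q1 = 12, q3 = 14.4, IQR = 2.4, fences 8.4 and 18, outliers = [6.5]
```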
• Using Box plots to describe shapes of distributions:

Q. Based on the boxplot of interest rate data, what type of skewness does the distribution have?
R code for boxplot of interest rate data
interest = c(6.5,12,14.4,14.4,14.3,13,13.3,13.9,9.9,14.5)
boxplot(interest, main = "boxplot of interest rate data", horizontal = TRUE)
axis(1, at = c(6:15), tck = -0.025, las = 0)

Moments, measures of skewness and kurtosis


For a random variable X, the r-th population raw moment is µ′r = E[X^r] for r = 1, 2, ....
If µ = E[X], then the r-th population central moment is µr = E[(X − µ)^r] for r = 1, 2, ....

Let x1, ..., xn be n realizations (i.e., a sample) from X. The r-th sample raw moment and the r-th
sample central moment are

$m'_r = \frac{1}{n}\sum_{i=1}^{n} x_i^r \quad \text{and} \quad m_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r \quad \text{for } r = 1, 2, \ldots.$

Note: Sample mean x̄ = m′1. Sample variance $S^2 = \frac{n}{n-1}\left[m'_2 - (m'_1)^2\right]$. For central moments,
m1 = 0 and $m_2 = \frac{n-1}{n}S^2$. Clearly, different moments give some measure about the distribution of the data.

Measure of Skewness:
Recall: if random variable X is symmetric around mean µ = E[X], then E[(X − µ)r ] = 0 for
r = 1, 3, 5, . . . . (H.W.)
Similarly, for a dataset distributed symmetrically about the mean x̄, the positive terms cancel with
the negative terms, resulting in m3 = 0. Therefore, a small value of m3 indicates a somewhat
symmetric distribution. Also, if m3 is positive, then the positive terms in m3 dominate the negative
terms, that is, there are more large observations to the right of x̄ stretching further than those to
the left of x̄. We call this kind of data distribution positively skewed, or skewed to the right.
Similarly, when m3 is negative, the data distribution is called negatively skewed, or skewed to the
left. Formally, a measure of skewness (or asymmetry) is defined as

$\gamma_1 = \frac{m_3}{(\sqrt{m_2})^3} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}},$

known as the (Fisher-Pearson) coefficient of skewness.


Note that the population coefficient of skewness is E[(X − µ)³]/σ³, where σ² = Var(X). Therefore,
this measure is scale or unit invariant, in the sense that if the observations are measured in some
other unit, the value of the skewness coefficient will not change.
Kurtosis:
The population kurtosis of a random variable X is the fourth standardized central moment:

$\mathrm{Kurt}[X] = \frac{E[(X - \mu)^4]}{\sigma^4} = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right].$

Note that, for X ∼ N(µ, σ²), Kurt[X] = 3. So, in order to compare the kurtosis of a r.v. X
with that of the N(µ, σ²), an alternative measure of kurtosis, called excess kurtosis, is defined as
γ2 = E[(X − µ)⁴]/σ⁴ − 3. Some authors refer to excess kurtosis simply as kurtosis.
See https://en.wikipedia.org/wiki/Kurtosis#Excess_kurtosis for details about the distributions.
Sometimes, Kurt[X] is said to measure the "peakedness" of a probability distribution about its
central value (say, the mean). However, this interpretation is disputed. Westfall (Ref: Westfall,
Peter H. (2014), "Kurtosis as Peakedness, 1905 - 2014. R.I.P.", The American Statistician, 68 (3):
191-195) noted that "...its unambiguous interpretation relates to tail extremity. Specifically, it
reflects either the presence of existing outliers (for sample kurtosis) or the tendency to produce
outliers (for the kurtosis of a probability distribution)." The underlying logic is straightforward:
kurtosis represents the average (or expected value) of standardized data raised to the fourth power.
Standardized values less than 1, corresponding to data within one standard deviation of the mean
(where the "peak" occurs), contribute minimally to kurtosis, because raising a number less than 1
to the fourth power brings it closer to zero. The meaningful contributors to kurtosis are data values
outside the peak region, i.e., the outliers. Therefore, kurtosis primarily measures outliers and
provides no information about the central "peak".

Figure: Probability density functions for selected distributions with mean 0, variance 1, and
different excess kurtosis. Here, D: Laplace, S: hyperbolic secant, L: logistic, N: normal, C: raised
cosine, W: Wigner semicircle, U: uniform distributions.
The sample excess kurtosis for data x1, x2, ..., xn is naturally given by

$\text{sample excess kurtosis} = \frac{m_4}{m_2^2} - 3.$
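Both γ1 and the sample excess kurtosis are simple ratios of central moments. The notes use R; as an illustration, a Python sketch computed directly from the definitions:

```python
def central_moment(data, r):
    """m_r = (1/n) * sum((x_i - xbar)^r)."""
    n = len(data)
    xbar = sum(data) / n
    return sum((xi - xbar) ** r for xi in data) / n

def skewness(data):
    """Fisher-Pearson coefficient gamma_1 = m3 / m2^(3/2)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Sample excess kurtosis = m4 / m2^2 - 3."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

x = [1, 2, 3, 4, 5]   # symmetric about its mean 3, so skewness(x) = 0
# here m2 = 2 and m4 = 6.8, so excess_kurtosis(x) = 6.8 / 4 - 3 = -1.3
```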

Q. Suppose data values x1, x2, ..., xn are linearly transformed as y1, y2, ..., yn, where yi = a + bxi
for i = 1, ..., n, and a, b ∈ ℝ. A statistic T(x1, ..., xn) is called invariant under linear transformation
if T(x1, ..., xn) = T(y1, ..., yn). Can you comment on the invariance property of the statistics studied
so far, i.e., sample mean, median, variance, IQR, moments, coefficient of skewness, and kurtosis?

Normal Datasets
Sometimes real datasets have bell-shaped histograms. A "bell-shaped" histogram, in some cases,
may be referred to as the ideal bell-shaped curve, that is, the normal curve that defines a normal
distribution (studied in your probability class). That is why datasets with bell-shaped histograms
are often called normal datasets. For normal datasets we have the empirical rule, which follows
from the properties of the normal distribution.
Ex. SAT math scores have a bell-shaped distribution with mean 515 and standard deviation 114.
(a) What percentage of SAT scores is less than 401 or greater than 629?

Empirical rule for a bell-shaped distribution

(b) What percentage of SAT scores is greater than 743?


(c) What percentage of SAT scores fall between 287 and 515?
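Parts (a)-(c) follow from the 68-95-99.7 empirical rule once each bound is converted to standard-deviation units. A Python sketch of that arithmetic (using the rule's rounded percentages):

```python
# Empirical (68-95-99.7) rule applied to SAT scores: mean 515, sd 114.
MEAN, SD = 515, 114
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}   # % of data within k standard deviations

def z(score):
    """Distance of a score from the mean, in standard deviations."""
    return (score - MEAN) / SD

# (a) below 401 or above 629: both bounds are 1 sd away -> 100 - 68 = 32%
a = 100 - WITHIN[round(abs(z(401)))]
# (b) above 743 = mean + 2 sd: half of the remaining 5% -> 2.5%
b = (100 - WITHIN[round(z(743))]) / 2
# (c) between 287 (= mean - 2 sd) and the mean 515: half of 95% -> 47.5%
c = WITHIN[round(abs(z(287)))] / 2
```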

Paired datasets and the sample correlation coefficient


Two numerical variables are often studied together to investigate any possible relationship between
the two variables. Examples: (a) shoe size and weight; (b) height and weight of a student; (c) the
amount of time spent studying for statistics exam and the score on the exam.
Such datasets consist of pairs of observations {(xi, yi) : i = 1, 2, ..., n}. Suppose, for the time being,
that the x and y variables are numerical.
Ex. Ice cream sales data: The local ice cream shop keeps track of how much ice cream they sell
in a day and the noon temperature on that day. Observations for the last 12 days are recorded in
the table. In this example, x1, ..., x12 are the 12 temperatures in Celsius and y1, ..., y12 are the 12
sales in dollars.

Scatter diagram or plot: A two-dimensional graph where x-axis and y-axis represent data values
from x-variable and y-variable respectively.
Points on the scatter plot may create some patterns: (a) Linear: Points may follow an imaginary
line. (b) Nonlinear: Points may follow a curve.

Notice that warmer weather leads to more sales; there seems to be a linear relationship between
the temperature and sales variables.
R code to draw scatter plot of ice cream data
data = read.table("icecream.txt")
plot(data[, 1], data[, 2], xlab = "temperature", ylab = "sales", type = "p", col = "red", lwd = 2)
abline(lm(data[, 2] ~ data[, 1]))
Note: you need to keep the data file "icecream.txt" in the current R directory.

• Q. Is there a numerical measure that determines the linear relationship between two variables?


Sample correlation coefficient (r) measures the strength and direction of the linear relationship
between two numerical variables.
Let $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ be the sample means of the x and y variables, and let the
corresponding sample standard deviations be $S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ and
$S_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}$. Let the sample covariance between the x and y variables be

$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$

The sample correlation coefficient, also called Pearson's correlation coefficient, is

$r = \frac{S_{xy}}{S_x S_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}.$

By the Cauchy-Schwarz inequality,

$\left|\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\right| \le \sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2},$

where equality holds iff yi = a + bxi for i = 1, ..., n and some a, b ∈ ℝ.

Properties of r:
1. r is a unit free measure such that −1 ≤ r ≤ 1.
2. r only measures strength of linear relationship between x and y variables.
3. r > 0: positive linear relation, r < 0: negative linear relation. If r = ±1, two variables have
perfect linear relationship.

4. r = 0 implies no linear relationship between two variables. But, there may be some “nonlinear”
relationship between x and y variables.
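The formula for r can be coded directly; the (n − 1) factors in Sxy, Sx, and Sy cancel, so deviation sums suffice. A Python sketch (the notes use R), which also illustrates property 3 with a perfectly linear pair:

```python
import math

def pearson_r(x, y):
    """r = Sxy / (Sx * Sy), computed from the deviation sums
    (the 1/(n-1) factors cancel in the ratio)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2 * v + 1 for v in x]   # perfect positive linear relation
r = pearson_r(x, y)          # -> 1.0
```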

• If we interchange x and y variables, does the correlation change?


• Is the correlation coefficient r affected by a linear transformation of the data?
The population version of Pearson's correlation coefficient between random variables X and Y is

$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y},$

where $\mu_X = E[X]$, $\mu_Y = E[Y]$, $\sigma_X = \sqrt{\mathrm{Var}(X)}$, and $\sigma_Y = \sqrt{\mathrm{Var}(Y)}$.

Ex. What is wrong with these statements?
The correlation between height and weight of Computer Science students
(a) is 2.61
(b) is 0.61 inches per pound
(c) is 0.61, so the correlation between weight and height is −0.61
(d) is 0.61 using inches and pounds, but converting inches to centimeters would make r > 0.61
(since an inch equals about 2.54 centimeters)

Rank Correlations
There are correlation measures other than the Pearson’s correlation coefficient which can detect
non-linear relationships. These measures are based on ranks thereby depending on the relative
positions of the x and y components of the n pairs instead of their actual values. This is why
the name rank correlation. For this reason, one could use such measures for ordinal variables
as well, but the occurrence of ties will be too often creating some complications. Actually, the
following two rank correlation coefficients measures the degree of a monotone relationship between
two categorical/numerical variables.
General Properties:
A higher rank correlation coefficient implies more agreement between rankings. The coefficient is
inside the interval [-1, 1] and assumes the value:
• 1 if the agreement between the two rankings is perfect; the two rankings are the same.
• 0 if the rankings are completely independent.
• −1 if the disagreement between the two rankings is perfect; one ranking is the reverse of the
other.

Spearman’s Rank Correlation:


It is a measure of statistical dependence between the rankings of two variables. Recall that we have
observations of the form {(xi , yi ), i = 1, . . . , n}.
• Let ri be the rank of the i-th observation (i.e., xi ) w.r.t. the x-observations in ascending order.
Similarly, let si be the rank of the i-th observation (i.e., yi ) w.r.t. the y-observations.
• For the time being, suppose there are no ties in these ranks.
• If there is perfect positive correlation, then si = ri for all i = 1, . . . , n.
• If there is perfect negative correlation, then we should have si = n − ri + 1 for all i = 1, . . . , n.
So, writing $d_i = r_i - s_i$, note that
$$\sum_{i=1}^{n} d_i = \sum_{i=1}^{n} r_i - \sum_{i=1}^{n} s_i = \frac{n(n+1)}{2} - \frac{n(n+1)}{2} = 0.$$

So, the sum $\sum_{i=1}^{n} d_i$ is of no use. Let us instead consider
$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (r_i - s_i)^2,$$
which is equal to 0 if $r_i = s_i$ for all $i$. For the other extreme, when $s_i = n - r_i + 1$ for all $i$,
$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (2r_i - n - 1)^2 = \frac{n(n^2 - 1)}{3}. \quad \text{(verify)}$$
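For the “(verify)” step: since the ranks $r_1, \ldots, r_n$ are a permutation of $1, \ldots, n$, using the power sums $\sum_{i=1}^{n} i = n(n+1)/2$ and $\sum_{i=1}^{n} i^2 = n(n+1)(2n+1)/6$,
$$\sum_{i=1}^{n}(2r_i - n - 1)^2 = \sum_{i=1}^{n}(2i - n - 1)^2 = 4\sum_{i=1}^{n} i^2 - 4(n+1)\sum_{i=1}^{n} i + n(n+1)^2$$
$$= \frac{2n(n+1)(2n+1)}{3} - 2n(n+1)^2 + n(n+1)^2 = n(n+1)\,\frac{2(2n+1) - 3(n+1)}{3} = \frac{n(n^2-1)}{3}.$$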

Spearman’s rank correlation coefficient $r_{SP}$ is defined as
$$r_{SP} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}.$$

• Thus: −1 ≤ rSP ≤ 1, where rSP = 1 if there is a perfect positive association (i.e., if x increases,
then y increases and if x decreases, then y decreases) and rSP = −1 if there is a perfect negative
association (i.e., if x increases, then y decreases and if x decreases, then y increases).
• While Pearson’s correlation assesses linear relationships, Spearman’s correlation assesses monotonic
relationships. If there are no repeated data values, a perfect Spearman correlation of +1 or -1 occurs
when each of the variables is a perfect monotone function of the other.
Exercise: Check that rSP is the same as the Pearson’s correlation coefficient r based on the rank
variables {(ri , si ) : i = 1, . . . , n} if all ranks are distinct integers (i.e., no tie occurs).
If tie occurs: If there is any tie, that is broken in the usual manner by assigning the same rank to
the units with tied observations. For example, if the 5-th and 6-th observations in the sorted data
are tied, we assign the rank 5.5 to both of them. Also, if the 5-th, 6-th and the 7-th observations in
the sorted data are tied, we assign the rank 6 to all three of them.

Kendall’s Tau τ :
Note that the observation in each unit, say the i-th unit, is a pair (xi , yi ) with corresponding ranks
(ri , si ). Again, assume no tie for the time.
• For each pair of units, say (i, j) with i < j, we assign a score of +1 if their ranks w.r.t.
both the x and y variables are in the same direction, that is, if {ri < rj and si < sj } or if
{ri > rj and si > sj }; such pairs are called concordant.
• We assign a score of −1 to a pair (i, j) with i < j if their ranks w.r.t. both the x and y variables
are in the reverse direction, that is, if {ri < rj and si > sj } or if {ri > rj and si < sj }; such
pairs are called discordant.
• Let P be the number of pairs (i.e., (i, j) with i < j) with +1 scores and let Q denote the number
of pairs with −1 scores. Clearly, $P + Q = \binom{n}{2}$.
Kendall’s τ is defined as
$$\tau = \frac{P - Q}{\binom{n}{2}} = \frac{\text{no. of concordant pairs} - \text{no. of discordant pairs}}{\binom{n}{2}}.$$

Properties: The denominator is the total number of pair combinations, so the coefficient must be
in the range −1 ≤ τ ≤ 1.

• If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the
coefficient has value +1.
• If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the
other) the coefficient has value −1.
• If there is no monotonic relationship between x and y variables, then we would expect the
coefficient to be approximately zero.
If tie occurs: The ties are broken as before and a score of 0 is assigned to any pair (i, j), with
i < j, if the above two conditions for scores of +1 and −1 with strict inequality are not met.
Remark: For each pair (i, j), define
$$a_{ij} = \begin{cases} +1 & \text{if } r_i < r_j, \\ -1 & \text{if } r_i > r_j, \\ 0 & \text{if } r_i = r_j. \end{cases}$$
Similarly, define
$$b_{ij} = \begin{cases} +1 & \text{if } s_i < s_j, \\ -1 & \text{if } s_i > s_j, \\ 0 & \text{if } s_i = s_j. \end{cases}$$
(H.W.) We can show that Pearson’s correlation coefficient r between the aij and bij variables,
i.e., computed from the pairs {(aij , bij ) : i, j = 1, . . . , n}, gives Kendall’s τ .

R code for Pearson’s and rank correlations


x = rnorm(1000, 0, 1)
y = exp(sqrt(exp(x)))
cor(x, y, method = "pearson")
cor(x, y, method = "spearman")
cor(x, y, method = "kendall")

Remark: Association doesn’t imply causation; i.e., the fact that two variables are associated does not
necessarily mean that one variable causes the other.
Ex 1. “The number of firemen and the severity of the fire are highly correlated.”
Wrong conclusion: the high correlation implies that the firemen cause the fire.
Ex 2. “The faster windmills are observed to rotate, the more wind is observed to be.”
Wrong conclusion: wind is caused by the rotation of windmills. In practice, it is rather the other
way around.
Ex 3. “Children that watch a lot of TV are the most violent. So, TV makes children more violent.”
Not necessarily true. This could easily be the other way round; that is, violent children like watching
more TV than less violent ones.
Ex 4. Study shows that there is a significant correlation (0.791) between a country’s consumption of
chocolate and the number of Nobel prizes (averaged per person).
