Quantitative AnalysisJD
Disclaimer
This presentation is purely for academic purposes and does not carry
any commercial value.
All non-academic images used in this presentation are the property of
their respective holder(s). Images are used for indicative
purposes only and do not carry any other meaning.
Quantitative Analysis
Descriptive Statistics
Data Analysis with Excel
www.pibm.in
TEXT BOOK
Hector Guerrero, “Excel Data Analysis - Modeling and Simulation”,
Springer-Verlag
Camm, Cochran, Fry, Ohlmann, Anderson, Sweeney, Williams,
“Essentials of Business Analytics”, Cengage Learning
Measures of Association
Descriptive Statistics
A set of numbers that describes or summarizes a data set is called
descriptive statistics.
Examples: mean, standard deviation, quartiles, range, kurtosis, skewness.
For example, the average rainfall in a city, or an average salary increase of 3% in an organization.
Measures of Dispersion
How ‘spread-out’ or ‘dispersed’ the data is
Measures of Central Tendency
Mean (the arithmetic mean)
The sum of the observations divided by the number of observations.
Median
The middle value of a set of ordered observations; it divides the data
into two equal halves.
Mode
The most frequently occurring value among the observations.
The average, or mean, is the most popular statistic.
When is the median a better summary description of data than the
mean?
Measures of Central Tendency
Let's take a small firm with seven employees and the following salaries:
     C          D          Excel Formula
1    Employee   Salary
2    1          $28,000    mean = AVERAGE(D2:D8) = $86,429
3    2          $33,000    median = MEDIAN(D2:D8) = $34,000
4    3          $33,000
5    4          $34,000    mode = MODE.SNGL(D2:D8) = $33,000
6    5          $37,000
7    6          $40,000
8    7          $400,000
9    Total      $605,000
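The three measures can be cross-checked outside Excel. The sketch below uses Python's standard `statistics` module on the seven salaries from the table; the variable names are illustrative and not part of the original workbook.

```python
import statistics

# Salaries of the seven employees from the table above (in dollars)
salaries = [28_000, 33_000, 33_000, 34_000, 37_000, 40_000, 400_000]

mean_salary = statistics.mean(salaries)      # counterpart of AVERAGE
median_salary = statistics.median(salaries)  # counterpart of MEDIAN
mode_salary = statistics.mode(salaries)      # counterpart of MODE.SNGL

print(f"mean   = ${mean_salary:,.0f}")
print(f"median = ${median_salary:,.0f}")
print(f"mode   = ${mode_salary:,.0f}")
```

The single $400,000 salary pulls the mean above six of the seven salaries, while the median stays close to what a typical employee earns.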
Mean versus Median
Mode
The mode is the most frequently occurring value in a set of data.
It is not a very relevant descriptive statistic when the data is essentially
continuous, for example the daily dollar-to-euro exchange rate.
Measures of Dispersion / Spread
The standard deviation is a statistical measure of the degree of variation of
the observations relative to the mean of all the observations.
Variance = (Standard Deviation)²
Range = Maximum of data - Minimum of data
Inter Quartile Range = Third quartile (Q3) - First quartile (Q1)
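For reference, the usual textbook definitions behind these measures can be written as follows. The notation is an assumption, since the slides do not fix any symbols: x_i are the observations, x̄ is the sample mean, n is the sample size, and Q1 and Q3 are the first and third quartiles.

```latex
\begin{align*}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2 && \text{(sample variance)}\\
s &= \sqrt{s^2} && \text{(sample standard deviation)}\\
\text{Range} &= x_{\max}-x_{\min}\\
\text{IQR} &= Q_3-Q_1
\end{align*}
```

The population versions of the variance and standard deviation divide by n instead of n - 1.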
Measures of Dispersion / Spread
          Firm 1    Firm 2
1         34,500    35,800
2         30,700    25,500
3         32,900    31,600
4         36,000    41,700
5         34,100    35,300
6         33,800    33,800
7         32,500    30,800
Mean      33,500    33,500
Median    33,800    33,800
Standard Deviation
In this example the mean and the median are the same for both firms, so they
cannot be used to tell the two data sets apart. A measure of dispersion shows
which data set is more consistent.
Excel Formula
Std Dev Sample = STDEV.S(B2:B8)
Std Dev Population = STDEV.P(B2:B8)
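As a rough cross-check of the two Excel formulas, the sketch below computes both versions of the standard deviation for the Firm 1 and Firm 2 columns with Python's `statistics` module. The list names are illustrative; `stdev` and `pstdev` are assumed here to play the roles of STDEV.S and STDEV.P.

```python
import statistics

# Values from the Firm 1 / Firm 2 table above
firm1 = [34_500, 30_700, 32_900, 36_000, 34_100, 33_800, 32_500]
firm2 = [35_800, 25_500, 31_600, 41_700, 35_300, 33_800, 30_800]

for name, data in (("Firm 1", firm1), ("Firm 2", firm2)):
    sample_sd = statistics.stdev(data)       # divides by n - 1, like STDEV.S
    population_sd = statistics.pstdev(data)  # divides by n, like STDEV.P
    print(f"{name}: mean={statistics.mean(data):,.0f} "
          f"sample sd={sample_sd:,.0f} population sd={population_sd:,.0f}")
```

Firm 1's standard deviation comes out at roughly a third of Firm 2's, so Firm 1 is the more consistent of the two even though both share the same mean.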
Standard Deviation
Understanding the standard deviation measure…
Rule of Thumb
For roughly bell-shaped (normal) data, approximately 68% of the data lie within one
standard deviation of the mean, and approximately 95% lie within two standard deviations.
Mean of variable Sales = $58.31, Std Dev of Sales = $31.39 (Sales_data.xlsx)
The manager can state that about 68% of the sales values lie between
($58.31 - $31.39, $58.31 + $31.39) = ($26.91, $89.70)
and about 95% of the sales values lie between
($58.31 - 2*$31.39, $58.31 + 2*$31.39) = (-$4.48, $121.10), or (0, $121.10) because sales cannot be negative.
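The interval arithmetic can be sketched in a few lines of Python. The function name `interval` is hypothetical, and the inputs are the rounded mean and standard deviation quoted on the slide, so the bounds may differ by a cent from the slide's figures, which were presumably computed from unrounded values.

```python
# Summary figures quoted on the slide for the Sales variable
mean_sales = 58.31
sd_sales = 31.39

def interval(mean, sd, k):
    """Return (lower, upper) bounds k standard deviations around the mean."""
    return mean - k * sd, mean + k * sd

low1, high1 = interval(mean_sales, sd_sales, 1)  # about 68% of values
low2, high2 = interval(mean_sales, sd_sales, 2)  # about 95% of values

# Sales cannot be negative, so clip the lower bound at zero
low2_clipped = max(0.0, low2)

print(f"about 68%: ({low1:.2f}, {high1:.2f})")
print(f"about 95%: ({low2_clipped:.2f}, {high2:.2f})")
```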
The ‘Inter Quartile Range’ measure
Excel Formula
Quartile = QUARTILE.INC(array, quart)
IQR = QUARTILE.INC(array, 3) - QUARTILE.INC(array, 1)
Kurtosis and skewness
Kurtosis (peakedness) and skewness (asymmetry) are measures related to
the shape of data organized into a frequency distribution.
Skewness
Skewness is the degree of distortion from the symmetrical bell curve of the normal
distribution. It measures the lack of symmetry in a data distribution.
A symmetrical distribution has a skewness of 0.
Negative skew: The left tail is longer, and the distribution is said to be
left-skewed. In such a distribution the mean is typically lower than the
median, and the skewness is less than zero.
mean < median < mode, β₁ < 0
Positive skew: The right tail is longer, and the distribution is said to be
right-skewed. The mean is typically greater than the median, and the
skewness is greater than zero.
mean > median > mode, β₁ > 0
Skewness =
Kurtosis =
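One common pair of sample formulas, and the pair implemented by Excel's SKEW and KURT worksheet functions, is sketched below. Whether the original slides used these or a simpler moment-based version is an assumption. Here x̄ is the sample mean, s is the sample standard deviation, and n is the number of observations.

```latex
\begin{align*}
\text{Skewness} &= \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^{3}\\[4pt]
\text{Kurtosis} &= \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^{4}-\frac{3(n-1)^{2}}{(n-2)(n-3)}
\end{align*}
```

With this definition the kurtosis is an excess kurtosis, so data from a normal distribution give a value close to 0 rather than 3.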
Histogram
F4: =MIN(B4:B11)
F5: =QUARTILE(B4:B11,1)
F6: =MEDIAN(B4:B11)
F7: =QUARTILE(B4:B11,3)
F8: =MAX(B4:B11)
Box plot
31
Consider data set:
{1.25, 1.5, 2.5, 2.5, 3.1, 3.2, 4.1, 4.25, 4.75, 4.8, 4.95, 5.1}
minimum = 1.25, maximum = 5.1, n = number of observations = 12
Five Number summary
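Find the five-number summary of the following data set and draw its box plot:
{3, 7, 8, 5, 12, 14, 21, 13, 18}.
Sorted data: {3, 5, 7, 8, 12, 13, 14, 18, 21}
Five-number summary (Q1 and Q3 are the medians of the lower and upper halves, excluding the overall median):
Minimum = 3
Q1 = 6
Median = 12
Q3 = 16
Maximum = 21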
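The five-number summary can be reproduced with Python's `statistics.quantiles`. This is a minimal sketch: the helper name `five_number_summary` is hypothetical, and the mapping of the `exclusive` and `inclusive` methods to Excel's QUARTILE.EXC and QUARTILE.INC conventions is stated as an assumption about how the two tools interpolate.

```python
import statistics

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

def five_number_summary(values, method="exclusive"):
    """Return (min, Q1, median, Q3, max) for the given values."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method=method)
    return min(values), q1, q2, q3, max(values)

# 'exclusive' reproduces the answer given above (QUARTILE.EXC convention)
print(five_number_summary(data, "exclusive"))
# 'inclusive' follows the QUARTILE.INC convention
print(five_number_summary(data, "inclusive"))
```

On this data set the inclusive convention gives Q1 = 7 and Q3 = 14 instead of 6 and 16, so it is worth stating which quartile convention a box plot uses.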
Application of Box plot and Whiskers
A box plot is a visual representation of the mean and the quartiles of a data set.
The rectangular box has a length equal to the inter-quartile range and is bounded by the 1st
quartile and the 3rd quartile. The horizontal line is the median, and the dot represents the mean of
the data. This visual is particularly useful when you are comparing two
data sets.
For example, suppose you want to compare the earnings of male and female stars.
A side-by-side box plot comparison is very useful. Earnings are in millions of
dollars.
Here we will not draw the box plots from the data; instead we will see the
importance of descriptive statistics by comparing the two plots.
Application of Box plot and Whiskers
As per the plots, the highest-paid female star earns almost the same as the highest-paid
male star. However, the spread of female stars' earnings is greater than that of male stars, as evidenced
by the range, which is the difference between the maximum and the minimum, and also by the
inter-quartile range, which is the height of the rectangular box.
Another interesting observation is that the mean earning is greater than the median
earning for female stars, while for male stars the mean earning is less than the median
earning.
This implies that the earnings of female stars tend to be much more right-skewed
than those of male stars.
A few female stars have very high earnings, pushing the overall mean
above the median.
Thus we have covered two measures of dispersion in data: the range and the inter-quartile
range.
1. Calculate the 1st and 3rd quartiles.
Q1 = QUARTILE.INC(B2:B14,1)
Q3 = QUARTILE.INC(B2:B14,3)
2. Evaluate the interquartile range.
IQR = QUARTILE.INC(B2:B14,3) - QUARTILE.INC(B2:B14,1)
Types of Data - 1) Time series data
With time series data, we are particularly interested in how the data varies
over time and in identifying patterns that occur systematically over time.
Types of Data - 2) Cross-sectional data
Cross-sectional data is data that is taken at a single point in time or under
circumstances where time, as a dimension, is irrelevant.
Types of Statistics
Statistics can be broken down into two areas:
Descriptive statistics: describes and summarizes data. You are just
describing what the data shows: a trend, a specific feature, or a
certain statistic (like a mean or median).
Inferential statistics: uses statistics computed from a sample to draw conclusions
or make predictions about population characteristics.
Descriptive Statistics
What is Descriptive Statistics in Excel?
Steps to Enable Descriptive Statistics in Excel
(Analysis ToolPak)
Step 1: Go to File > Options.
Step 2: Go to Add-ins.
Step 3: Under Add-ins, on the right-hand side, you will see all the inactive applications. Select Analysis ToolPak and click on Go.
Step 4: Now you will see all the add-ins available for your Excel. Select Analysis ToolPak and click on OK.
Now you should see the Data Analysis option under the Data tab.
Click on Data Analysis and you will see all the available analysis techniques under this tool, such as Anova, t-Test, F-Test, Correlation, Histogram, Regression, Descriptive Statistics and many more.
Descriptive Statistics in Excel
Now, look at simple data from a test, which includes the scores of 10
students. Using these scores, we need to run the Descriptive Statistics data
analysis.
Step 1: Go to Data > Data Analysis.
Step 2: Once you click on Data Analysis, you will see a list of all the available analysis
techniques. Scroll down and select Descriptive Statistics.
Step 3: Under Input Range, select the range of scores including the heading, check
Labels in first row, select Output Range and give a cell reference, and check
Summary statistics.
Step 4: Click on OK to complete the task. In the referenced cell you will see the
summary report of the Descriptive Statistics data analysis.
Descriptive Statistics
Student RollNo    Marks
1                 72
2                 91
3                 77
4                 80
5                 72
6                 46
7                 81
8                 54
9                 83
10                46

Marks
Mean                   70.20
Standard Error          5.05
Median                 74.50
Mode                   72.00
Standard Deviation     15.97
Sample Variance       255.07
Kurtosis               -1.01
Skewness               -0.65
Range                  45.00
Minimum                46.00
Maximum                91.00
Sum                   702.00
Count                  10.00
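The summary report can be reproduced from the ten marks with a short Python sketch. It uses only the standard library; the skewness and kurtosis lines assume the sample formulas behind Excel's SKEW and KURT functions, and the dictionary name `report` is illustrative.

```python
import math
import statistics

marks = [72, 91, 77, 80, 72, 46, 81, 54, 83, 46]

n = len(marks)
mean = statistics.mean(marks)
sd = statistics.stdev(marks)          # sample standard deviation
z = [(x - mean) / sd for x in marks]  # standardized marks

# Sample skewness and excess kurtosis (assumed to match Excel's SKEW and KURT)
skewness = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

report = {
    "Mean": mean,
    "Standard Error": sd / math.sqrt(n),
    "Median": statistics.median(marks),
    "Mode": statistics.mode(marks),  # 72 and 46 both occur twice; the first is returned
    "Standard Deviation": sd,
    "Sample Variance": statistics.variance(marks),
    "Kurtosis": kurtosis,
    "Skewness": skewness,
    "Range": max(marks) - min(marks),
    "Minimum": min(marks),
    "Maximum": max(marks),
    "Sum": sum(marks),
    "Count": n,
}
for label, value in report.items():
    print(f"{label:<20}{value:>10.2f}")
```

The negative skewness agrees with the mean (70.20) lying below the median (74.50): the two marks of 46 stretch the left tail.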
Descriptive Statistics
A B C D E F G
1 Name Gender Age Height (ft) Weight (kg) Study Hrs Exam Score
2 Ram 1 Male 24 4.64 52 6 46
3 Ram 2 Male 26 4.34 85 8 47
4 Ram 3 Male 26 4.56 68 11 42
5 Raj 1 Male 23 6.17 82 11 40
6 Raj 2 Male 27 4.35 88 12 30
7 Raj 3 Male 32 5.33 60 7 62
8 Ramya 1 Female 27 6.44 85 8 83
9 Ramya 2 Female 22 5.98 50 9 48
10 Ramya 3 Female 26 5.74 64 12 95
11 Ramya 4 Male 25 4.62 59 8 56
12 Ramya 5 Male 27 6.25 57 11 47
13 Ajith 1 Male 26 4.93 66 12 56
14 Ajith 2 Male 25 5.47 79 9 55
15 Ajith 3 Male 30 4.49 52 11 86
16 Ajith 4 Female 23 6.29 76 11 32
17 Ajith 5 Female 25 5.00 59 9 69
18 Karan 1 Female 32 4.09 88 12 47
19 Karan 2 Female 23 6.10 50 8 65
20 Karan 3 Female 32 6.06 77 8 77
21 Karan 4 Male 32 5.43 73 11 56
22 Karan 5 Male 24 4.29 49 8 95
23 Karan 6 Male 28 4.09 56 6 49
24 Karan 7 Male 27 5.67 72 12 56
25 Karan 8 Male 28 6.63 87 7 51
26 Karan 9 Male 26 4.14 52 7 55
Example : Women's Health Survey
Let us take a look at an example. In 1985, the USDA commissioned a study of
women’s nutrition. Nutrient intake was measured for a random sample of 737
women aged 25-50 years. The following variables were measured (Nutrient.xlsx):
Calcium(mg)
Iron(mg)
Protein(g)
Vitamin A(μg)
Vitamin C(mg)
Nutrient.xlsx
Calcium(mg) Iron(mg) Protein(g) Vitamin A(μg) Vitamin C(mg)
1 522.29 10.188 42.561 349.13 54.141
2 343.32 4.113 67.793 266.99 24.839
3 858.26 13.741 59.933 667.9 155.455
4 575.98 13.245 42.215 792.23 224.688
5 1927.5 18.919 111.316 740.27 80.961
6 607.58 6.8 45.785 165.68 13.05
7 1046.19 18.433 116.418 1119.59 158.986
8 181.21 12.762 64.156 78.49 26.942
9 327.08 8.693 49.161 568.38 49.977
10 383.09 13.667 103.844 1029.2 8.404
11 1227.58 16.81 107.698 623.32 37.063
12 845.44 7.417 56.519 273.16 21.692
13 460.52 12.375 52.126 204.02 139.515
14 349.56 16.074 50.679 597.67 53.576
15 74.46 6.838 77.878 6.48 69.714
16 280.31 14.548 88.798 556.27 100.92
17 422.36 21.638 148.713 418.03 263.039
Descriptive Statistics
Example: Nutrient Intake Data - Descriptive Statistics The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
Calcium 737 624.0492537 397.2775401 7.4400000 2866.44
Statistic             Calcium (mg)   Iron (mg)   Protein (g)   Vitamin A (μg)   Vitamin C (mg)
Mean                        624.05       11.13         65.80           839.64            78.93
Standard Error               14.63        0.22          1.13            60.17             2.71
Median                      548.29       10.03         61.07           524.03            53.59
Mode                          #N/A        8.95         89.24             0.00             0.00
Standard Deviation          397.28        5.98         30.58          1633.54            73.60
Sample Variance          157829.44       35.81        934.88       2668452.37          5416.26
Kurtosis                      2.61       11.54          3.15           250.80             2.73
Skewness                      1.31        2.30          1.14            13.41             1.60
Range                      2859.00       58.67        251.01         34434.27           433.34
Minimum                       7.44        0.00          0.00             0.00             0.00
Maximum                    2866.44       58.67        251.01         34434.27           433.34
Sum                      459924.30     8202.74      48497.14        618811.25         58170.27
Count                       737.00      737.00        737.00           737.00           737.00
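The Standard Error row of this report follows from the Standard Deviation and Count rows, because the standard error of the mean is the standard deviation divided by the square root of the sample size. The sketch below recomputes it from the rounded figures in the report; the dictionary names are illustrative.

```python
import math

n = 737  # Count row of the report
std_dev = {  # Standard Deviation row of the report
    "Calcium (mg)": 397.28,
    "Iron (mg)": 5.98,
    "Protein (g)": 30.58,
    "Vitamin A (μg)": 1633.54,
    "Vitamin C (mg)": 73.60,
}

# Standard error of the mean = standard deviation / sqrt(sample size)
standard_error = {name: sd / math.sqrt(n) for name, sd in std_dev.items()}

for name, se in standard_error.items():
    print(f"{name:<15}{se:>8.2f}")
```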
Measure of Association
Correlation Coefficient
Suppose X denotes the number of cups of hot chocolate sold daily at a local café,
and Y denotes the number of apple cinnamon muffins sold daily at the same café.
Then, the manager of the café might benefit from knowing whether X and Y are
highly correlated or not. If the random variables are highly correlated, then the
manager would know to make sure that both are available on a given day. If the
random variables are not highly correlated, then the manager would know that it
would be okay to have one of the items available without the other.
X = number of cups of hot chocolate sold daily at a local café
Y = number of apple cinnamon muffins sold daily at the same café
Correlation coefficient is a measure that describes the strength and direction of a
relationship between two variables.
Covariance
Let X and Y be two random variables. To quantify the linear dependence
between X and Y, we calculate the covariance between the two random
variables.
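A minimal sketch of the calculation is shown below, using the first five calcium and iron observations from the Nutrient.xlsx excerpt shown earlier. The function names are hypothetical, and the sample (n - 1) form of the covariance is assumed.

```python
import math

# First five observations of calcium and iron from the Nutrient.xlsx excerpt
calcium = [522.29, 343.32, 858.26, 575.98, 1927.5]
iron = [10.188, 4.113, 13.741, 13.245, 18.919]

def sample_covariance(x, y):
    """Sum of cross-deviations from the means, divided by n - 1."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance rescaled by both standard deviations; lies in [-1, 1]."""
    sd_x = math.sqrt(sample_covariance(x, x))
    sd_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sd_x * sd_y)

print(f"covariance  = {sample_covariance(calcium, iron):.2f}")
print(f"correlation = {correlation(calcium, iron):.4f}")
```

Because only five of the 737 observations are used, the printed values will not match the full-sample covariance and correlation reported for the survey. The covariance depends on the units of both variables, whereas the correlation is unit-free.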
Correlation Coefficient
Interpretation of Correlation
The example shows a strong negative correlation (about -0.97) between the average
monthly temperature and the number of heaters sold.
Sample Covariance: $\mathrm{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$
For example, the variance of iron intake is $s_{22} = 35.8\ \text{mg}^2$.
The covariance between calcium and iron intake is $s_{12} = 940.1$.
Covariance Matrix
                  Calcium (mg)
Calcium (mg)      157615.29
Iron (mg)         938.81
Protein (g)       6067.572
Vitamin A (μg)    102272.172
Vitamin C (mg)    6692.523
Sample Correlation
The results suggest that protein, iron, and calcium are all
positively associated: intake of each of these three nutrients
increases with increasing intake of the other two.
Correlation Matrix
                 Calcium(mg)   Iron(mg)   Protein(g)   Vitamin A(μg)   Vitamin C(mg)
Calcium(mg)      1.0000
Iron(mg)         0.3954        1.0000
Protein(g)       0.5002        0.6234     1.0000
Vitamin A(μg)    0.1578        0.2438     0.1468       1.0000
Vitamin C(mg)    0.2292        0.3126     0.2121       0.1835          1.0000
Coefficient of Determination
The coefficient of determination is another measure of association and is
simply equal to the square of the correlation. For example, in this case, the
coefficient of determination between protein and iron is
$r_{23}^2 = (0.62337)^2 \approx 0.389$.
This says that about 39% of the variation in iron intake is explained by
protein intake or, conversely, that about 39% of the variation in protein intake is
explained by iron intake. Both interpretations are equivalent.
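The same squaring step applies to every pair of nutrients. The sketch below squares each off-diagonal entry of the correlation matrix reported above; the dictionary layout and names are illustrative.

```python
# Off-diagonal entries of the correlation matrix reported above
correlations = {
    ("Iron", "Calcium"): 0.3954,
    ("Protein", "Calcium"): 0.5002,
    ("Protein", "Iron"): 0.6234,
    ("Vitamin A", "Calcium"): 0.1578,
    ("Vitamin A", "Iron"): 0.2438,
    ("Vitamin A", "Protein"): 0.1468,
    ("Vitamin C", "Calcium"): 0.2292,
    ("Vitamin C", "Iron"): 0.3126,
    ("Vitamin C", "Protein"): 0.2121,
    ("Vitamin C", "Vitamin A"): 0.1835,
}

# Coefficient of determination = squared correlation
r_squared = {pair: r ** 2 for pair, r in correlations.items()}

for (a, b), value in sorted(r_squared.items(), key=lambda item: -item[1]):
    print(f"{a:<10} vs {b:<10} r^2 = {value:.3f}")
```

Protein and iron share the largest proportion of variation, while the pairs involving vitamin A share very little.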
HOMEWORK PROBLEMS