Quantitative AnalysisJD

The document provides information about a presentation on quantitative analysis and descriptive statistics. It includes: - A disclaimer stating the presentation is for academic purposes only. - A table of contents outlining topics like measures of central tendency, dispersion, skewness, kurtosis and data analysis tools. - References for further reading on the subject. - Details on measures like the mean, median, mode, standard deviation, and using tools like histograms for analysis.

Uploaded by

Verendra

NOTE

 Slides marked as Extra study are not part of the syllabus.
 They are provided as add-on knowledge.

1
Disclaimer
 This presentation is purely for academic purposes and does not carry
any commercial value.
 All non-academic images used in this presentation are the property of
their respective holder(s). Images are used for indicative
purposes only and do not carry any other meaning.

2
Please follow this…

3
Quantitative Analysis
Descriptive Statistics
Data Analysis with Excel

www.pibm.in

4
TEXT BOOK
Hector Guerrero, “Excel Data Analysis - Modeling and Simulation”,
Springer-Verlag
Camm, Cochran, Fry, Ohlmann, Anderson, Sweeney, Williams,
“Essentials of Business Analytics”, Cengage Learning

REFERENCE LINK FOR EXCEL


https://www.excel-easy.com/
https://support.office.com/en-us/article/excel-for-windows-training-9bc05390-e94c-46af-a5b3-d7c22f6990bb?wt.mc_id=otc_home&ui=en-US&rs=en-US&ad=US
5
Table of Contents
1 Quantitative Analysis.

2 Descriptive Statistics Definition

3 Measures of Central Tendency, Measures of Dispersion

4 Measure of Skewness, Kurtosis

5 Measures of Association

6 Analysis with Data Analysis ToolPak


6
Data Set
 DiscriptiveStat.xlsx
 Nutrient.xlsx

7
Descriptive Statistics

 Descriptive statistics are a set of numbers that ‘describe’ a data set, that is, numbers that summarize the data.
 Examples: mean, standard deviation, quartiles, range, kurtosis, skewness.
 For example, the average rainfall in a city, or an average salary increase of 3% in an organization.

 Measures of Central Tendency (the various averages)


Some ‘central’ aspect of the data

 Measures of Dispersion
How ‘spread-out’ or ‘dispersed’ the data is

8
Measures of Central Tendency
 Mean (the Arithmetic mean)
The average, or mean, of a set of data.
 Median
The Median of a set of ordered observations is a middle number that
divides the data into two parts
 Mode
most often occurring value in the data observations.

9
 The average (mean) is the most widely used statistic.
 When is the median a better summary of the data than the mean?

10
Measures of Central Tendency
 Let's take a seven employee small firm with the following salaries

     C          D
1    Employee   Salary
2    1          $28,000
3    2          $33,000
4    3          $33,000
5    4          $34,000
6    5          $37,000
7    6          $40,000
8    7          $400,000
9    Mean       $86,429

Excel Formula
mean   = AVERAGE(D2:D8)   = $86,429
median = MEDIAN(D2:D8)    = $34,000
mode   = MODE.SNGL(D2:D8) = $33,000
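As a cross-check outside Excel, the same three measures can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the list name `salaries` is introduced here only for illustration.

```python
import statistics

# Salaries of the seven employees from the table above (cells D2:D8)
salaries = [28_000, 33_000, 33_000, 34_000, 37_000, 40_000, 400_000]

mean_salary = statistics.mean(salaries)      # counterpart of AVERAGE(D2:D8)
median_salary = statistics.median(salaries)  # counterpart of MEDIAN(D2:D8)
mode_salary = statistics.mode(salaries)      # counterpart of MODE.SNGL(D2:D8)

print(f"mean   = ${mean_salary:,.0f}")
print(f"median = ${median_salary:,.0f}")
print(f"mode   = ${mode_salary:,.0f}")
```

The single $400,000 salary moves the mean far away from the median, while the median and mode stay close to what most employees earn.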

11
Mean versus Median

 The Mean is influenced to a greater extent by extreme observations.


 Mean = $86,429
 Median = $34,000
Here the median is a better central value than the mean, because the single $400,000 salary pulls the mean far above what six of the seven employees earn.
 Income and price data generally follow this pattern.

12
Mean versus Median

Consider CEO salaries. The
histogram has a longer right tail.
The mean is greater than the
median, and the median is typically
greater than the mode.
In this case the skewness is
greater than zero.

mean > median > mode


13
Mean versus Median

mean < median < mode

Consider students' scores. The
histogram has a longer left tail.
The mean is less than the median, and
the median is typically less
than the mode. In this case the
skewness is less than zero.

14
Mode
 The mode is the most frequently occurring value in a set of data
 It is not a very relevant descriptive statistic when the data is essentially
continuous, for example the daily Dollar-to-Euro exchange rate, where exact values rarely repeat.

15
Measures of Dispersion / Spread
 The standard deviation is a statistical measure of how much the
observations vary around the mean of all the observations
 Variance = (Standard Deviation)^2
 Range = Maximum of data - Minimum of data
 Interquartile Range = 3rd quartile - 1st quartile

16
Measures of Dispersion / Spread

        Firm 1   Firm 2
1       34,500   35,800
2       30,700   25,500
3       32,900   31,600
4       36,000   41,700
5       34,100   35,300
6       33,800   33,800
7       32,500   30,800
Mean    33,500   33,500
Median  33,800   33,800

17
Standard Deviation
 In this example the mean and median are the same for both firms, so they
cannot be used to tell the two data sets apart. A measure of dispersion is used
to find which data set is more consistent.
Excel Formula
Std Dev Sample = STDEV.S(B2:B8)
Std Dev Population = STDEV.P(B2:B8)

         Firm 1     Firm 2
STDEV.S  1678.293   5012.651
STDEV.P  1553.797   4640.813

STDEV.S is generally used to find the standard deviation. Firm 1 has the smaller standard deviation, so its values are more consistent.
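The two Excel functions differ only in the divisor. A short sketch with Python's `statistics` module, where `stdev` divides by n - 1 (like STDEV.S) and `pstdev` divides by n (like STDEV.P), reproduces the figures above; the variable names are illustrative.

```python
import statistics

firm1 = [34_500, 30_700, 32_900, 36_000, 34_100, 33_800, 32_500]
firm2 = [35_800, 25_500, 31_600, 41_700, 35_300, 33_800, 30_800]

for name, data in (("Firm 1", firm1), ("Firm 2", firm2)):
    sample_sd = statistics.stdev(data)       # divides by n - 1, like STDEV.S
    population_sd = statistics.pstdev(data)  # divides by n, like STDEV.P
    print(f"{name}: mean={statistics.mean(data):,.0f} "
          f"STDEV.S={sample_sd:.3f} STDEV.P={population_sd:.3f}")
```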

18
Standard Deviation
Understanding the Standard Deviation measure…
Rule of Thumb
 For roughly bell-shaped data, approximately 68% of the data lie within one standard deviation of the mean, and
approximately 95% lie within two standard deviations of the mean
Mean of variable Sales = $58.31, Std Dev of Sales = $31.39 (Sales_data.xlsx)
 The manager can say that about 68% of sales lie in
($58.31 - $31.39, $58.31 + $31.39) = ($26.91, $89.70)
and about 95% of sales lie in
($58.31 - 2*$31.39, $58.31 + 2*$31.39) = (-$4.48, $121.10), or ($0, $121.10) since sales cannot be negative
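The interval arithmetic can be sketched in a few lines of Python. The helper name `interval` is hypothetical, and because the sketch starts from the rounded mean and standard deviation shown on the slide, its results may differ from the slide's figures by a cent (the slide values were presumably computed from unrounded data).

```python
mean_sales = 58.31   # mean of Sales from the slide
sd_sales = 31.39     # standard deviation of Sales from the slide

def interval(k):
    """Return (low, high) for mean +/- k standard deviations, floored at 0."""
    low = mean_sales - k * sd_sales
    high = mean_sales + k * sd_sales
    return max(low, 0.0), high  # sales cannot be negative

for k, share in ((1, "68%"), (2, "95%")):
    low, high = interval(k)
    print(f"about {share} of sales between ${low:.2f} and ${high:.2f}")
```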

19
The ‘Inter Quartile Range’ measure

Excel Formula
Quartile = QUARTILE.INC(array, quart)
IQR = QUARTILE.INC(array, 3) - QUARTILE.INC(array, 1)

20
Kurtosis and skewness
 Kurtosis (peakedness) and skewness (asymmetry) are measures related to
the shape of a data organized into a frequency distribution.

Coefficient of Skewness = β1 = SKEW(array)

Coefficient of Kurtosis = β2 = KURT(array)
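For reference, Microsoft documents the two worksheet functions with the sample formulas below, where n is the number of observations, x̄ the mean and s the sample standard deviation (STDEV.S). These are the definitions Excel uses; they are not necessarily the formulas shown on the original slides.

```latex
\mathrm{SKEW} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{3}

\mathrm{KURT} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{4} - \frac{3(n-1)^{2}}{(n-2)(n-3)}
```

Because of the subtracted term, KURT reports excess kurtosis, so a normal distribution scores about 0 rather than 3.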

21
Skewness
It is the degree of distortion from the symmetrical bell curve or the normal
distribution. It measures the lack of symmetry in data distribution.
A symmetrical distribution will have a skewness of 0.
Negative skew: The left tail is longer; the
distribution is said to be left-skewed. In
such a distribution, the mean is typically lower
than the median. The skewness is lower
than zero.
mean < median < mode, β1 < 0
Positive skew: The right tail is longer; the
distribution is said to be right-skewed.
The mean is typically greater than the median.
The skewness is greater than zero.
mean > median > mode, β1 > 0
22
Skewness =

23
Kurtosis =

 It is clear from the figure that all three curves, (1), (2) and (3), are
symmetrical about the mean. Still, they are not of the same type: each
has a different peak from the others.
 Curve (1) is known as mesokurtic (normal curve): β2 = 0
 Curve (2) is known as leptokurtic (peaked curve): β2 > 0
 Curve (3) is known as platykurtic (flat curve): β2 < 0

24
Histogram

A histogram is used to summarize discrete or continuous data.
It provides a visual interpretation of numerical data by showing the number
of data points that fall within specified ranges of values (called “bins”).
It is similar to a vertical bar graph. However, unlike a vertical
bar graph, a histogram shows no gaps between the bars.
A histogram is used to find the shape or skewness of data.
The shape of a histogram can be symmetric, left-skewed, right-skewed, unimodal,
multimodal or uniform.
Histogram
1. First, enter the bin numbers (upper levels) in the range C4:C8.
2. On the Data tab in Analysis group > Data Analysis >Histogram > OK.
3. Select the range A2:A19.
4. Click in the Bin Range box and select the range C4:C8.
5. Click the Output Range option button, click in the Output Range box and
select cell F3.
6. Check Chart Output.
Histogram
7. Properly label your bins.
8. To remove the space between the bars, right click a bar, click Format Data Series and change the Gap Width to 0%.
Or
8. Select the chart, go to Design > Quick Layout and select the histogram layout.
9. To add borders, right click a bar, click Format Data Series, click the Fill & Line icon, click Border and select a color.
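The counting that the Histogram tool performs can be sketched in Python. This assumes, as in Excel's tool, that each bin number is an upper limit and that a value equal to the limit is counted in that bin, with a final "More" bin for anything larger. The function name, data and bin limits are hypothetical, invented for illustration.

```python
from bisect import bisect_left

def histogram_counts(values, bin_uppers):
    """Count values per bin; each bin holds values <= its upper limit
    and > the previous limit. The last slot is the overflow bin ("More")."""
    uppers = sorted(bin_uppers)
    counts = [0] * (len(uppers) + 1)
    for v in values:
        # index of the first upper limit that is >= v
        counts[bisect_left(uppers, v)] += 1
    return counts

# Hypothetical data and bin limits, for illustration only
ages = [18, 21, 25, 25, 30, 33, 38, 40, 41, 47, 52, 60]
bins = [20, 30, 40, 50]

for label, count in zip([f"<= {b}" for b in bins] + ["More"],
                        histogram_counts(ages, bins)):
    print(f"{label:>6}: {'#' * count} ({count})")
```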
Histogram
Box & Whiskers Plot
 With a box plot (box and whisker chart), you can see the distribution of
the numbers in the data: What are the highest and lowest numbers? What is the
median? What is the range of numbers on either side of the
median?

F4: =MIN(B4:B11)
F5: =QUARTILE(B4:B11,1)
F6: =MEDIAN(B4:B11)
F7: =QUARTILE(B4:B11,3)
F8: =MAX(B4:B11)

30
Box plot

31
Consider data set:
{1.25, 1.5, 2.5, 2.5, 3.1, 3.2, 4.1, 4.25, 4.75, 4.8, 4.95, 5.1}
minimum = 1.25, maximum = 5.1, n= no. of observation = 12

Determine the quartiles


Median = average of 6th & 7th observation = (3.2+4.1)/2 = 3.65
Q1 = first quartile = 25th percentile = average of 3rd & 4th observation = (2.5+2.5)/2=2.5
Q3 = third quartile = 75th percentile = average of 9th & 10th observation = (4.75+4.8)/2 = 4.775

32
Five Number summary
Find the five-number summary of the following data set and draw the box plot:
{3, 7, 8, 5, 12, 14, 21, 13, 18}.
 Five-number summary:
Minimum= 3
Q1 = 6
Median= 12
Q3 = 16
Maximum= 21.
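Different quartile conventions give different answers on small data sets. The sketch below uses Python's `statistics.quantiles`: its "exclusive" method reproduces the values above, while its "inclusive" method follows the interpolation rule that QUARTILE.INC uses, so Excel may report slightly different quartiles for the same data.

```python
import statistics

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

# "exclusive" interpolates at position p*(n+1); here it matches the slide's values
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
print("five-number summary:", min(data), q1, median, q3, max(data))

# "inclusive" interpolates at position p*(n-1)+1, the rule QUARTILE.INC uses
q1_inc, _, q3_inc = statistics.quantiles(data, n=4, method="inclusive")
print("QUARTILE.INC-style quartiles:", q1_inc, q3_inc)
```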

33
Application of Box plot and Whiskers
 A box plot is a visual representation of the mean and the quartiles of the data.
The rectangular box has a length equal to the interquartile range, bounded by the 1st
quartile and the 3rd quartile. The horizontal line is the median. The dot represents the mean of
the data. This visual becomes particularly useful when you are comparing two
data sets.
 For example, suppose you want to compare the earnings of male and female stars.
 A side-by-side box plot comparison is very useful. Earnings are in million
dollars.
 Here we are not going to draw the box plot from data; instead we will see the
importance of descriptive statistics by comparing two plots.

34
35
Application of Box plot and Whiskers
 As per the plots, the highest-paid female star earns almost the same as the highest-paid
male star. However, the spread of female stars' earnings is larger than that of male stars, as
evidenced by the range (the difference between maximum and minimum) and by the
interquartile range (the height of the rectangular box).
 Another interesting observation is that mean earnings are greater than median
earnings for female stars, while for male stars mean earnings are less than median
earnings.
 This implies that the earnings of female stars tend to be much more right-skewed
than those of male stars.
 A few female stars have very high earnings, pushing the overall mean
above the median.
 Thus we have covered two measures of dispersion: the range and the interquartile
range.

36

To Find Outliers in your Data

1. Calculate the 1st and 3rd quartiles.
Q1 = QUARTILE.INC(B2:B14,1) Q3 = QUARTILE.INC(B2:B14,3)

2. Evaluate the interquartile range.
Interquartile Range = IQR = QUARTILE.INC(B2:B14,3) - QUARTILE.INC(B2:B14,1)

3. Compute the upper and lower bounds for the data.
Upper bound = Q3 + 1.5*IQR, Lower bound = Q1 - 1.5*IQR

4. Use these bounds to identify the outlying data points: any value above the upper bound or below the lower bound is an outlier.
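The four steps can be sketched in Python, applied here to the seven salaries from the earlier mean-versus-median example. The function name `iqr_outliers` is hypothetical, and the "inclusive" quantile method is used because it follows the same interpolation rule as QUARTILE.INC.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_bound, upper_bound, outliers) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Salaries from the seven-employee example earlier in the deck
salaries = [28_000, 33_000, 33_000, 34_000, 37_000, 40_000, 400_000]
lower, upper, outliers = iqr_outliers(salaries)
print(f"bounds: {lower:,.0f} to {upper:,.0f}; outliers: {outliers}")
```

Only the $400,000 salary falls outside the bounds, which is the same observation that distorted the mean.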


Outliers
Types of Data - 1) Time Series Data
 Time series data is data that is chronologically ordered, and it is one of the
most frequently encountered types of data in business.

 With time series data, we are particularly interested in how the data varies
over time and in identifying patterns that occur systematically over time.

 Behavior such as trend, seasonality, cycles, the co-movement of one series with another, or one
series leading or lagging another is relatively easy to observe.

39
Types of Data - 2) Cross-sectional data
 Cross-sectional data is data that is taken at a single point in time or under
circumstances where time, as a dimension, is irrelevant.

 For cross-sectional data, measures of central tendency, measures of variation,
measures of skewness and kurtosis, and the correlation between two data sets
are used as descriptive statistics.

 With the help of charts we can visualize behavior of data.

 Correlation analysis makes the linear co-variation, or correlation,
between two variables much easier to understand, because it is measured on a
standardized scale from –1 to 1.

40
Types of Statistics
Statistics can be broken down into two areas:
 Descriptive statistics: describes and summarizes data. You are just
describing what the data shows: a trend, a specific feature, or a
certain statistic (like a mean or median).
 Inferential statistics: uses sample data to draw conclusions or make predictions about population
characteristics.

41
Descriptive Statistics

 Descriptive Statistics in Excel help us describe and understand
the features of a specific data set in summary form.
 The most common descriptive statistics in Excel are mean,
median, mode, variance, standard deviation, etc.
 The measures of central tendency are mean, median and mode.
 The measures of dispersion are variance, standard deviation and
range.

42
What is Descriptive Statistics in Excel?

 Descriptive statistics summarize the information available in a data set.
Excel has a built-in tool for this: on the Data tab, open Data Analysis and
choose Descriptive Statistics. The tool provides several types of output:
The mean, mode, median and range
Variance and standard deviation
Sample Variance
Skewness
Kurtosis
Count, maximum and minimum
Quintiles and Percentile

43
Steps to Enable Descriptive Statistics in Excel
(Analysis Toolpak )
 Step 1: Go to
File > Options.
 Step 2: Go to Add-ins
 Step 3: Under Add-ins
on the right-hand side
you will see all the
inactive Applications.
Select Analysis Toolpak
and click on GO.

44
Analysis Toolpak
 Step 4: Now you will see all the add-ins available for your Excel. Select Analysis Toolpak
and click on OK.
You should now see the Data Analysis option under the Data tab.
 Click on Data Analysis and you will see all the available analysis
techniques, such as Anova, t-Test, F-Test, Correlation, Histogram,
Regression, Descriptive Statistics and many more.
Descriptive Statistics in Excel
Now, look at simple data from a test: the scores of 10
students. Using these scores, we run the Descriptive Statistics data
analysis.
Step 1: Go to Data > Data Analysis.
Step 2: Once you click on Data Analysis, you will see a list of all the available analysis
techniques. Scroll down and select Descriptive Statistics.
Step 3: Under Input Range, select the range of scores including the heading, check
Labels in first row, select Output Range and give a cell reference, and check
Summary statistics.
Step 4: Click on OK to complete the task. In the referenced cell you will see the
summary report of the Descriptive Statistics data analysis.
46
Descriptive Statistics
Student RollNo   Marks
1                72
2                91
3                77
4                80
5                72
6                46
7                81
8                54
9                83
10               46

Marks
Mean                 70.20
Standard Error       5.05
Median               74.50
Mode                 72.00
Standard Deviation   15.97
Sample Variance      255.07
Kurtosis             -1.01
Skewness             -0.65
Range                45.00
Minimum              46.00
Maximum              91.00
Sum                  702.00
Count                10.00
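To see where each line of the summary report comes from, the report can be rebuilt by hand. The sketch below uses Python's `statistics` module for the basic measures and, for skewness and kurtosis, the sample formulas Microsoft documents for SKEW and KURT; the dictionary layout is illustrative and not Excel's own output format.

```python
import math
import statistics

marks = [72, 91, 77, 80, 72, 46, 81, 54, 83, 46]

n = len(marks)
mean = statistics.mean(marks)
sd = statistics.stdev(marks)                   # sample standard deviation
z = [(x - mean) / sd for x in marks]           # standardized marks

# Sample skewness and excess kurtosis, using the formulas Excel documents
skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

summary = {
    "Mean": mean,
    "Standard Error": sd / math.sqrt(n),
    "Median": statistics.median(marks),
    "Mode": statistics.mode(marks),            # first of the most common values
    "Standard Deviation": sd,
    "Sample Variance": statistics.variance(marks),
    "Kurtosis": kurt,
    "Skewness": skew,
    "Range": max(marks) - min(marks),
    "Minimum": min(marks),
    "Maximum": max(marks),
    "Sum": sum(marks),
    "Count": n,
}
for name, value in summary.items():
    print(f"{name:<20}{value:>10.2f}")
```

Both 72 and 46 occur twice in the marks; the report shows 72 because it is the first of the tied values in the list.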
47
Descriptive Statistics
A B C D E F G
1 Name Gender Age Height (Ft) Weight (Kg) Study Hrs Exam Score
2 Ram 1 Male 24 4.64 52 6 46
3 Ram 2 Male 26 4.34 85 8 47
4 Ram 3 Male 26 4.56 68 11 42
5 Raj 1 Male 23 6.17 82 11 40
6 Raj 2 Male 27 4.35 88 12 30
7 Raj 3 Male 32 5.33 60 7 62
8 Ramya 1 Female 27 6.44 85 8 83
9 Ramya 2 Female 22 5.98 50 9 48
10 Ramya 3 Female 26 5.74 64 12 95
11 Ramya 4 Male 25 4.62 59 8 56
12 Ramya 5 Male 27 6.25 57 11 47
13 Ajith 1 Male 26 4.93 66 12 56
14 Ajith 2 Male 25 5.47 79 9 55
15 Ajith 3 Male 30 4.49 52 11 86
16 Ajith 4 Female 23 6.29 76 11 32
17 Ajith 5 Female 25 5.00 59 9 69
18 Karan 1 Female 32 4.09 88 12 47
19 Karan 2 Female 23 6.10 50 8 65
20 Karan 3 Female 32 6.06 77 8 77
21 Karan 4 Male 32 5.43 73 11 56
22 Karan 5 Male 24 4.29 49 8 95
23 Karan 6 Male 28 4.09 56 6 49
24 Karan 7 Male 27 5.67 72 12 56
25 Karan 8 Male 28 6.63 87 7 51
26 Karan 9 Male 26 4.14 52 7 55
Descriptive Statistics

 Step 1: Go to Data > Data Analysis.
 Step 2: Once you click on Data Analysis, you will see a list of all the available
analysis techniques. Scroll down and select Descriptive Statistics.
 Step 3: Under Input Range, select all the numeric columns including headings,
i.e. $C$1:$G$26, and check Labels in first row.
 Step 4: Give the output reference where you want the summary report to appear.
 Step 5: Tick the Summary statistics option.
 Step 6: Click on OK.

Descriptive Statistics

Example : Women's Health Survey
Let us take a look at an example. In 1985, the USDA commissioned a study of
women’s nutrition. Nutrient intake was measured for a random sample of 737
women aged 25-50 years. The following variables were measured (Nutrient.xlsx):

 Calcium(mg)

 Iron(mg)

 Protein(g)

 Vitamin A(μg)

 Vitamin C(mg)

51
Nutrient.xlsx
Calcium(mg) Iron(mg) Protein(g) Vitamin A(μg) Vitamin C(mg)
1 522.29 10.188 42.561 349.13 54.141
2 343.32 4.113 67.793 266.99 24.839
3 858.26 13.741 59.933 667.9 155.455
4 575.98 13.245 42.215 792.23 224.688
5 1927.5 18.919 111.316 740.27 80.961
6 607.58 6.8 45.785 165.68 13.05
7 1046.19 18.433 116.418 1119.59 158.986
8 181.21 12.762 64.156 78.49 26.942
9 327.08 8.693 49.161 568.38 49.977
10 383.09 13.667 103.844 1029.2 8.404
11 1227.58 16.81 107.698 623.32 37.063
12 845.44 7.417 56.519 273.16 21.692
13 460.52 12.375 52.126 204.02 139.515
14 349.56 16.074 50.679 597.67 53.576
15 74.46 6.838 77.878 6.48 69.714
16 280.31 14.548 88.798 556.27 100.92
17 422.36 21.638 148.713 418.03 263.039
52
Descriptive Statistics
Example: Nutrient Intake Data - Descriptive Statistics (The MEANS Procedure)
Variable   N     Mean          Std Dev       Minimum     Maximum
Calcium    737   624.0492537   397.2775401   7.4400000   2866.44
Iron       737   11.1298996    5.9841905     0           58.6680000
Protein    737   65.8034410    30.5757564    0           251.0120000
A          737   839.6353460   1633.54       0           34434.27
C          737   78.9284464    73.5952721    0           433.3390000

53
Statistic            Calcium (mg)   Iron (mg)   Protein (g)   Vitamin A (μg)   Vitamin C (mg)
Mean                 624.05         11.13       65.80         839.64           78.93
Standard Error       14.63          0.22        1.13          60.17            2.71
Median               548.29         10.03       61.07         524.03           53.59
Mode                 #N/A           8.95        89.24         0.00             0.00
Standard Deviation   397.28         5.98        30.58         1633.54          73.60
Sample Variance      157829.44      35.81       934.88        2668452.37       5416.26
Kurtosis             2.61           11.54       3.15          250.80           2.73
Skewness             1.31           2.30        1.14          13.41            1.60
Range                2859.00        58.67       251.01        34434.27         433.34
Minimum              7.44           0.00        0.00          0.00             0.00
Maximum              2866.44        58.67       251.01        34434.27         433.34
Sum                  459924.30      8202.74     48497.14      618811.25        58170.27
Count                737.00         737.00      737.00        737.00           737.00
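The Standard Error row is not an independent measurement: it is the standard deviation divided by the square root of the count. A short Python check, using the standard deviations from the MEANS procedure output in this deck (the dictionary names are illustrative), recovers that row.

```python
import math

n = 737  # number of women in the sample
# Standard deviations from the MEANS procedure output in this deck
std_dev = {
    "Calcium": 397.2775401,
    "Iron": 5.9841905,
    "Protein": 30.5757564,
    "Vitamin A": 1633.54,
    "Vitamin C": 73.5952721,
}

# Standard error of the mean = standard deviation / sqrt(n)
standard_error = {name: sd / math.sqrt(n) for name, sd in std_dev.items()}
for name, se in standard_error.items():
    print(f"{name:<10} standard error = {se:.2f}")
```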
54
Measure of Association
Correlation Coefficient
 Suppose X denotes the number of cups of hot chocolate sold daily at a local café,
and Y denotes the number of apple cinnamon muffins sold daily at the same café.
Then, the manager of the café might benefit from knowing whether X and Y are
highly correlated or not. If the random variables are highly correlated, then the
manager would know to make sure that both are available on a given day. If the
random variables are not highly correlated, then the manager would know that it
would be okay to have one of the items available without the other.
X = Number of hot chocolate cup sold daily at a local café
Y = Number of apple cinnamon muffins sold daily at same café
Correlation coefficient is a measure that describes the strength and direction of a
relationship between two variables.

55
Covariance
 Let X and Y be two random variables. To quantify how X and Y
vary together, we calculate the covariance
between the two random variables.

56
Correlation Coefficient

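Correlation coefficient rXY = Cov(X, Y) / (Standard Deviation (X) * Standard Deviation (Y)) = CORREL(range_X, range_Y)
Standard Deviation (X) = STDEV.S(range)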

57
Interpretation of Correlation

−1≤ rXY ≤1.


 rXY = 1 means a perfect positive linear relationship - as one variable increases,
the other increases proportionally.
 rXY = -1 means a perfect negative linear relationship - as one variable increases,
the other decreases proportionally.
 rXY = 0 means no linear relationship between the two variables - the data points are
scattered with no upward or downward trend.
If 0 < rXY < 1, then X and Y are positively, linearly correlated, but not perfectly so.
If -1 < rXY < 0, then X and Y are negatively, linearly correlated, but not perfectly so.
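A from-scratch sketch of the sample correlation coefficient shows how the three cases arise. The function name `pearson_r` and the café sales figures are hypothetical, invented only to illustrate the calculation.

```python
import math

def pearson_r(x, y):
    """Sample correlation: covariance divided by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

# Hypothetical daily sales, invented for illustration only
hot_chocolate = [12, 15, 9, 20, 18, 11, 14]
muffins = [10, 14, 8, 17, 16, 9, 13]

print(f"r = {pearson_r(hot_chocolate, muffins):.3f}")
print(f"perfect positive: {pearson_r([1, 2, 3], [2, 4, 6]):.1f}")
print(f"perfect negative: {pearson_r([1, 2, 3], [6, 4, 2]):.1f}")
```

In Excel the same quantity is returned by CORREL for two ranges of equal length.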

58
Interpretation of Correlation

59
It shows a strong negative correlation (about -0.97) between the average
monthly temperature and the number of heaters sold:

60
Sample Covariance Cov(X, Y) = Σ (xi - x̄)(yi - ȳ) / (n - 1)
For example, the variance of iron intake is s22 = 35.8 mg^2.
The covariance between calcium and iron intake is s12 = 940.1.

Covariance Matrix
                 Calcium (mg)
Calcium (mg)     157615.29
Iron (mg)        938.81
Protein (g)      6067.572
Vitamin A (μg)   102272.17
Vitamin C (mg)   6692.523
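The matrix entries are slightly smaller than the sample figures quoted in this deck (157615.29 against a sample variance of 157829.44 for calcium, 938.81 against a sample covariance of 940.1). A likely explanation, assuming the matrix came from Excel's Covariance tool, is that the tool divides by n rather than n - 1. The sketch below rescales the sample figures accordingly; the variable names are illustrative.

```python
n = 737

# Sample (n - 1) figures quoted elsewhere in this deck
sample_var_calcium = 157829.44
sample_cov_calcium_iron = 940.1

# Rescale from the n - 1 divisor to the n divisor
factor = (n - 1) / n
print(f"calcium variance, divisor n:        {sample_var_calcium * factor:.2f}")
print(f"calcium-iron covariance, divisor n: {sample_cov_calcium_iron * factor:.2f}")
```

The rescaled calcium variance matches the matrix entry, and the covariance agrees to within the rounding of the quoted 940.1.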
61
Sample Correlation
The results suggest that protein, iron, and calcium are all
positively associated. Intake of each of these three nutrients
increases with increasing values of the remaining two.
Correlation Matrix
                 Calcium(mg)   Iron(mg)   Protein(g)   Vitamin A(μg)   Vitamin C(mg)
Calcium(mg)      1.0000
Iron(mg)         0.3954        1.0000
Protein(g)       0.5002        0.6234     1.0000
Vitamin A(μg)    0.1578        0.2438     0.1468       1.0000
Vitamin C(mg)    0.2292        0.3126     0.2121       0.1835          1.0000
62
Coefficient of Determination
 The coefficient of determination is another measure of association and is
simply equal to the square of the correlation. For example, in this case, the
coefficient of determination between protein and iron is
r23^2 = (0.62337)^2 ≈ 0.389.

 This says that about 39% of the variation in iron intake is explained by
protein intake. Or, conversely, about 39% of the variation in protein intake is
explained by iron intake. Both interpretations are equivalent.
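Squaring other entries of the correlation matrix gives the corresponding coefficients of determination. A small sketch, with variable names chosen for illustration and the correlations copied from the matrix in this deck:

```python
# Correlations taken from the correlation matrix in this deck
r_protein_iron = 0.6234
r_calcium_protein = 0.5002
r_calcium_iron = 0.3954

for label, r in (("protein-iron", r_protein_iron),
                 ("calcium-protein", r_calcium_protein),
                 ("calcium-iron", r_calcium_iron)):
    print(f"{label:<16} r = {r:.4f}  r^2 = {r * r:.3f}")
```

Even the strongest pair, protein and iron, shares well under half of its variation, so these nutrients are related but far from interchangeable.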

63
HOMEWORK PROBLEMS

64
