DOM105 Session 1

Uploaded by

Vidit Dixit
Copyright
© © All Rights Reserved

DOM105 2019

Session 1
Reading: SfM Ch.2,3
Categorical and Numerical Data
Categorical data are values that fall into
groupings or categories.
Displayed using summary tables, bar charts,
pie charts, etc.
Numerical data are numbers that have not
been separated into categories.
Displays of numerical data include ordered arrays,
frequency distributions, scatter plots, etc.
Both types of data can be displayed using
some types of tables, such as Pivot Tables.
Summary table
Tallies the values of various categories as frequencies and
percentages for each category.
Contingency Table
Cross-tabulates, or tallies jointly, the values
of two or more categorical variables,
allowing the study of patterns. Cells hold
frequencies or percentages.
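A minimal Python sketch of both table types. The six records below (fund type, risk level) are hypothetical and only illustrate the tallying; they do not come from the course data.

```python
from collections import Counter

# Hypothetical records: (fund type, risk level) for six funds.
records = [
    ("Growth", "High"), ("Growth", "High"), ("Growth", "Low"),
    ("Value", "Low"), ("Value", "Low"), ("Value", "High"),
]

# Summary table: frequency and percentage for one categorical variable.
summary = Counter(fund for fund, _risk in records)
summary_pct = {fund: 100 * count / len(records) for fund, count in summary.items()}

# Contingency table: joint tally of the two categorical variables.
contingency = Counter(records)

for (fund, risk), count in sorted(contingency.items()):
    print(fund, risk, count)
```

Each key of `contingency` is one cell of the cross-tabulation; dividing the counts by `len(records)` would give the percentage version of the same table.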
Display Categorical Data – Bar Chart

[Figure: horizontal bar chart, "Investor's Portfolio". Categories: Savings, CD, Bonds, Stocks. X-axis: Amount in K$, 0 to 50.]
Pie Chart – Investor’s portfolio
[Figure: pie chart of the investor's portfolio.]
Stocks 42%
Bonds 29%
Savings 15%
CD 14%
Percentages are rounded to the nearest percent.
Side by side Bar Chart
Uses sets of bars to show joint responses from two or more
categorical variables.

[Figure: side-by-side bar chart, "Comparing Investors". Categories: Savings, CD, Bonds, Stocks. One bar per category for each of Investor A, Investor B, and Investor C. X-axis: 0 to 60.]


Tabulating Numerical Data: Frequency Distributions

Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Find range: 58 - 12 = 46
Select number of classes: 5 (usually between 5 and 15)
Compute class interval (width): 46/5 = 9.2, rounded up to 10
Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
Compute class midpoints: 15, 25, 35, 45, 55
Count observations and assign them to classes
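The steps above can be sketched in Python as follows. Starting the first class at a multiple of the class width is an assumption made here so that the boundaries match the slide (10, 20, ...); the slide does not state how the first boundary is chosen.

```python
import math

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

num_classes = 5
data_range = max(data) - min(data)             # 58 - 12 = 46
width = math.ceil(data_range / num_classes)    # 9.2 rounded up to 10

# Assumption: the first boundary is the multiple of the width at or below the minimum.
start = (min(data) // width) * width

boundaries = [start + i * width for i in range(num_classes + 1)]
classes = list(zip(boundaries, boundaries[1:]))
midpoints = [(low + high) / 2 for low, high in classes]

# Each class is "low but under high": lower limit included, upper limit excluded.
frequencies = [sum(low <= x < high for x in data) for low, high in classes]

for (low, high), freq in zip(classes, frequencies):
    print(f"{low} but under {high}: {freq}")
```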
Frequency Distributions, Relative Frequency
Distributions and Percentage Distributions

Data in ordered array:


12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class              Frequency   Relative Frequency   Percentage
10 but under 20        3              .15               15
20 but under 30        6              .30               30
30 but under 40        5              .25               25
40 but under 50        4              .20               20
50 but under 60        2              .10               10
Total                 20             1.00              100
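A short Python sketch of how the relative frequency and percentage columns follow from the frequency column. The dictionary layout is just one convenient choice, not something prescribed by the slides.

```python
# Class frequencies taken from the table above.
frequencies = {
    "10 but under 20": 3,
    "20 but under 30": 6,
    "30 but under 40": 5,
    "40 but under 50": 4,
    "50 but under 60": 2,
}

total = sum(frequencies.values())   # number of observations

# Relative frequency = class frequency / total; percentage = relative frequency * 100.
relative = {label: freq / total for label, freq in frequencies.items()}
percentage = {label: 100 * freq / total for label, freq in frequencies.items()}

for label in frequencies:
    print(f"{label}: {frequencies[label]}  {relative[label]:.2f}  {percentage[label]:.0f}%")
```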
© 2002 Prentice-Hall, Inc.
Graphing Numerical Data:
The Histogram

Data in ordered array:


12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: histogram of the ordered array. Y-axis: Frequency, 0 to 7. X-axis: class midpoints 15, 25, 35, 45, 55, with empty classes at 5 and "More". Bars span the class boundaries, with no gaps between bars.]
Tabulating Numerical Data:
Cumulative Frequency

Data in ordered array:


12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class              Cumulative Frequency   Cumulative % Frequency
10 but under 20             3                       15
20 but under 30             9                       45
30 but under 40            14                       70
40 but under 50            18                       90
50 but under 60            20                      100
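The cumulative columns are running totals of the class frequencies. A minimal Python sketch, using the frequencies 3, 6, 5, 4, 2 from the earlier frequency distribution:

```python
from itertools import accumulate

# Class frequencies for 10-20, 20-30, 30-40, 40-50, 50-60.
frequencies = [3, 6, 5, 4, 2]

# Running total of the frequencies gives the cumulative frequency.
cumulative = list(accumulate(frequencies))

# Cumulative % frequency = cumulative frequency as a share of all observations.
total = cumulative[-1]
cumulative_pct = [100 * c / total for c in cumulative]

print(cumulative)       # running totals per class
print(cumulative_pct)   # the values plotted on the ogive
```

The `cumulative_pct` values, plotted against the upper class boundaries, are the points of the ogive.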
Graphing Numerical Data:
The Ogive (Cumulative % Polygon)

Data in ordered array:


12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

[Figure: ogive. Y-axis: cumulative percentage, 0 to 100. X-axis: class boundaries (not midpoints), 10 to 60.]


Graphing Two Numerical Variables - Scatter Plot

[Figure: "Mutual Funds Scatter Plot". Y-axis: Total Year to Date Return (%), 0 to 40. X-axis: Net Asset Values, 0 to 40.]
Time-Series Plot
Numerical variable on the Y-axis, associated time period on the X-axis.
Errors in visualizing data
Using “chart junk”: visual effects that distort
or distract from the data being presented, e.g.
garish graphics or irrelevant visuals.
Failing to provide a relative basis when
comparing data between groups. For
example, two separate pie charts showing the
operations of two companies do not help if
we’re trying to compare the two.
Compressing the vertical axis, e.g. using an axis
going up to 100 when the highest value is 30.
Make sure to include zero on the axes.
Measures of Central Tendency
Most sets of data show a tendency to
group around a central value; this is the
‘central tendency’.
The most common measures of central
tendency are mean, median, and mode.
The mean, also called the arithmetic
mean, is the sum of all values in the
sample divided by the number of values.
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
n is the size of the sample.
Median
Robust measure of central tendency
Not affected by extreme values
In an ordered array, the median is
the “middle” number
Median: the value at rank (n+1)/2.
If n is odd, the median is the middle
number.
If n is even, the median is the
average of the two middle numbers.
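A minimal Python sketch of the median rule above, with a small check of its robustness. The second data set replaces the largest value with an artificial outlier (2100) purely for illustration.

```python
def median(values):
    """Median as the value at rank (n + 1) / 2 of the ordered array."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value.
        return ordered[mid]
    # Even n: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


def mean(values):
    return sum(values) / len(values)


data = [11, 12, 13, 16, 16, 17, 17, 18, 21]
with_outlier = [11, 12, 13, 16, 16, 17, 17, 18, 2100]   # artificial extreme value

print(median(data), median(with_outlier))   # the median does not move
print(mean(data), mean(with_outlier))       # the mean is pulled toward the outlier
```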
Mode
A measure of central tendency
It is the value that occurs most often
in the sample.
Not affected by extreme values
Used for either numerical or
categorical data
There may be no mode if all values
have the same frequency
There may be several modes if two or
more values are tied for the
highest frequency.
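A small Python sketch of these rules. Returning an empty list for "no mode" is a convention chosen here for illustration, and the function assumes a non-empty input.

```python
from collections import Counter


def modes(values):
    """Return all modes, or an empty list when every value is equally frequent."""
    counts = Counter(values)
    highest = max(counts.values())
    # No mode: several distinct values, all with the same frequency.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)


numerical = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
             32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
categorical = ["CD", "Stocks", "Bonds", "Stocks"]

print(modes(numerical))     # two values tie for the highest frequency
print(modes(categorical))   # works for categorical data too
print(modes([1, 2, 3]))     # all values equally frequent: no mode
```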
Variation and shape of data: Range
Measure of variation
Difference between the largest and the smallest observations:
$$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$$
Ignores the way in which data are distributed
Does not consider how the values cluster between the extremes.
Quartiles
Quartiles split data into 4 parts.
1st Quartile (Q1) splits the lowest 25% of the values from the rest.
3rd Quartile (Q3) splits the lowest 75% of the values from the rest.
Q2 is the median.
In an ordered array of n values, Q1 is the value at rank (n+1)/4 and Q3 is the value at rank 3(n+1)/4.
If the rank ends in a half (2.5th, 7.5th, etc.), the quartile is the
average of the two values on either side. If the rank is a
fraction other than a half, round it to the nearest integer.
Interquartile Range
Measure of variation
Also known as midspread
Spread in the middle 50%
Difference between the first and third quartiles

Not affected by extreme values


Data in Ordered Array: 11 12 13 16 16 17 17 18 21

$$\text{Interquartile Range} = Q_3 - Q_1 = 17.5 - 12.5 = 5$$
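The quartile rank rule and the interquartile range example above can be checked with a short Python sketch. It assumes quartile q sits at rank q(n+1)/4 in the ordered array, which is consistent with the ranks 2.5 and 7.5 in this example, and it is only meant for arrays with at least a few values.

```python
def quartile(ordered, q):
    """Quartile q (1, 2 or 3) of an ordered list, using the rank rule from the slides."""
    n = len(ordered)
    rank = q * (n + 1) / 4        # 1-based rank, possibly fractional
    whole = int(rank)
    fraction = rank - whole
    if fraction == 0:
        return ordered[whole - 1]
    if fraction == 0.5:
        # Rank ends in a half: average the two neighbouring values.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Any other fraction: round the rank to the nearest integer.
    return ordered[round(rank) - 1]


data = [11, 12, 13, 16, 16, 17, 17, 18, 21]

q1 = quartile(data, 1)   # rank 2.5 -> average of the 2nd and 3rd values
q3 = quartile(data, 3)   # rank 7.5 -> average of the 7th and 8th values
iqr = q3 - q1

print(q1, q3, iqr)
```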


Percentile
To find the xth percentile, we use the same method as for quartiles.
List the data in ascending order.
xth percentile = value at rank (n+1)x/100, where n is the
number of data points.
If the rank is fractional, interpolate linearly between the
two neighbouring values.
E.g.: the 80th percentile of 30 data points is at rank
31 × 0.8 = 24.8.
Value = 24th data point × 0.2 + 25th data point × 0.8
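A minimal Python sketch of this percentile rule. The data set (the integers 1 to 30) is hypothetical, chosen so that the 80th percentile is easy to verify by hand, and clamping ranks that fall outside the array to the first or last value is an assumption the slides do not cover.

```python
def percentile(ordered, x):
    """xth percentile of an ordered list: value at rank (n + 1) * x / 100."""
    n = len(ordered)
    rank = (n + 1) * x / 100
    whole = int(rank)
    fraction = rank - whole
    # Assumption: ranks outside 1..n are clamped to the end values.
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    below = ordered[whole - 1]    # data point at the integer part of the rank
    above = ordered[whole]        # next data point
    # Same as below * (1 - fraction) + above * fraction.
    return below + fraction * (above - below)


values = list(range(1, 31))       # hypothetical data: 1, 2, ..., 30

# Rank is 31 * 0.8 = 24.8, so the result is 24th value * 0.2 + 25th value * 0.8.
p80 = percentile(values, 80)
print(p80)
```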
Variance
Important measure of variation
Shows variation about the mean
Is the average of the squared differences between each element and the
mean (for a sample, the sum is divided by n − 1 rather than n)
Sample variance: $$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$ where $\bar{X}$ is the sample mean.
Population variance: $$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$ where µ is the population
mean.
Standard Deviation
Most important measure of variation
Shows variation about the mean
Has the same units as the original data
Is the square root of the variance
Sample standard deviation: $$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Population standard deviation: $$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
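A short Python sketch of the variance and standard deviation formulas, cross-checked against the standard library's `statistics` module. The eight data values are hypothetical, picked so that the population results come out as whole numbers.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    size = len(values)
    mu = sum(values) / size
    return sum((x - mu) ** 2 for x in values) / size


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, mean = 5

s2 = sample_variance(data)         # 32 / 7
sigma2 = population_variance(data) # 32 / 8
s = math.sqrt(s2)                  # sample standard deviation
sigma = math.sqrt(sigma2)          # population standard deviation

print(s2, s, sigma2, sigma)
print(statistics.variance(data), statistics.pvariance(data))   # library cross-check
```

The standard deviations are in the same units as the data, while the variances are in squared units.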
Why do we divide by (n-1) for sample variance?
For a sample variance to be unbiased, the average variance over
all possible samples from a given population has to be equal to
the population variance.
It can be shown mathematically that if the sample variance is
calculated using n instead of n-1, the average variance over all
possible samples is smaller than the population variance.
Dividing by n-1 instead is called Bessel’s correction.
It is used only when the population mean and variance are unknown,
so that the sample mean stands in for µ.
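The unbiasedness argument can be checked by brute force on a tiny example. This Python sketch uses a hypothetical population {1, 2, 3} and enumerates every sample of size 2 drawn with replacement (an assumption, so that draws are independent); exact fractions avoid rounding noise.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]            # hypothetical population
N = len(population)
mu = Fraction(sum(population), N)
population_variance = sum((x - mu) ** 2 for x in population) / N


def variance(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample) / divisor


n = 2
samples = list(product(population, repeat=n))   # all 9 ordered samples of size 2

# Average the two competing estimators over every possible sample.
avg_dividing_by_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)
avg_dividing_by_n = sum(variance(s, n) for s in samples) / len(samples)

print(population_variance, avg_dividing_by_n_minus_1, avg_dividing_by_n)
```

Dividing by n - 1 reproduces the population variance on average, while dividing by n gives a smaller average.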
Shape of a Distribution
Describes how data is distributed
Measures of shape
Symmetric or skewed
Left-skewed (negative): Mean < Median < Mode
Symmetric: Mean = Median = Mode
Right-skewed (positive): Mode < Median < Mean
5-number summary and the Box Plot
The 5 numbers: X_smallest, Q1, Q2 (median), Q3, X_largest
Boxplot
Graphical display of data using 5-number summary
[Figure: box plot marking X_smallest, Q1, Median (Q2), Q3, and X_largest on an axis from 4 to 12.]
Distribution Shape and the Boxplot

[Figure: three box plots labelled Left-Skewed, Symmetric, and Right-Skewed, each marking Q1, Q2, and Q3.]
Measuring skewness
Skewness is the measure of asymmetry in a data
distribution. One method of calculating it is the adjusted
Fisher-Pearson coefficient, G, as follows:
$$G = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{S} \right)^3$$
where S is the sample standard deviation.
A symmetrical distribution like a normal distribution will
have G = 0. A negative value indicates left-skewed data; a positive
value indicates right-skewed data.
Extreme outliers can distort the value of G, giving
misleading results.
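A minimal Python sketch of the adjusted Fisher-Pearson coefficient, assuming the common sample form G = n / ((n-1)(n-2)) × Σ((Xi − X̄)/S)³ with S the sample standard deviation. The three small data sets are hypothetical and chosen to show the sign of G.

```python
import math


def adjusted_skewness(values):
    """Adjusted Fisher-Pearson skewness coefficient (needs at least 3 values)."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]    # long tail to the right
left_skewed = [-20, -4, -3, -2, -1]  # mirror image: long tail to the left

print(adjusted_skewness(symmetric))
print(adjusted_skewness(right_skewed))
print(adjusted_skewness(left_skewed))
```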
Measuring kurtosis
Kurtosis is a measure of how ‘heavy’ the tails of a data set
are, i.e., how prone it is to outliers, relative to a
normal distribution.

Measured this way (as excess kurtosis), a normal distribution has
kurtosis = 0. Higher kurtosis indicates heavier tails and more
outliers; lower kurtosis indicates lighter tails and fewer outliers.
As with skewness, extreme outliers can distort kurtosis values.
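The slides do not show which kurtosis formula is used. As an illustration only, this Python sketch assumes the bias-adjusted sample excess kurtosis (the form used by spreadsheet functions such as Excel's KURT), for which a normal distribution scores about 0. Both data sets are hypothetical.

```python
import math


def excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (needs at least 4 values)."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    fourth_powers = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth_powers - correction


flat = [1, 2, 3, 4, 5]                          # evenly spread, no outliers
one_outlier = [0, 0, 0, 0, 0, 0, 0, 0, 0, 10]   # a single extreme value

print(excess_kurtosis(flat))          # negative: lighter tails than a normal distribution
print(excess_kurtosis(one_outlier))   # positive: the outlier makes the tails heavy
```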
Pitfalls and Ethical Considerations
Data analysis is objective
Should report the summary measures that best meet the
assumptions about the data set
Data interpretation is subjective
Should be done in a fair, neutral and clear manner

Numerical descriptive measures:


Should document both good and bad results
Should be presented in a fair, objective and neutral manner
Should not use inappropriate summary measures to distort facts
