Organizing, Visualizing and Describing Data

Uploaded by

adityaharish18

Reading: Organizing, Visualizing, and Describing Data

LOS a: Identify and compare data types.


We may classify data types from three different perspectives:
 • numerical versus categorical,
 • time series versus cross-sectional,
 • and structured versus unstructured.

Numerical, or quantitative, data are values that can be counted or measured and may be discrete or
continuous. Categorical, or qualitative, data are labels that can be used to classify a set of data into
groups and may be nominal or ordinal.

A time series is a set of observations taken at a sequence of points in time. Cross-sectional data are
a set of comparable observations taken at one point in time. Time series and cross-sectional data may
be combined to form panel data.

Structured data are organized in a defined way, such as rows and columns. Unstructured data are
information presented in forms that are not regularly structured and may be generated by individuals,
business processes, or sensors.

LOS b: Describe how data are organized for quantitative analysis.


Data are typically organized into arrays for analysis. A time series is an example of a one-dimensional
array. Panel data are an example of a two-dimensional array, or data table.
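
As a minimal sketch of the two layouts, the Python snippet below stores a time series as a list and
panel data as a small table keyed by company. The company names and earnings figures are hypothetical,
made up for illustration only.

```python
# Hypothetical quarterly earnings per share, made up for illustration.
periods = ["Q1", "Q2", "Q3", "Q4"]

# One-dimensional array: a time series for a single company.
eps_company_a = [1.10, 1.25, 1.18, 1.32]

# Two-dimensional array (data table): one series per company, aligned on the same periods.
panel = {
    "Company A": [1.10, 1.25, 1.18, 1.32],
    "Company B": [0.80, 0.78, 0.85, 0.90],
}

# Slicing the table at one period gives a cross section: all companies at one point in time.
q3 = periods.index("Q3")
cross_section_q3 = {name: series[q3] for name, series in panel.items()}
print(cross_section_q3)
```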

LOS c: Interpret frequency and related distributions.


A frequency distribution groups observations into classes, or intervals. An interval is a range of values.

Relative frequency is the percentage of total observations falling within an interval.

Cumulative relative frequency for an interval is the sum of the relative frequencies for all values less
than or equal to that interval’s maximum value.
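
The three measures above can be sketched in a few lines of Python. The returns and the interval
boundaries below are hypothetical, and the sketch assumes each interval includes its lower bound and
excludes its upper bound.

```python
# Hypothetical sample of 10 annual returns (in %), made up for illustration.
returns = [-12.0, -4.5, 1.2, 3.8, 5.0, 7.4, 9.9, 12.3, 15.6, 24.1]

# Assumed intervals: lower bound inclusive, upper bound exclusive.
intervals = [(-20, -10), (-10, 0), (0, 10), (10, 20), (20, 30)]

def frequency_table(data, intervals):
    """Return one row per interval with absolute, relative, and cumulative relative frequency."""
    n = len(data)
    rows = []
    cumulative = 0
    for low, high in intervals:
        absolute = sum(1 for x in data if low <= x < high)
        cumulative += absolute
        rows.append({
            "interval": (low, high),
            "absolute": absolute,
            "relative": absolute / n,
            "cumulative_relative": cumulative / n,
        })
    return rows

table = frequency_table(returns, intervals)
for row in table:
    print(row)
```

The cumulative relative frequency of the last interval is 1 because every observation falls at or
below its upper bound.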

LOS d: Interpret a contingency table.


A contingency table is a two-dimensional array with which we can analyze two variables at the same
time. The rows represent some attributes of one of the variables and the columns represent those
attributes for the other variable. The data in each cell show the joint frequency with which we observe
a pair of attributes simultaneously. The total of frequencies for a row or a column is the marginal
frequency for that attribute.
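
A minimal sketch of how joint and marginal frequencies could be tallied in Python follows. The two
variables (sector and size) and the eight observations are hypothetical.

```python
from collections import Counter

# Hypothetical observations: (sector, size) for eight stocks, made up for illustration.
stocks = [
    ("Tech", "Large"), ("Tech", "Small"), ("Tech", "Large"),
    ("Energy", "Small"), ("Energy", "Small"), ("Energy", "Large"),
    ("Health", "Large"), ("Health", "Small"),
]

joint = Counter(stocks)                               # joint frequency for each cell
row_totals = Counter(sector for sector, _ in stocks)  # marginal frequencies by sector
col_totals = Counter(size for _, size in stocks)      # marginal frequencies by size

# Print the table with marginal totals in the last column and last row.
sizes = sorted(col_totals)
print("Sector".ljust(8), *(s.ljust(6) for s in sizes), "Total")
for sector in sorted(row_totals):
    cells = (str(joint[(sector, s)]).ljust(6) for s in sizes)
    print(sector.ljust(8), *cells, row_totals[sector])
print("Total".ljust(8), *(str(col_totals[s]).ljust(6) for s in sizes), len(stocks))
```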

LOS e: Describe ways that data may be visualized and evaluate uses of specific visualizations.


A histogram is a bar chart of data that has been grouped into a frequency distribution.

A frequency polygon plots the midpoint of each interval on the horizontal axis and the absolute
frequency for that interval on the vertical axis, and it connects the midpoints with straight lines.

A cumulative frequency distribution chart is a line chart of the cumulative absolute frequency or the
cumulative relative frequency.

Bar charts can be used to illustrate relative sizes, degrees, or magnitudes.

A grouped or clustered bar chart can illustrate two categories at once.

In a stacked bar chart, the height of each bar represents the cumulative frequency for a category, and
the colors within each bar represent joint frequencies.

A tree map is another method for visualizing the relative sizes of categories.

A word cloud is generated by counting the uses of specific words in a text file. It displays the words
that appear most often, in type sizes that are scaled to the frequency of their use.
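
The counting step behind a word cloud can be sketched as follows. The sample text, the list of
ignored words, and the maximum type size are all assumptions made for illustration; drawing the
cloud itself is left out.

```python
import re
from collections import Counter

# Hypothetical text, made up for illustration.
text = "Rates rose as inflation rose. Inflation fears hit bonds; bonds fell as rates rose."

words = re.findall(r"[a-z]+", text.lower())
stop_words = {"as", "the", "a", "and"}            # assumed list of words to ignore
counts = Counter(w for w in words if w not in stop_words)

# Type size is proportional to frequency; the most frequent word gets max_pt (assumed value).
max_pt = 40
top = counts.most_common(1)[0][1]
sizes = {w: max_pt * c / top for w, c in counts.items()}

for word, count in counts.most_common():
    print(word, count, round(sizes[word], 1))
```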

Line charts are particularly useful for exhibiting time series.

Multiple time series can be displayed on a line chart if their scales are comparable. It is also possible
to display two time series on a line chart if their scales are different by using left and right vertical
axes.

A technique for adding a dimension to a line chart is to create a bubble line chart.

A scatter plot is a way of displaying how two variables tend to change together. The vertical axis
represents one variable and the horizontal axis represents a second variable. Each point in the scatter
plot shows the values of both variables at one specific point in time.

To analyze three variables at the same time, an analyst can create a scatter plot matrix that consists
of three scatter plots of these variables, each presenting two of the three variables.

A heat map uses colors and shades to display data frequency.

LOS f: Describe how to select among visualization types.


Which chart types tend to be most effective depends on what they are intended to visualize:
 • Relationships: Scatter plots, scatter plot matrices, and heat maps.
 • Comparisons: Bar charts, tree maps, and heat maps for comparisons among categories; line
charts and bubble line charts for comparisons over time.
 • Distributions: Histograms, frequency polygons, and cumulative distribution charts for
numerical data; bar charts, tree maps, and heat maps for categorical data; and word clouds
for unstructured data.

LOS g: Calculate and interpret measures of central tendency.

LOS h: Evaluate alternative definitions of mean to address an investment problem.


The arithmetic mean is the simple average: the sum of the observations divided by the number of
observations. It is used to estimate the expected value, that is, the value of a single outcome from
a distribution.
 • Weighted mean: each observation is multiplied by a weight that reflects its relative importance.

To reduce the effects of outliers, we calculate:


 • Trimmed mean: omits outliers, so the number of observations is smaller than in the original
data set.
 • Winsorized mean: replaces outliers with given values, so the number of observations remains the
same.
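
The difference between the two adjustments can be sketched in Python. The sketch assumes that
outliers are defined as a fixed percentage of observations, split equally between the two tails, and
that the winsorized version replaces them with the nearest remaining value; the returns are
hypothetical.

```python
def trimmed_mean(data, pct):
    """Drop the lowest and highest pct/2 percent of observations, then average the rest."""
    xs = sorted(data)
    k = int(len(xs) * pct / 200)       # number of observations cut from each tail
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

def winsorized_mean(data, pct):
    """Replace the lowest and highest pct/2 percent with the nearest remaining value."""
    xs = sorted(data)
    k = int(len(xs) * pct / 200)
    if k == 0:
        return sum(xs) / len(xs)
    low, high = xs[k], xs[-k - 1]
    clipped = [min(max(x, low), high) for x in xs]
    return sum(clipped) / len(clipped)

# Hypothetical returns (in %) with one extreme value in each tail.
returns = [-40, 1, 3, 4, 5, 6, 7, 8, 12, 60]
print(sum(returns) / len(returns))    # arithmetic mean, pulled by the extremes
print(trimmed_mean(returns, 20))      # 8 observations remain
print(winsorized_mean(returns, 20))   # still 10 observations
```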

The geometric mean is used to calculate or estimate periodic compound returns over multiple periods.

The harmonic mean can be used to find an average purchase price, such as dollars per share for equal
periodic investments.
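
A short Python sketch of the geometric and harmonic means follows. The returns and share prices are
hypothetical, and the function names are chosen for illustration.

```python
import math

def geometric_mean_return(returns):
    """Compound periodic return: [(1 + R1)(1 + R2)...(1 + Rn)]^(1/n) - 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1

def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)

# Hypothetical: +50% followed by -50% turns $1.00 into $0.75.
two_year = [0.50, -0.50]
print(geometric_mean_return(two_year))   # negative, although the arithmetic mean is 0

# Hypothetical: the same dollar amount invested at $8, $10, and $12.50 per share.
prices = [8.0, 10.0, 12.5]
print(harmonic_mean(prices))             # average cost per share
```

Investing $1,000 at each of the three prices buys 125 + 100 + 80 = 305 shares for $3,000, which is
the same average cost per share that the harmonic mean returns.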

The median is the midpoint of a data set when the data are arranged from largest to smallest.

The mode of a data set is the value that occurs most frequently.

LOS i: Calculate quantiles and interpret related visualizations.


Quantile is the general term for a value at or below which a stated proportion of the data in a
distribution lies. Examples of quantiles include the following:
 • Median: The distribution is divided into halves.
 • Quartile: The distribution is divided into quarters.
 • Quintile: The distribution is divided into fifths.
 • Decile: The distribution is divided into tenths.
 • Percentile: The distribution is divided into hundredths (percents).

Example: What is the third quartile for the following distribution of returns?
8%, 10%, 12%, 13%, 15%, 17%, 17%, 18%, 19%, 23%
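
One way to work this example is sketched below. It assumes the position convention
L = (n + 1) × y / 100 with linear interpolation between neighboring observations; the source does
not state which convention it intends, and other conventions can give slightly different answers.

```python
def quantile_by_position(data, y):
    """y-th percentile using position L = (n + 1) * y / 100 with linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    position = (n + 1) * y / 100      # 1-based position in the sorted data
    lower = int(position)             # whole part of the position
    fraction = position - lower
    if lower < 1:
        return xs[0]
    if lower >= n:
        return xs[-1]
    return xs[lower - 1] + fraction * (xs[lower] - xs[lower - 1])

returns = [8, 10, 12, 13, 15, 17, 17, 18, 19, 23]   # in %, from the example
q1 = quantile_by_position(returns, 25)
q3 = quantile_by_position(returns, 75)
print(q3)        # position 8.25: one quarter of the way from 18 to 19
print(q3 - q1)   # interquartile range under the same convention
```

Under this convention the third quartile sits at position 11 × 0.75 = 8.25, so it is
18% + 0.25 × (19% − 18%) = 18.25%.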

The difference between the third quartile and the first quartile is known as the interquartile range.

To visualize a data set based on quantiles, we can create a box and whisker plot. In a box and whisker
plot, the box represents the central portion of the data, such as the interquartile range.

Box and whisker plot

The following information relates to the next two questions.

Q. The median is closest to:


A. 34.51.
B. 100.49.
C. 102.98.

Q. The interquartile range is closest to:


A. 13.76.
B. 25.74.
C. 34.51.

LOS j: Calculate and interpret measures of dispersion.


The range is the difference between the largest and smallest values in a data set.

Mean absolute deviation (MAD) is the average of the absolute values of the deviations from the
arithmetic mean.

Variance is defined as the mean of the squared deviations from the arithmetic mean or from the
expected value of a distribution.

Standard deviation is the positive square root of the variance and is frequently used as a quantitative
measure of risk.

Example: Assume the yearly returns of a stock are 30%, 12%, 25%, 20%, and 23%.

The coefficient of variation for sample data is the ratio of the standard deviation of the sample to its
mean (expected value of the underlying distribution).
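
Using the five yearly returns from the example, the dispersion measures can be sketched in Python as
follows. The sketch treats the returns as a sample, so the variance divides by n − 1; that choice is
an assumption, since the example does not say whether the data are a sample or a population.

```python
import math

returns = [30, 12, 25, 20, 23]   # yearly returns in %, from the example
n = len(returns)
mean = sum(returns) / n

value_range = max(returns) - min(returns)
mad = sum(abs(x - mean) for x in returns) / n
# Sample variance divides by n - 1; a population variance would divide by n.
sample_variance = sum((x - mean) ** 2 for x in returns) / (n - 1)
sample_std = math.sqrt(sample_variance)
cv = sample_std / mean           # dispersion per unit of mean return

print(mean, value_range, mad, sample_variance)
print(round(sample_std, 2), round(cv, 3))
```

The deviations from the 22% mean are 8, −10, 3, −2, and 1, so the absolute deviations sum to 24 and
the squared deviations sum to 178.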

LOS k: Calculate and interpret target downside deviation.


Target downside deviation, also called target semideviation, is a measure of downside risk.

Calculating target downside deviation is similar to calculating standard deviation, but in this case, we
choose a target against which to measure each outcome and only include outcomes below that target
when calculating the numerator.

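The formula for target downside deviation is:

$$
s_{\text{target}} = \sqrt{\frac{\sum_{\text{all } X_i \le B} (X_i - B)^2}{n - 1}},
$$

where B is the target value, X_i is an observation, and n is the total number of observations.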
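
A minimal Python sketch of this calculation follows. It reuses the five yearly returns from the
earlier example and assumes a target of 22%, which is a hypothetical value chosen for illustration.
The denominator is n − 1, with n counting all observations and not only those below the target.

```python
import math

def target_downside_deviation(data, target):
    """Square root of the sum of squared deviations at or below target, divided by n - 1."""
    n = len(data)
    squared_shortfalls = [(x - target) ** 2 for x in data if x <= target]
    return math.sqrt(sum(squared_shortfalls) / (n - 1))

returns = [30, 12, 25, 20, 23]   # yearly returns in %, reused from the earlier example
target = 22                      # hypothetical target return in %
tdd = target_downside_deviation(returns, target)
print(round(tdd, 2))
```

Only 12% and 20% fall below the target, so the numerator is (−10)² + (−2)² = 104 and the result is
the square root of 104 / 4 = 26.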

LOS l: Interpret skewness.


Skewness describes the degree to which a distribution is not symmetric about its mean. A right-
skewed distribution has positive skewness. A left-skewed distribution has negative skewness.

For a positively skewed, unimodal distribution, the mean is greater than the median, which is greater
than the mode.

For a negatively skewed, unimodal distribution, the mean is less than the median, which is less than
the mode.
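
The ordering of mean, median, and mode can be checked on a small sample. The sketch below uses a
common approximation of sample skewness (the average cubed deviation divided by the cubed sample
standard deviation); the exact formula intended by the source is not given, and the data are
hypothetical.

```python
import statistics

def sample_skewness(data):
    """Approximate sample skewness: (1/n) * sum of cubed deviations / s^3."""
    n = len(data)
    mean = sum(data) / n
    s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return sum((x - mean) ** 3 for x in data) / n / s ** 3

# Hypothetical right-skewed sample: most values are small, one is large.
right_skewed = [1, 2, 2, 2, 3, 4, 5, 21]
left_skewed = [-x for x in right_skewed]   # mirror image, so the skew flips sign

for data in (right_skewed, left_skewed):
    print(sample_skewness(data) > 0,
          statistics.mean(data), statistics.median(data), statistics.mode(data))
```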

LOS m: Interpret kurtosis.


Kurtosis measures the peakedness of a distribution and the probability of extreme outcomes
(thickness of tails):
 • Excess kurtosis is measured relative to a normal distribution, which has a kurtosis of 3.
 • Positive values of excess kurtosis indicate a distribution that is leptokurtic (fat tails, more
peaked), so the probability of extreme outcomes is greater than for a normal distribution.
 • Negative values of excess kurtosis indicate a platykurtic distribution (thin tails, less peaked).
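
The sign of excess kurtosis can be illustrated with two small samples. The sketch uses a common
approximation (the average fourth-power deviation divided by the squared sample variance, minus 3);
the exact formula intended by the source is not given, and both samples are hypothetical.

```python
def sample_excess_kurtosis(data):
    """Approximate excess kurtosis: (1/n) * sum of fourth-power deviations / s^4, minus 3."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    return sum((x - mean) ** 4 for x in data) / n / variance ** 2 - 3

# Hypothetical samples, made up for illustration.
fat_tailed = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]   # clustered at the center, two extremes
flat = [-4, -3, -2, -1, 0, 1, 2, 3, 4]             # evenly spread, no extremes

print(sample_excess_kurtosis(fat_tailed))   # positive: leptokurtic
print(sample_excess_kurtosis(flat))         # negative: platykurtic
```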

LOS n: Interpret correlation between two variables.


Correlation is a standardized measure of association between two random variables. It ranges in value
from –1 to +1.

Correlation does not imply that changes in one variable cause changes in the other. Spurious
correlation may result by chance or from the relationships of two variables to a third variable.

Scatter plots are useful for revealing nonlinear relationships that are not measured by correlation.
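
The limits of correlation as a measure of linear association can be sketched in Python. The data
below are hypothetical: one series is an exact linear function of x and the other is an exact
quadratic function of x.

```python
import math

def correlation(xs, ys):
    """Sample correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (std_x * std_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]    # exact linear relationship
parabola = [x ** 2 for x in xs]     # exact, but nonlinear, relationship

print(correlation(xs, linear))      # close to +1
print(correlation(xs, parabola))    # zero, even though y is fully determined by x
```

A scatter plot of the second pair would show a clear U shape that the correlation of zero does not
reveal.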
