Organizing, Visualizing and Describing Data
Numerical, or quantitative, data are values that can be counted or measured and may be discrete or
continuous. Categorical, or qualitative, data are labels that can be used to classify a set of data into
groups and may be nominal or ordinal.
A time series is a set of observations taken at a sequence of points in time. Cross-sectional data are a
set of comparable observations taken at one point in time. Time series and cross-sectional data may
be combined to form panel data.
Unstructured data refers to information that is presented in forms that are not regularly structured
and may be generated by individuals, business processes, or sensors.
Cumulative relative frequency for an interval is the sum of the relative frequencies for all values less
than or equal to that interval’s maximum value.
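As an illustration of the definition above, the following minimal sketch builds relative and cumulative relative frequencies from absolute frequencies. The interval labels and counts are hypothetical and chosen only for the example.

```python
from itertools import accumulate

# Hypothetical absolute frequencies for four return intervals (illustrative only).
intervals = ["-10% to 0%", "0% to 10%", "10% to 20%", "20% to 30%"]
absolute_freq = [3, 7, 8, 2]

total = sum(absolute_freq)

# Relative frequency: each interval's share of all observations.
relative_freq = [f / total for f in absolute_freq]

# Cumulative relative frequency: running sum of the relative frequencies,
# i.e. the share of observations at or below each interval's maximum value.
cumulative_relative_freq = list(accumulate(relative_freq))

for label, rel, cum in zip(intervals, relative_freq, cumulative_relative_freq):
    print(f"{label:>12}: relative = {rel:.2f}, cumulative = {cum:.2f}")
```

The last cumulative relative frequency is always 1, which is a quick check on the arithmetic.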
LOS e: Describe ways that data may be visualized and evaluate uses of specific visualizations.
A histogram is a bar chart of data that has been grouped into a frequency distribution.
A frequency polygon plots the midpoint of each interval on the horizontal axis and the absolute
frequency for that interval on the vertical axis, and it connects the midpoints with straight lines.
A cumulative frequency distribution chart is a line chart of the cumulative absolute frequency or the
cumulative relative frequency.
In a stacked bar chart, the height of each bar represents the cumulative frequency for a category, and
the colors within each bar represent joint frequencies.
A tree map is another method for visualizing the relative sizes of categories.
A word cloud is generated by counting the uses of specific words in a text file. It displays the words
that appear most often, in type sizes that are scaled to the frequency of their use.
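The counting step behind a word cloud can be sketched as follows. The sample text, the number of words kept, and the linear mapping from count to type size are all assumptions made for illustration; actual word cloud tools may scale sizes differently.

```python
import re
from collections import Counter

# Hypothetical sample text (illustrative only).
text = "risk and return: higher risk may bring higher return, but risk is not return"

# Split into lowercase words and count how often each one is used.
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)

# Keep the most frequent words and scale type size with frequency.
# The linear scaling and the size limit are assumptions of this sketch.
max_size = 40
top_words = counts.most_common(3)
max_count = top_words[0][1]
sizes = {word: max_size * count / max_count for word, count in top_words}

for word, size in sizes.items():
    print(f"{word}: count = {counts[word]}, type size = {size:.0f}")
```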
Multiple time series can be displayed on a line chart if their scales are comparable. It is also possible
to display two time series on a line chart if their scales are different by using left and right vertical
axes.
A technique for adding a dimension to a line chart is to create a bubble line chart, in which the size
of the bubble at each data point represents a third variable.
A scatter plot is a way of displaying how two variables tend to change together. The vertical axis
represents one variable and the horizontal axis represents a second variable. Each point in the scatter
plot shows the values of both variables at one specific point in time.
To analyze three variables at the same time, an analyst can create a scatter plot matrix that consists
of three scatter plots of these variables, each presenting two of the three variables.
The geometric mean is used to calculate or estimate periodic compound returns over multiple periods.
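A minimal sketch of the geometric mean return follows, assuming returns are given as decimals per period. The three annual returns are hypothetical, and the function name is illustrative.

```python
import math

def geometric_mean_return(returns):
    """Periodic compound return: nth root of the product of (1 + r), minus 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1

# Hypothetical annual returns: +10%, -5%, +20% (illustrative only).
annual_returns = [0.10, -0.05, 0.20]

g = geometric_mean_return(annual_returns)
arithmetic = sum(annual_returns) / len(annual_returns)

print(f"geometric mean return:  {g:.4%}")
print(f"arithmetic mean return: {arithmetic:.4%}")
```

Compounding at the geometric mean for three periods reproduces the actual ending value, which the arithmetic mean does not do unless all returns are equal.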
The harmonic mean can be used to find an average purchase price, such as dollars per share for equal
periodic investments.
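The link between the harmonic mean and average purchase price can be checked directly. This sketch assumes a hypothetical investor who puts the same dollar amount into a stock in each of three periods; the prices and amount are illustrative.

```python
# Hypothetical share prices at three purchase dates (illustrative only).
prices = [8.0, 9.0, 10.0]
investment_per_period = 1000.0  # equal dollar amount invested each period

# Harmonic mean: N divided by the sum of reciprocals.
harmonic_mean_price = len(prices) / sum(1 / p for p in prices)

# Average cost per share computed directly: total dollars / total shares bought.
total_invested = investment_per_period * len(prices)
total_shares = sum(investment_per_period / p for p in prices)
average_cost_per_share = total_invested / total_shares

print(f"harmonic mean of prices: {harmonic_mean_price:.4f}")
print(f"average cost per share:  {average_cost_per_share:.4f}")
```

The two figures agree because equal dollar investments buy more shares when the price is low, which weights lower prices more heavily.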
The median is the midpoint of a data set when the data are arranged from largest to smallest.
The mode of a data set is the value that occurs most frequently.
Example: What is the third quartile for the following distribution of returns?
8%, 10%, 12%, 13%, 15%, 17%, 17%, 18%, 19%, 23%
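One way to work this example is sketched below. It assumes the convention that the y-th percentile sits at position (n + 1) × y / 100 in the sorted data, with linear interpolation between neighboring values; other software may use a different convention and give a slightly different answer. The function name is illustrative.

```python
def percentile(sorted_values, y):
    """Value at the y-th percentile, using position (n + 1) * y / 100.

    Positions are 1-based; a fractional position is resolved by linear
    interpolation between the two neighboring observations.
    """
    n = len(sorted_values)
    location = (n + 1) * y / 100
    lower = int(location)          # whole-number part of the position
    fraction = location - lower    # fractional part of the position
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)

# Returns from the example, in percent, already sorted ascending.
returns = [8, 10, 12, 13, 15, 17, 17, 18, 19, 23]

q3 = percentile(returns, 75)
print(f"third quartile: {q3}%")
```

With ten observations the position is 11 × 0.75 = 8.25, so the third quartile lies one quarter of the way from the 8th value (18%) to the 9th value (19%), or 18.25% under this convention.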
The difference between the third quartile and the first quartile is known as the interquartile range.
To visualize a data set based on quantiles, we can create a box and whisker plot. In a box and whisker
plot, the box represents the central portion of the data, such as the interquartile range.
Mean absolute deviation (MAD) is the average of the absolute values of the deviations from the
arithmetic mean.
Variance is defined as the mean of the squared deviations from the arithmetic mean or from the
expected value of a distribution.
Standard deviation is the positive square root of the variance and is frequently used as a quantitative
measure of risk.
Example: Assume the yearly returns of a stock are 30%, 12%, 25%, 20%, and 23%.
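The dispersion measures defined above can be worked through for these five returns. This is a minimal sketch that assumes the returns are treated as a sample, so the variance divides by n - 1; dividing by n instead would give the population variance.

```python
import math

# Yearly returns from the example, in percent.
returns = [30, 12, 25, 20, 23]
n = len(returns)

mean = sum(returns) / n

# Mean absolute deviation: average absolute distance from the mean.
mad = sum(abs(r - mean) for r in returns) / n

# Sample variance (assumption: n - 1 in the denominator) and standard deviation.
sample_variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
sample_std = math.sqrt(sample_variance)

print(f"mean:               {mean:.2f}%")
print(f"MAD:                {mad:.2f}%")
print(f"sample variance:    {sample_variance:.2f} (%^2)")
print(f"sample std. dev.:   {sample_std:.2f}%")
```

The deviations from the mean of 22% are 8, -10, 3, -2, and 1, giving a MAD of 24 / 5 = 4.8% and a sample variance of 178 / 4 = 44.5.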
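The coefficient of variation for sample data is the ratio of the sample standard deviation to the
sample mean (an estimate of the expected value of the underlying distribution). It measures
dispersion per unit of mean, which allows comparison across data sets with different scales.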
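As a brief sketch, the coefficient of variation for the yearly returns in the example above (30%, 12%, 25%, 20%, and 23%) can be computed with the standard library. `statistics.stdev` uses the sample (n - 1) form.

```python
import statistics

# Yearly returns from the example, in percent.
returns = [30, 12, 25, 20, 23]

sample_mean = statistics.mean(returns)
sample_std = statistics.stdev(returns)  # sample standard deviation (n - 1)

# Coefficient of variation: dispersion per unit of mean return.
cv = sample_std / sample_mean

print(f"coefficient of variation: {cv:.4f}")
```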
Calculating target downside deviation is similar to calculating standard deviation, but in this case, we
choose a target against which to measure each outcome and only include outcomes below that target
when calculating the numerator.
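$$ s_{\text{target}} = \sqrt{\frac{\sum_{X_i < B} (X_i - B)^2}{n - 1}}, $$
where B is the target value, X_i is an individual outcome, and n is the total number of observations.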
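A minimal sketch of target downside deviation follows. It assumes the sample form, in which squared shortfalls below the target B are summed and divided by n - 1, where n counts all observations and not only those below the target. The data reuse the yearly returns from the earlier example, and the target of 22% is hypothetical.

```python
import math

def target_downside_deviation(values, target):
    """Square root of summed squared shortfalls below target, over n - 1."""
    n = len(values)
    # Only outcomes below the target contribute to the numerator.
    squared_shortfalls = sum((x - target) ** 2 for x in values if x < target)
    return math.sqrt(squared_shortfalls / (n - 1))

# Yearly returns in percent; the 22% target is an assumption for illustration.
returns = [30, 12, 25, 20, 23]
target = 22

tdd = target_downside_deviation(returns, target)
print(f"target downside deviation: {tdd:.2f}%")
```

Only 12% and 20% fall below the target, so the numerator is (-10)^2 + (-2)^2 = 104, and the result is the square root of 104 / 4 = 26.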
For a positively skewed, unimodal distribution, the mean is greater than the median, which is greater
than the mode.
For a negatively skewed, unimodal distribution, the mean is less than the median, which is less than
the mode.
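The ordering for a positively skewed distribution can be seen in a small data set. The values below are hypothetical, with a single large observation pulling the mean to the right.

```python
import statistics

# Hypothetical positively skewed data: one large value in the right tail.
data = [1, 2, 2, 3, 4, 5, 12]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")
```

Here the mode is 2, the median is 3, and the mean is 29 / 7, or about 4.14, so mean > median > mode. Negating every value reverses the ordering, as in the negatively skewed case.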
Correlation does not imply that changes in one variable cause changes in the other. Spurious
correlation may result by chance or from the relationships of two variables to a third variable.
Scatterplots are useful for revealing nonlinear relationships that are not measured by correlation.
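To illustrate a nonlinear relationship that correlation does not measure, the sketch below computes the Pearson correlation between x and y = x^2 for x values symmetric around zero. The data are hypothetical, and the helper function is written out by hand for the example.

```python
import math

def pearson_correlation(xs, ys):
    """Sample Pearson correlation: covariance over the product of std. devs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: y is fully determined by x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_correlation(xs, ys)
print(f"correlation between x and x^2: {r:.4f}")
```

The correlation is zero even though y depends entirely on x; a scatter plot of these points would show the U-shaped pattern at once.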