Module 4
“Data visualization is the visual representation of your data. With the help of charts,
maps, and other graphical elements these tools provide a simple and comprehensible
way to clearly see and easily discover insights and patterns in your data.”
Why do we need data visualization?
With the help of descriptive graphics and dashboards, even difficult information can
be clear and comprehensible.
Here are some noteworthy numbers, based on research, that confirm the importance
of visualization:
• People get 90% of information about their environment from the eyes.
• 50% of brain neurons take part in visual data processing.
• Pictures increase the willingness to read a text by up to 80%.
• People remember 10% of what they hear, 20% of what they read, and 80% of
what they see.
Advantages
Relevant visualization brings lots of advantages for your business:
• Fast decision-making. Summing up data is easy and fast with graphics, which let
you quickly see that a column or touchpoint is higher than others without looking
through several pages of statistics in Google Sheets or Excel.
• More people involved. Most people are better at perceiving and remembering
information presented visually.
• Higher degree of involvement. Beautiful and bright graphics with clear messages
attract readers’ attention.
• Better understanding. Perfect reports are transparent not only for technical
specialists, analysts, and data scientists but also for CMOs and CEOs, and help
each and every worker make decisions in their area of responsibility.
Common general types of data visualization:
Charts
Tables
Graphs
Maps
Infographics
Dashboards
More specific examples of methods to visualize data:
Area Chart
Bar Chart
Box-and-whisker Plots
Scatter plot
Charts
The easiest way to show the development of one or several data sets is a chart.
Charts vary from bar and line charts that show the relationship between elements
over time to pie charts that demonstrate the components or proportions between the
elements of one whole.
Plots
Plots distribute two or more data sets over a 2D or even 3D space to show
the relationship between these sets and the parameters on the plot. Plots also vary.
Scatter and bubble plots are some of the most widely used visualizations. When it
comes to big data, analysts often use more complex box plots that help visualize the
relationship between large volumes of data.
Histogram
A histogram is a graphical representation that organizes a group of data points into
user-specified ranges. Similar in appearance to a bar graph, the histogram condenses
a data series into an easily interpreted visual by taking many data points and
grouping them into logical ranges or bins.
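The grouping step can be sketched in a few lines of Python. This is only an illustration: the scores, the bin edges, and the helper name `histogram_counts` are hypothetical, and the choice to include the right edge in the last bin is one common convention among several.

```python
def histogram_counts(values, edges):
    """Count how many values fall into each bin [edges[i], edges[i + 1]).

    The last bin also includes its right edge.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = edges[i] <= v < edges[i + 1]
            on_right_edge = i == last and v == edges[-1]
            if in_bin or on_right_edge:
                counts[i] += 1
                break
    return counts


# Hypothetical exam scores and user-specified bin edges.
scores = [52, 58, 61, 67, 70, 71, 75, 79, 83, 88, 90, 100]
edges = [50, 60, 70, 80, 90, 100]
counts = histogram_counts(scores, edges)

# Print a simple text histogram, one row per bin.
for i, c in enumerate(counts):
    print(f"{edges[i]:>3}-{edges[i + 1]:<3} {'#' * c}")
```

Twelve data points are condensed into five bars, which is the point of the histogram: the shape of the data is visible without reading every value.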
Table
A table (or data table) is an efficient format for comparative data analysis on
categorical objects. Usually, the items being compared are placed in a column,
while the categorical objects are in the rows. The quantitative value is then placed at
the intersection of the row and column, called the cell.
Frequency distribution
A frequency distribution is an overview of all distinct values in some variable and
the number of times they occur. That is, a frequency distribution tells how
frequencies are distributed over values. Frequency distributions are mostly used for
summarizing categorical variables.
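A frequency distribution for a categorical variable can be computed as follows. The survey answers are made up for illustration; `collections.Counter` is from the Python standard library.

```python
from collections import Counter

# Hypothetical categorical variable: preferred chart type of ten respondents.
answers = ["bar", "line", "bar", "pie", "bar",
           "line", "pie", "bar", "line", "bar"]

freq = Counter(answers)                          # absolute frequencies
n = len(answers)
rel_freq = {value: count / n for value, count in freq.items()}  # relative

# One row per distinct value, most frequent first.
for value, count in freq.most_common():
    print(f"{value:<5} {count:>2}  {rel_freq[value]:.0%}")
```

Each distinct value appears once, together with the number of times it occurs; the relative frequencies add up to 1.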
Central tendency and dispersion
Measures of central tendency map a vector of observations onto a single number
that represents, roughly put, “the center”.
Since what counts as a “center” is ambiguous, there are several measures of central
tendency.
Different measures of central tendency can be more or less adequate for one
purpose or another.
The type of variable (nominal, ordinal or metric, for instance) will also influence the
choice of measure.
We will visit three prominent measures of central tendency here: (arithmetic)
mean, median and mode.
Measures of dispersion indicate how much the observations are spread out around,
let’s say, “a center”. We will visit two prominent measures of dispersion:
the variance and the standard deviation.
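As a rough sketch, both measures can be computed directly from their usual definitions. The data are hypothetical, and the code assumes the population versions (dividing by n); the sample versions divide by n - 1 instead.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical metric observations

mean = statistics.fmean(data)

# Population variance: average squared distance from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance, in the data's own units.
std_dev = variance ** 0.5

# The standard library computes the same population quantities.
lib_variance = statistics.pvariance(data)
lib_std_dev = statistics.pstdev(data)

print(mean, variance, std_dev)
```

For these values the mean is 5, the variance is 4, and the standard deviation is 2.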
• The (arithmetic) mean
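For reference, the usual definition of the arithmetic mean is given here, assuming a vector $\vec{x} = \langle x_1, \dots, x_n \rangle$ of $n$ metric observations. The symbol $\bar{x}$ is a notational assumption.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```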
The median
• If $\vec{x} = \langle x_1, \dots, x_n \rangle$ is a vector of $n$ data observations from an at least ordinal measure, and
if $\vec{x}$ is ordered such that $x_i \le x_{i+1}$ for all $1 \le i < n$, the median is the value $x_i$ such
that the number of data observations greater than or equal to $x_i$ equals the number of data
observations less than or equal to $x_i$.
The mode
• The mode is the unique value that occurred most frequently in the data. If there is
no unique value with that property, there is no mode.
• While the mean is only applicable to metric variables, and the median only to
variables that are at least ordinal, the mode is only reasonable for variables that
have a finite set of different possible observations.
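The three measures can be compared on a small example. The data and the helper `unique_mode` are hypothetical; the helper follows the definition above and returns `None` when no unique most frequent value exists. Note that `statistics.median` averages the two middle values when the number of observations is even, which is a common convention rather than part of the definition given here.

```python
import statistics
from collections import Counter


def unique_mode(values):
    """Return the single most frequent value, or None if there is none."""
    ranked = Counter(values).most_common(2)
    if not ranked:
        return None                      # empty data has no mode
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None                      # tie for first place: no unique mode
    return ranked[0][0]


data = [1, 2, 2, 3, 7]  # hypothetical observations

mean_value = statistics.fmean(data)      # sum divided by count
median_value = statistics.median(data)   # middle element of the sorted data
mode_value = unique_mode(data)           # 2 occurs twice, all others once

tied = unique_mode([1, 1, 2, 2])         # two values share the top count

print(mean_value, median_value, mode_value, tied)
```

The single large observation 7 pulls the mean up to 3, while the median and the mode both stay at 2.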
Probabilities and outcomes
Your grade on an exam and the number of times your internet connection fails
while you are writing a term paper both have an element of chance or randomness. In
each of these examples, there is something not yet known that is eventually
revealed.
The mutually exclusive potential results of a random process are called the
outcomes. For example, while writing your term paper, the internet connection
might never fail, it might fail once, it might fail twice, and so on. Only one of these
outcomes will actually occur (the outcomes are mutually exclusive).
The probability of an outcome is the proportion of the time that the outcome occurs
in the long run, that is, over many repeated trials.
If the probability of your internet connection not failing while you are writing a
term paper is 80%, then over the course of writing many term papers, you will
complete 80% without a connection failure.
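The long-run interpretation can be illustrated with a small simulation. The 80% figure is from the text; the number of term papers and the fixed random seed are assumptions made so that the run is repeatable.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

P_NO_FAILURE = 0.80   # probability from the text
n_papers = 100_000    # hypothetical number of term papers (trials)

# Each paper is one trial: it counts if it was completed without a failure.
completed_ok = sum(random.random() < P_NO_FAILURE for _ in range(n_papers))
long_run_share = completed_ok / n_papers

print(f"share of papers without a failure: {long_run_share:.3f}")
```

With few trials the share can be far from 0.80; as the number of trials grows, it settles close to the probability.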
The sample space and events.
The set of all possible outcomes is called the sample space. An event is a subset of
the sample space; that is, an event is a set of one or more outcomes.
The event “my internet connection will fail no more than once” is the set consisting
of two outcomes: “no failures” and “one failure.”
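In code, a sample space and an event map naturally onto sets. In this sketch only the 0.80 for "no failures" comes from the text; the remaining probabilities, and the cutoff at four failures, are hypothetical values chosen so that the total is 1.

```python
# Probability of each outcome (number of connection failures).
# 0.80 is from the text; the other values are hypothetical.
pr = {0: 0.80, 1: 0.10, 2: 0.06, 3: 0.03, 4: 0.01}

sample_space = set(pr)     # the set of all possible outcomes
at_most_once = {0, 1}      # the event "no more than one failure"

# An event is a subset of the sample space.
is_event = at_most_once <= sample_space

# Outcomes are mutually exclusive, so the probability of the event
# is the sum of the probabilities of its outcomes.
pr_at_most_once = sum(pr[outcome] for outcome in at_most_once)

print(is_event, round(pr_at_most_once, 2))
```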
Random variables
A random variable is a numerical summary of a random outcome. The number of
times your internet connection fails while you are writing a term paper is random
and takes on a numerical value, so it is a random variable.
Some random variables are discrete and some are continuous. As their names
suggest, a discrete random variable takes on only a discrete set of values, like 0, 1,
2, . . . , whereas a continuous random variable takes on a continuum of possible
values.
Probability Distributions
A probability distribution is a statistical function that describes all the possible
values and likelihoods that a random variable can take within a given range.
Probability Distribution of a Discrete Random Variable
The probability distribution of a discrete random variable is the list of all possible
values of the variable and the probability that each value will occur. These
probabilities sum to 1.
For example, let M be the number of times your internet connection
fails while you are writing a term paper. The probability distribution of the random
variable M is the list of probabilities of all possible outcomes: the probability that
M = 0, denoted Pr(M = 0), is the probability of no connection failures;
Pr(M = 1) is the probability of a single connection failure; and so forth.
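A discrete distribution is just such a list, and it can be checked and accumulated directly. Only Pr(M = 0) = 0.80 is taken from the text; the other probabilities are hypothetical. The running sum gives the cumulative probability Pr(M ≤ m).

```python
from itertools import accumulate

# Pr(M = m) for m = 0..4. Only Pr(M = 0) = 0.80 is from the text;
# the remaining values are hypothetical.
values = [0, 1, 2, 3, 4]
probs = [0.80, 0.10, 0.06, 0.03, 0.01]

total = sum(probs)  # should be 1 for a valid probability distribution

# Cumulative distribution: Pr(M <= m) is a running sum of Pr(M = m).
cumulative = list(accumulate(probs))

for m, p, c in zip(values, probs, cumulative):
    print(f"Pr(M = {m}) = {p:.2f}   Pr(M <= {m}) = {c:.2f}")
```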
Figure: Cumulative probability distribution of a continuous random variable
Probability Distribution of a Continuous Random Variable
Probability density function.
Because a continuous random variable can take on a continuum of possible values,
the probability distribution used for discrete variables, which lists the probability of
each possible value of the random variable, is not suitable for continuous variables.
Instead, the probability is summarized by the probability density function. The area
under the probability density function between any two points is the probability that
the random variable falls between those two points. A probability density function is
also called a p.d.f., a density function, or simply a density.
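The "area under the density" idea can be checked numerically. As an assumption for illustration, a standard normal variable stands in for the continuous random variable; `statistics.NormalDist` is in the Python standard library (3.8+), and the helper `area_under_pdf` is hypothetical.

```python
from statistics import NormalDist

# Assumption: a standard normal variable is used as the example.
dist = NormalDist(mu=0.0, sigma=1.0)


def area_under_pdf(a, b, steps=10_000):
    """Approximate the area under dist.pdf between a and b (midpoint rule)."""
    width = (b - a) / steps
    heights = (dist.pdf(a + (i + 0.5) * width) for i in range(steps))
    return sum(heights) * width


a, b = -1.0, 1.0
area = area_under_pdf(a, b)

# The same probability obtained from the cumulative distribution function.
exact = dist.cdf(b) - dist.cdf(a)

print(f"{area:.4f} {exact:.4f}")
```

Both numbers agree at about 0.68: the probability that the variable falls between -1 and 1 is the area under the density between those two points.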
Figure: Probability density function of a continuous random variable
Joint distribution.
The joint probability distribution of two discrete random variables, say X and Y, is the probability
that the random variables simultaneously take on certain values, say x and y. The probabilities of all
possible (x, y) combinations sum to 1.
The joint probability distribution can be written as the function Pr(X = x, Y = y).
For example, weather conditions (whether or not it is raining) affect a student's
commuting time.
Let Y be a binary random variable that equals 1 if the commute is short (less than 20 minutes) and
that equals 0 otherwise, and let X be a binary random variable that equals 0 if it is raining and 1 if
not.
Between these two random variables, there are four possible outcomes: it rains and the commute is
long (X = 0, Y = 0); rain and short commute (X = 0, Y = 1); no rain and long commute (X = 1, Y =
0); and no rain and short commute (X = 1, Y = 1). The joint probability distribution is the frequency
with which each of these four outcomes occurs over many repeated commutes.
For example, the probability of a long rainy commute is 15%, and the probability of
a long commute with no rain is 7%, so the probability of a long commute (rainy or
not) is 22%. The marginal distribution of commuting times is given in the final
column of Table 2.2. Similarly, the marginal probability that it will rain is 30%, as
shown in the final row.
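The marginal probabilities quoted above can be recomputed from the joint distribution. In this sketch, 0.15 and 0.07 are stated in the text; the two short-commute entries are not stated directly and are derived from Pr(rain) = 0.30 and the requirement that all four probabilities sum to 1. The helper names are hypothetical.

```python
# Joint distribution Pr(X = x, Y = y).
# X = 0: rain, X = 1: no rain; Y = 0: long commute, Y = 1: short commute.
joint = {
    (0, 0): 0.15,  # rain, long commute (from the text)
    (0, 1): 0.15,  # rain, short commute (derived: 0.30 - 0.15)
    (1, 0): 0.07,  # no rain, long commute (from the text)
    (1, 1): 0.63,  # no rain, short commute (derived: 1 - 0.15 - 0.15 - 0.07)
}


def marginal_x(x):
    """Pr(X = x): sum the joint probabilities over all values of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """Pr(Y = y): sum the joint probabilities over all values of X."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


pr_long = marginal_y(0)   # long commute, rainy or not
pr_rain = marginal_x(0)   # rain, long or short commute

print(round(pr_long, 2), round(pr_rain, 2))
```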
Probability Distributions
Conditional Distributions
Conditional distribution. The distribution of a random variable Y conditional on
another random variable X taking on a specific value is called the conditional
distribution of Y given X. The conditional probability that Y takes on the value y
when X takes on the value x is written Pr(Y = y | X = x).
• Pr(Y = y | X = x) = Pr(X = x, Y = y) / Pr(X = x)
• Probability of a long commute (Y = 0) given a rainy day (X = 0):
• Pr(Y = 0 | X = 0) = Pr(X = 0, Y = 0) / Pr(X = 0) = 0.15 / 0.30 = 0.50
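The same calculation can be written as a small function. The joint probabilities 0.15 and 0.07 are from the text; the short-commute entries are derived from the stated 30% chance of rain and the sum-to-1 requirement. The function name is hypothetical.

```python
# Joint distribution Pr(X = x, Y = y).
# X = 0: rain, X = 1: no rain; Y = 0: long commute, Y = 1: short commute.
joint = {
    (0, 0): 0.15,  # from the text
    (0, 1): 0.15,  # derived from Pr(rain) = 0.30
    (1, 0): 0.07,  # from the text
    (1, 1): 0.63,  # derived so that all entries sum to 1
}


def conditional_y_given_x(y, x):
    """Pr(Y = y | X = x) = Pr(X = x, Y = y) / Pr(X = x)."""
    pr_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return joint[(x, y)] / pr_x


pr_long_given_rain = conditional_y_given_x(y=0, x=0)
pr_long_given_dry = conditional_y_given_x(y=0, x=1)

print(round(pr_long_given_rain, 2), round(pr_long_given_dry, 2))
```

Given rain, a long commute has probability 0.5, compared with 0.1 when it does not rain and 0.22 unconditionally, so knowing the weather changes the distribution of commuting time.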