Session 9 and 10 Data Visualization

Data visualization is an important part of descriptive analytics that helps decision makers gain useful insights. There are many chart types like histograms, bar charts, pie charts, and box plots that data scientists use to visualize data. Recently, tree maps and sunburst maps have become popular for creating hierarchical data visualizations. It is generally advisable to start an analytics project with data visualization to understand patterns and relationships in the data.

Uploaded by

Bandita Parida
Copyright
© © All Rights Reserved
Data Visualization

BGU
Data Visualization

• Data visualization is an integral part of descriptive analytics and gives decision makers useful insights.
• Many charts, such as the histogram, bar chart, pie chart, and box plot, help data scientists visualize data.
• In recent years, treemaps and sunburst maps have become very popular among analytics experts because they create hierarchical visuals of data.
• It is generally advisable to start an analytics project with data visualization.
Data Visualization

• A histogram is a visual representation of data that can be used to assess the probability distribution (frequency distribution) of the data.
• It is a frequency distribution of data arranged in consecutive, non-overlapping intervals.
• Histograms are created for continuous (numerical) data.
Data Visualization

• The following steps are used in constructing histograms:
1. Divide the data into a finite number of non-overlapping, consecutive bins (intervals). The total number of bins to be used can be calculated using Eq. (2.13) or (2.14).
2. Count the number of observations from the data that fall into each bin (interval).
3. Create a frequency distribution (bins on the horizontal axis and frequency on the vertical axis) using the information obtained in steps 1 and 2.
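The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the figures in these slides: the function name `build_histogram`, the choice to pass the bin width directly, and the budget values are all assumptions made for the example.

```python
import math

def build_histogram(values, bin_width):
    """Return ((lower, upper), count) pairs for equal-width bins."""
    low, high = min(values), max(values)
    # Divide the range into consecutive, non-overlapping bins of equal width.
    n_bins = max(1, math.ceil((high - low) / bin_width))
    counts = [0] * n_bins
    # Count the observations that fall into each bin.
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - low) // bin_width), n_bins - 1)
        counts[index] += 1
    # Pair each bin with its frequency to form the frequency distribution.
    bins = [(low + i * bin_width, low + (i + 1) * bin_width) for i in range(n_bins)]
    return list(zip(bins, counts))

# Hypothetical budgets in crores of rupees (not taken from the Bollywood data file).
budgets = [2, 5, 8, 11, 15, 24, 24, 30, 35, 48, 70, 102]
histogram = build_histogram(budgets, bin_width=20)
for (lower, upper), count in histogram:
    print(f"{lower:>4} to {upper:<4} {'#' * count}")
```

The text rendering puts the bins down the page instead of along the horizontal axis, but the frequency distribution is the same one a plotted histogram would show.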
Data Visualization

• A histogram is very useful because it helps data scientists identify the following:
• The shape of the distribution, and hence the probability distribution of the data.
• Measures of central tendency such as the median and mode.
• Measures of variability such as spread.
• Measures of shape such as skewness.
Data Visualization

• Histograms are also useful in identifying the presence of outliers.
• One of the first steps in constructing a histogram is identifying the number of bins.
• Many different formulas are used in the literature; one of the simplest is
• Number of bins, $N = (X_{\max} - X_{\min}) / W$ ……………(2.13)
• Here $X_{\max}$ and $X_{\min}$ are the maximum and minimum values of the data and $W$ is the desired width of the bin (interval).
• Intervals in histograms are usually of equal size.
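As a worked illustration of Eq. (2.13), write $X_{\max}$ and $X_{\min}$ for the maximum and minimum of the data and $W$ for the bin width. The numbers below are hypothetical (a maximum of 150, a minimum of 2, and a desired width of 10), and rounding up to a whole number of bins is an assumption, since the formula itself does not say how to handle a fractional result.

```latex
\begin{align*}
N &= \frac{X_{\max} - X_{\min}}{W} = \frac{150 - 2}{10} = 14.8 \\
N &\approx 15 \quad \text{(rounded up so that the maximum value still falls in a bin)}
\end{align*}
```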
Data Visualization

• Figures 2.3 and 2.4 show histograms of Bollywood movie budget and box-office collection, respectively, in crores of rupees (1 crore = 10 million), based on data for 149 Bollywood movies (Data file: Bollywood Data.xlsx).
Fig. 2.3. Histogram of Bollywood movie budget
Fig. 2.4. Histogram of Bollywood movie box-office collection
Data Visualization

• From Figure 2.3, we can infer that the budget for a large proportion of movies is less than 50 crores and that the distribution is right-skewed (that is, it has a long tail on the right-hand side).
• In Figure 2.4, we can also see an outlier where the box-office collection is more than 700 crores (the movie PK, starring Aamir Khan and directed by Rajkumar Hirani).
• Cumulative histograms are called ogive curves.
• The ogive curve for Bollywood box-office collection is shown in the next slide:
Fig. 2.5. Ogive curve for box-office collection
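An ogive plots the running total of the histogram frequencies against the upper edge of each bin. The sketch below shows that accumulation step only; the bin edges and frequencies are made up for the example and are not the values behind Fig. 2.5.

```python
from itertools import accumulate

# Hypothetical histogram: upper edge of each bin (crores) and its frequency.
bin_upper_edges = [100, 200, 300, 400, 500]
frequencies = [8, 6, 3, 2, 1]

# Running total of the frequencies gives the points of the ogive.
cumulative = list(accumulate(frequencies))
cumulative_percent = [100 * c / cumulative[-1] for c in cumulative]

for edge, total, percent in zip(bin_upper_edges, cumulative, cumulative_percent):
    print(f"up to {edge:>3}: {total:>2} movies ({percent:.0f}%)")
```

The last cumulative value always equals the number of observations, which is a quick check that no observation was dropped when binning.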
Data Visualization

• Usually, we superimpose a normal distribution on the histogram to see how close the frequency distribution of the data is to a normal distribution.
• Figure 2.6 shows the histogram of movie budget superimposed with a normal distribution.
• It is evident that the frequency distribution of budget is not a normal distribution.
Fig. 2.6. Histogram of Bollywood movie budget along with normal distribution frequency.
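One way to make the comparison with a normal distribution concrete is to compute, for each bin, the frequency a normal distribution with the same mean and standard deviation would produce. The sketch below does this with Python's `statistics.NormalDist`. The helper names, the bin edges, and the budget values are assumptions for illustration; the slides do not say how Fig. 2.6 was generated.

```python
from statistics import NormalDist, mean, stdev

def observed_counts(values, bin_edges):
    """Count observations per bin; the last bin also includes its upper edge."""
    bins = list(zip(bin_edges, bin_edges[1:]))
    counts = [0] * len(bins)
    for v in values:
        for i, (lo, hi) in enumerate(bins):
            is_last = i == len(bins) - 1
            if lo <= v < hi or (is_last and v == hi):
                counts[i] += 1
                break
    return counts

def expected_normal_counts(values, bin_edges):
    """Frequencies a normal distribution with the sample mean and std. dev. would give."""
    fitted = NormalDist(mean(values), stdev(values))
    n = len(values)
    return [n * (fitted.cdf(hi) - fitted.cdf(lo))
            for lo, hi in zip(bin_edges, bin_edges[1:])]

# Hypothetical right-skewed budgets in crores (not the Bollywood data file).
budgets = [2, 5, 8, 11, 15, 24, 24, 30, 35, 48, 70, 102]
edges = [0, 25, 50, 75, 100, 125]
observed = observed_counts(budgets, edges)
expected = expected_normal_counts(budgets, edges)
for (lo, hi), obs, exp in zip(zip(edges, edges[1:]), observed, expected):
    print(f"{lo:>3} to {hi:<3} observed={obs:<2} expected={exp:.1f}")
```

Large gaps between the observed and expected columns, especially in the first bin and the right tail, are the numerical counterpart of the mismatch visible when the curve is drawn over the histogram.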
Bar Chart

• A bar chart is a frequency chart for a qualitative (categorical) variable.
• Histograms cannot be used when the variable is qualitative.
• A bar chart can be used to identify the most-occurring and least-occurring categories within a data set.
• Figure 2.7 shows the bar chart for movie genre (Data file: Bollywood Data.xlsx).
• From the bar chart, it is evident that the genres drama and comedy are the ones most preferred by production houses in Bollywood.
Fig. 2.7. Bar chart for movie genre.
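The frequencies behind a bar chart are simply counts per category. The sketch below tallies a short, made-up list of genres with `collections.Counter` and prints a text bar for each; the genre list is hypothetical and does not reproduce the counts in Fig. 2.7.

```python
from collections import Counter

# Hypothetical genre labels, one per movie.
genres = ["Drama", "Comedy", "Drama", "Thriller", "Comedy",
          "Drama", "Romance", "Action", "Comedy", "Drama"]

counts = Counter(genres)
ranked = counts.most_common()  # categories sorted from most to least frequent

for genre, count in ranked:
    print(f"{genre:<10}{'#' * count} ({count})")

most_frequent = ranked[0]
lowest_count = ranked[-1][1]
```

Sorting the bars by frequency, as `most_common()` does, makes the most-occurring and least-occurring categories easy to read off.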
Pie Chart

• A pie chart is a circular chart, used mainly for categorical data, that displays the proportion of each category in the data set.
• The pie chart for movie genre based on the Bollywood movie data set is shown in the next slide.
• A pie chart helps to visualize the proportion (percentage) of each category as a sector of a circle.
Fig. 2.8. Pie chart for movie genre.
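Each sector of a pie chart takes the same share of the 360 degrees as its category takes of the total count. A minimal sketch of that calculation follows; the function name `pie_sectors` and the genre counts are assumptions for illustration only.

```python
def pie_sectors(counts):
    """Map each category to (proportion of total, sector angle in degrees)."""
    total = sum(counts.values())
    return {category: (count / total, 360 * count / total)
            for category, count in counts.items()}

# Hypothetical genre counts (not the values behind Fig. 2.8).
genre_counts = {"Drama": 4, "Comedy": 3, "Thriller": 2, "Action": 1}
sectors = pie_sectors(genre_counts)
for category, (proportion, angle) in sectors.items():
    print(f"{category:<10}{proportion:>6.0%}  {angle:>6.1f} degrees")
```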
Scatter Plot

• A scatter plot is a plot of two variables that helps data scientists understand whether there is any relationship between them.
• The relationship could be linear or non-linear.
• A scatter plot is also useful for assessing the strength of the relationship and for finding outliers in the data.
• The figure in the next slide shows a scatter plot of movie budget against box-office collection (in crores of rupees), plotted using the data set in the file Bollywood Data.xlsx.
Fig. 2.9. Scatter plot between movie budget and box-office collection.
Scatter Plot

• Figure 2.9 shows a linear relationship between budget and box-office collection and the existence of an outlier.
• Scatter plots are useful during regression model building to decide on the initial model, that is, whether or not to include a variable in the regression model.
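A scatter plot shows the strength of a linear relationship visually. One common numerical companion, which these slides do not name and which is introduced here only as an illustration, is Pearson's correlation coefficient. The sketch below computes it from first principles for a small, made-up set of budget and collection values.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical values in crores of rupees (not the Bollywood data file).
budget = [5, 10, 20, 35, 50, 70]
collection = [8, 18, 45, 60, 110, 130]
r = pearson_correlation(budget, collection)
print(f"correlation between budget and collection: {r:.2f}")
```

A value of r close to +1 or -1 indicates a strong linear relationship and a value near 0 a weak one. A single outlier can shift r noticeably, which is one reason to look at the scatter plot as well as the number.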
Box Plot (or Box and Whisker Plot)

• A box plot (also known as a box and whisker plot) is a graphical representation of numerical data that can be used to understand the variability of the data and the existence of outliers.
• A box plot is designed by identifying the following descriptive statistics:
• Lower quartile (1st quartile), median, and upper quartile (3rd quartile).
• Lowest and highest values.
• Inter-quartile range (IQR).
Box Plot (or Box and Whisker Plot)

• The box plot is constructed using the IQR and the minimum and maximum values.
• The box plot for the data in Table 2.4 is shown in Figure 2.11.
Box Plot (or Box and Whisker Plot)
Box Plot (or Box and Whisker Plot)

• The length of the box is equal to the IQR.
• The data may contain values beyond Q1 – 1.5 IQR and Q3 + 1.5 IQR.
• The whiskers of the box plot extend to Q1 – 1.5 IQR (or the minimum value) and Q3 + 1.5 IQR (or the maximum value); observations beyond these two limits are potential outliers.
• The box plot for the Bollywood movie budget is shown in Figure 2.12.
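The quantities that define a box plot can be computed directly. The sketch below is a minimal illustration with several assumptions: the function name `box_plot_summary` is hypothetical, the data values are made up, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, which may differ slightly from the quartile convention used for the figures in these slides.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles, IQR, whisker positions, and potential outliers for a box plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # A whisker stops at the minimum/maximum when that lies inside the limit.
        "lower_whisker": max(min(values), lower_limit),
        "upper_whisker": min(max(values), upper_limit),
        "outliers": [v for v in values if v < lower_limit or v > upper_limit],
    }

# Hypothetical budgets in crores (not the Bollywood data file).
budgets = [2, 5, 8, 11, 15, 24, 24, 30, 35, 48, 70, 102]
summary = box_plot_summary(budgets)
for name, value in summary.items():
    print(f"{name:<14}{value}")
```

In this made-up data set only the largest value lies beyond Q3 + 1.5 IQR, so it is the single potential outlier, and the lower whisker stops at the minimum because Q1 – 1.5 IQR is negative.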
Box Plot (or Box and Whisker Plot)

• In Figure 2.12, the lower whisker is at 2 (since that is the minimum value).
• The lower quartile is 11 (lower line of the box), the median is 24 (middle line in the box), and the upper quartile is 35 (upper line of the box).
• The upper whisker is at Q3 + 1.5 IQR = 71.
• All the observations shown above the upper whisker, that is, beyond Q3 + 1.5 IQR, are outliers.
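The whisker positions quoted for Figure 2.12 can be checked from the quartiles given above (Q1 = 11, Q3 = 35, minimum = 2):

```latex
\begin{align*}
\text{IQR} &= Q_3 - Q_1 = 35 - 11 = 24 \\
Q_3 + 1.5\,\text{IQR} &= 35 + 1.5 \times 24 = 71 \\
Q_1 - 1.5\,\text{IQR} &= 11 - 1.5 \times 24 = -25
\end{align*}
```

Because the lower limit of -25 is below the minimum value of 2, the lower whisker stops at the minimum, which matches the position stated for Figure 2.12.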
Treemap

• A treemap is a hierarchical map made up of nested rectangles. It is frequently used in business intelligence reports and helps organizations understand data hierarchically.
• To construct a treemap, the data should be hierarchical with several levels.
• The size and color of each rectangle are used to describe and differentiate the characteristics of the data.
Treemap

• A sample treemap is shown in Figure 2.13, which captures the undergraduate academic discipline of students admitted into an MBA program.
• The size of each area captures the proportion of students from that discipline.
• That is, in Figure 2.13 the area corresponding to engineering students is the largest, indicating that the largest proportion of students come from an engineering background.
Treemap

• The area corresponding to arts students is the smallest, indicating that the MBA program has the fewest students with an arts background.
• Each of the disciplines can be analyzed further.
• For example, the engineering students can be further grouped according to the type of college (Tier 1, Tier 2, etc.).
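The data behind a treemap is a hierarchy in which each leaf carries a size. The sketch below represents two levels (discipline, then college tier) as a nested dictionary and computes the share of the total area each top-level rectangle would receive. All names and student counts are hypothetical and are not the values shown in Figure 2.13.

```python
# Hypothetical admissions: discipline -> college tier -> number of students.
admissions = {
    "Engineering": {"Tier 1": 60, "Tier 2": 90},
    "Commerce": {"Tier 1": 20, "Tier 2": 40},
    "Science": {"Tier 1": 10, "Tier 2": 20},
    "Arts": {"Tier 1": 4, "Tier 2": 6},
}

def area_shares(tree):
    """Share of the total area for each top-level rectangle of the treemap."""
    totals = {name: sum(children.values()) for name, children in tree.items()}
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}

shares = area_shares(admissions)
for discipline, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{discipline:<12}{share:.0%}")
```

Applying the same calculation to the children of one discipline gives the split of its rectangle into nested rectangles, which is how the further grouping by tier would appear.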
Coxcomb Chart

• A coxcomb chart (also known as a polar area chart or rose chart) is an extension of the pie chart made popular by Florence Nightingale.
• In a coxcomb chart, the area of each sector represents the magnitude of the category.
• The main difference between a regular pie chart and a coxcomb chart is that in a pie chart the radius of every sector is the same, whereas in a coxcomb chart the radius of each sector is adjusted so that its area reflects the magnitude.
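Because the area of a circular sector with angle θ (in radians) and radius r is θr²/2, making the area proportional to a value means the radius grows with the square root of that value. The sketch below assumes the usual polar area convention that every sector gets the same angle; the function name `coxcomb_radii` and the category values are hypothetical.

```python
import math

def coxcomb_radii(values):
    """Radius of each equal-angle sector so that its area equals the category value."""
    angle = 2 * math.pi / len(values)  # every sector spans the same angle
    # Sector area = angle * r**2 / 2, so r = sqrt(2 * value / angle).
    return {category: math.sqrt(2 * value / angle)
            for category, value in values.items()}

# Hypothetical category magnitudes (not Florence Nightingale's data).
magnitudes = {"A": 400, "B": 100, "C": 25}
radii = coxcomb_radii(magnitudes)
for category, radius in radii.items():
    print(f"{category}: value={magnitudes[category]:>3} radius={radius:.2f}")
```

A category four times as large as another gets a radius only twice as long. Drawing the radius proportional to the value instead would exaggerate the larger categories.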
Coxcomb Chart

• Florence Nightingale collected data on causes of mortality among soldiers during the Crimean War (fought between the British and French on one side and the Russians on the other).
• She classified the causes into three categories:
• Preventable diseases
• Wounds sustained in the war
• Other causes
• In Figure 2.10 (originally prepared by Florence Nightingale), the largest area of the chart corresponds to the cause ‘preventable diseases’.
Fig. 2.10. Coxcomb chart on causes of mortality in the
army prepared by Florence Nightingale.