Data Wrangling and Visualization

Uploaded by

Ysa Antonio
DATA WRANGLING AND DATA VISUALIZATION
Learning Objectives
At the end of this module, learners are expected to:
1. Visualize categorical and numerical variables.
2. Construct and interpret a summary table, bar chart, and pie chart
for categorical variables.
3. Construct and interpret a contingency table and a stacked bar chart
for two categorical variables.
4. Construct and interpret a scatterplot, a bubble plot, histogram, and
a line chart with numerical variables.
5. Calculate and interpret summary measures (Descriptive measures).
6. Use boxplots and z-scores to identify outliers.
The 3 Stages of Business Analytics (Jaggia et al, 2021)

-Optimization
-Simulation
-Regression
-Supervised data mining
-Forecasting

-Data wrangling
-Data visualization
-Unsupervised data mining
Unsupervised Data Mining (Jaggia et al, 2021)
• Unsupervised data mining is a family of data mining techniques,
of which clustering is the best-known example.
• It aims to search for patterns and structure
among all the variables.
• Clustering is probably the most common
unsupervised method.
• Clustering is also known as “segmentation” in
marketing circles.
• Clustering aims to group entities (customers,
companies, cities, or whatever) into similar
clusters based on the values of their
variables.
Business Analytics by Albright and Winston, p.17
Unsupervised Data Mining (Jaggia et al, 2021)

• You group the data according to their
common properties or characteristics.
• You have no prior idea about the data
you are trying to investigate.
• You have no predefined or predetermined
objective function, and you do not predict any value.
• There is no expected outcome to
classify; you simply investigate
occurrences or trends as they appear in the data.
Supervised Data Mining (Jaggia et al, 2021)
• Supervised Data Mining is a data mining technique that orients or instructs the
machine about the expected output.
• Here, analysts or programmers train the machine using labeled data on how to classify
or predict possible outcomes based on stated conditions.
• Supervised Data Mining techniques include classification models where the target
variable is categorical and prediction models where the target variable is numerical.
Supervised Data Mining (Jaggia et al, 2021)
• Labeled data means it is already tagged with the right answer.
• That’s why it is called supervised – because there is a teacher or supervisor
that orients the machine on how to classify or predict possible outcome
based on stated condition.
• Examples of classification models for supervised data mining include
predicting whether or not a consumer will make a purchase, a mortgage will
be approved, a patient will have a certain illness, and an e-mail will be spam.
• Similarly, examples of prediction include predicting the sale price of a house,
the salary of a business school graduate, the total sales of a firm, and a debt
payment of a consumer.
The 3 Stages of Business Analytics (Jaggia & Kelly, 2021)
Descriptive Analytics refers to gathering, organizing, tabulating
and visualizing data to summarize “what has happened”.
Examples include financial reports, public health statistics,
enrollment at universities, student report cards, and crime rates
across regions and time.
The 3 stages of Business Analytics (Jaggia & Kelly, 2021)
Data wrangling is the process of retrieving, cleansing,
integrating, transforming and enriching data to support
analytics.
The key tasks during the data wrangling process are data
management, data inspection, data preparation,
and data transformation.
Key tasks in Data Wrangling
1. Data Management is the process that an organization uses to acquire,
organize, store, manipulate, and distribute data.
• A database is a collection of data logically organized to enable easy
retrieval, management, and distribution of data. The most common
type used by organizations is the relational database. A relational
database consists of one or more logically related data files, where each
data file is a two-dimensional grid that consists of rows and columns.
• A Database Management System (DBMS) is a software application for
defining, manipulating, and managing data in databases. Popular DBMS
packages include Oracle, IBM DB2, SQL Server, MySQL and Microsoft
Access.
Key tasks in Data Wrangling
2. Data Inspection
Once the raw data are extracted from the database (data warehouse or
data mart), they must be reviewed and inspected to assess data quality and
relevant information for subsequent analysis.
In addition to visually reviewing data, counting and sorting are among the
very first tasks most data analysts perform to gain a better understanding
of and insight into the data. Counting helps us verify that the data set is
complete or reveal that it has missing values, especially for important
variables, while sorting allows us to review the range of values for each
variable.
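The slides carry out counting and sorting in Excel. As a rough equivalent, the sketch below does the same two checks in Python using only the standard library. The records, the variable names, and the two helper functions are made up for illustration and do not come from the module.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"name": "Ana", "age": 34, "income": 52000},
    {"name": "Ben", "age": None, "income": 48000},
    {"name": "Cara", "age": 29, "income": None},
    {"name": "Dan", "age": 41, "income": 61000},
]

def count_missing(rows, variable):
    """Count the observations whose value for `variable` is missing."""
    return sum(1 for row in rows if row[variable] is None)

def value_range(rows, variable):
    """Sort the non-missing values and return the (smallest, largest) pair."""
    values = sorted(row[variable] for row in rows if row[variable] is not None)
    return values[0], values[-1]

missing_age = count_missing(records, "age")
income_range = value_range(records, "income")
print("missing ages:", missing_age)
print("income range:", income_range)
```

Counting tells us how complete each variable is, and the sorted range makes implausible values easy to spot before any analysis starts.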
The Excel Diagram
Welcome to Excel
Websites to visit:
Excel video training - Office Support (microsoft.com)
My Excel Power
Key tasks in Data Wrangling
2. Data Inspection:
A. Sorting using Excel
(https://www.excel-easy.com/data-analysis/sort.html)
B. Counting using Excel
(https://www.excel-easy.com/functions/count-sum-functions.html)
Key tasks in Data Wrangling
3. Data Preparation
A. Handling missing values. There are two common strategies to handle
observations with missing values: omission and imputation.
The omission strategy, also called complete-case analysis,
recommends that observations with missing values be excluded from the
analysis. This is appropriate when there are only a few missing values in the
data set.
The imputation strategy replaces missing values with some
reasonable imputed values, such as the mean value across relevant observations.
• For numerical variables, it is common to use mean imputation, provided
there are no extreme values and the data set is not skewed.
Key tasks in Data Wrangling
3. Data Preparation (Cont’d)
• If there are extreme values, either remove them or
use median imputation.
• For categorical variables, it is common to impute the most predominant
category, which is called mode imputation.
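The three imputation strategies can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the module: the `impute` helper and both data lists are hypothetical, and None stands for a missing value.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries using the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical data: 400 is an extreme value in the numerical variable.
incomes = [40, 45, 50, None, 400]
majors = ["Finance", "Marketing", None, "Finance"]

mean_filled = impute(incomes, "mean")      # pulled upward by the extreme value
median_filled = impute(incomes, "median")  # robust to the extreme value
mode_filled = impute(majors, "mode")       # most predominant category

print(mean_filled)
print(median_filled)
print(mode_filled)
```

Comparing the mean-imputed and median-imputed values shows why the median is preferred when extreme values are present.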
Key tasks in Data Wrangling
3. Data Preparation
B. Subsetting. It is the process of extracting the parts of a data set that are of
interest to the analytics professional. It can also be performed as part of
descriptive analytics to help reveal insights in the data. Subsetting is used:
• To pre-process the data prior to analysis
• To eliminate unwanted data in a set of observations
• To remove variables with an excessive amount of missing values
• To help reveal insights in the data
Subsetting can be easily performed in Excel using “Filter”.
(How to Filter in Excel - Easy Excel Tutorial)
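Outside Excel, the same idea as “Filter” can be sketched with list comprehensions in Python. The customer records and field names below are hypothetical and only illustrate row and column subsetting.

```python
# Hypothetical customer records.
customers = [
    {"id": 1, "region": "South", "spend": 120.0},
    {"id": 2, "region": "West", "spend": 35.5},
    {"id": 3, "region": "South", "spend": 80.0},
    {"id": 4, "region": "Midwest", "spend": 210.0},
]

# Row subsetting: keep only the observations of interest.
south = [row for row in customers if row["region"] == "South"]

# Column subsetting: keep only the variables of interest.
spend_only = [{"id": row["id"], "spend": row["spend"]} for row in south]

print(spend_only)
```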
Key tasks in Data Wrangling
4. Data Transformation
Data transformation is the process of converting data from one format or
structure to another. It is performed to meet the requirements of the statistical
and data mining techniques used for the analysis.
Examples of transforming numerical data include transforming an
individual’s date of birth to age, combining height and weight to create body
mass index, calculating percentages, or converting values to natural
logarithms.
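The transformations named above can be sketched as small Python functions. The input values are hypothetical, and the BMI formula used (weight in kilograms divided by height in metres squared) is the standard definition rather than something stated in the slides.

```python
import math
from datetime import date

def age_on(birth_date, as_of):
    """Age in completed years on the date `as_of`."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

def body_mass_index(weight_kg, height_m):
    """BMI: weight in kilograms divided by the square of height in metres."""
    return weight_kg / height_m ** 2

# Hypothetical inputs.
age = age_on(date(1990, 6, 15), date(2024, 1, 1))
bmi = body_mass_index(70.0, 1.75)
log_income = math.log(50000)  # natural logarithm

print(age, round(bmi, 2), round(log_income, 3))
```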
Key tasks in Data Wrangling
4. Data Transformation
A. Binning – is a common process of transforming numerical variables into
categorical variables by grouping the numerical values into a smaller
number of groups or bins.
B. Mathematical transformations (some examples)
1. creation of new variables by applying mathematical transformation of
existing variables
2. natural logarithm transformation and square root transformation to
reduce skewness of data
3. data rescaling using standardization or normalization.
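A minimal Python sketch of binning and rescaling follows. The bin edges, labels, and ages are hypothetical. Standardization is taken here to mean z-score rescaling and normalization to mean min-max rescaling onto [0, 1]; these are common definitions, assumed rather than quoted from the module.

```python
from statistics import mean, pstdev

def bin_label(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    Values above every edge receive the last label.
    """
    for upper, label in zip(edges, labels):
        if value <= upper:
            return label
    return labels[-1]

def standardize(values):
    """Z-score rescaling: subtract the mean, divide by the population SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def normalize(values):
    """Min-max rescaling onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 25, 40, 67]  # hypothetical
groups = [bin_label(a, [24, 64], ["young", "adult", "senior"]) for a in ages]

print(groups)
print(standardize(ages))
print(normalize(ages))
```

Binning turns the numerical ages into three categories, while both rescaling methods keep the variable numerical and only change its units.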
The 3 stages of Business Analytics (Jaggia & Kelly, 2021)
Data Visualization is the process of presenting data using
tabular and graphical tools as well as summary measures
that help us organize and present data.
Methods to visualize Categorical Variables
A. Summary tables for categorical variables
A summary table for a categorical variable groups the data into categories and
records the number of observations that fall into each category. The relative
frequency of each category equals the proportion of observations in each category.
(https://www.excel-easy.com/data-analysis/pivot-tables.html)
Methods to visualize Categorical Variables
Table 1.
Mode of transportation of students of an urban university
Interpretation:
Table 1 reveals that the most common commuting mode of
students at a certain university is public transportation, n = 273.
Walking and bicycling are the next most common commuting modes,
with n = 141 and n = 111, respectively. Furthermore, 18 of the
students use other modes of transportation.
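The relative frequencies behind such a summary table can be computed as below. The four counts are the ones quoted in the interpretation. Treating them as the complete table is an assumption, since the table itself is not reproduced here, so the resulting proportions are illustrative only.

```python
# Frequencies quoted in the interpretation of Table 1.
# Assumption: these four categories make up the whole table.
frequency = {
    "Public transportation": 273,
    "Walking": 141,
    "Bicycling": 111,
    "Other": 18,
}

total = sum(frequency.values())
relative_frequency = {cat: count / total for cat, count in frequency.items()}

# Print a simple summary table: category, frequency, relative frequency.
for cat, count in frequency.items():
    print(f"{cat:<22}{count:>5}{relative_frequency[cat]:>8.1%}")
```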
Methods to visualize Categorical Variables

B. Bar chart for categorical variables
A bar chart depicts the frequency or the
relative frequency for each category of
the variable as a series of horizontal or
vertical bars, the lengths of which are
proportional to the values that are to be
depicted.
(https://www.excel-easy.com/data-analysis/charts.html)
Methods to visualize Categorical Variables

C. Pie chart for categorical variables
A pie chart depicts the frequency or the
relative frequency for each category of
the variable as slices of a pie, the sectors of
which are proportional to the values that
are to be depicted.
(https://www.excel-easy.com/data-analysis/charts.html)
Methods to visualize Categorical Variables
Figure 1.
Location of some Philippine-based SMEs
Interpretation
Figure 1 reveals that the Philippine-based SMEs shown are
most commonly located in NCR and Region 3, each with
28%. The next most common location is Region 4A, with 23%.
Methods to visualize Numerical Variables
A. Frequency distribution table for numerical variables
For a numerical variable, a frequency
distribution groups data into intervals and
records the number of observations that fall
into each interval. The relative frequency for
each interval equals the proportion of
observations in each interval.
(https://www.excel-easy.com/examples/histogram.html)
Methods to visualize Numerical Variables
Table 2.
Frequency distribution for Growth
Interpretation:
As shown in Table 2, the data look more organized, although some detail is lost
since we can no longer see the actual
observations.
From the table, we can observe that
the most likely return for the Growth
variable is between 0% and 25%, with
48.57% of the observations. No
observations fall between 50% and 75%,
and only one observation falls between
75% and 100%.
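A frequency distribution of this kind can be built by assigning each observation to an interval and counting. In the sketch below, the returns are hypothetical (the Growth data are not reproduced here), and the convention that an interval contains values with lower < value <= upper is an assumption.

```python
import math
from collections import Counter

def interval_of(value, width=25):
    """Return (lower, upper) for the interval with lower < value <= upper."""
    upper = math.ceil(value / width) * width
    return upper - width, upper

# Hypothetical annual returns, in percent.
returns = [-12.0, 3.5, 8.0, 14.2, 22.9, 31.0, 47.5, 88.0]

counts = Counter(interval_of(r) for r in returns)
relative = {interval: n / len(returns) for interval, n in counts.items()}

# Print the intervals in ascending order with frequency and relative frequency.
for interval in sorted(counts):
    print(interval, counts[interval], f"{relative[interval]:.2%}")
```

Intervals with no observations simply do not appear in the output, which mirrors an empty row in the frequency table.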
Methods to visualize Numerical Variables
B. Histogram for numerical variables
A histogram is a series of rectangles where
the width and height of each rectangle
represent the interval width and frequency
(or relative frequency) of the respective
interval.
(https://www.excel-easy.com/data-analysis/charts.html)
Methods to visualize Numerical Variables
C. Line chart for numerical variables
Line charts are used to display trends
over time. Use a line chart if you have
text labels, dates or a few numeric labels
on the horizontal axis.
(https://www.excel-easy.com/examples/line-chart.html)
We observe from this figure that the population
of bears has grown consistently from 2017 to
2022.
Methods to Visualize the Relationship Between Two
Categorical Variables
A. Contingency tables
A contingency table shows the
frequencies for two categorical
variables, x and y, where each cell
represents a mutually exclusive
combination of the pair of x and y.
(How to Create a Contingency Table in Excel – Statology)
Table 3.
Contingency table for location and purchase example
From Table 3, it can be noted that of the 600 email
recipients, 410 made a purchase using the
promotional discount. However, there appear to be some
differences depending on location: recipients residing in the
South (130 out of 154) and West (101 out of 119) were a lot
more likely to make a purchase than those in the Midwest (77 out of 184)
and Northeast (102 out of 143).
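The comparison across locations can be checked by converting the counts quoted from Table 3 into purchase rates. The sketch below uses only those quoted numbers. The dictionary layout and variable names are illustrative choices.

```python
# (purchases, recipients) per location, as quoted from Table 3.
table = {
    "Midwest": (77, 184),
    "Northeast": (102, 143),
    "South": (130, 154),
    "West": (101, 119),
}

total_recipients = sum(n for _, n in table.values())
total_purchases = sum(p for p, _ in table.values())
purchase_rate = {loc: p / n for loc, (p, n) in table.items()}

# Rebuild the two columns of the contingency table plus the row-wise rate.
for loc, (p, n) in table.items():
    print(f"{loc:<10} purchase={p:>4}  no purchase={n - p:>4}  rate={purchase_rate[loc]:.1%}")
print("total:", total_purchases, "of", total_recipients)
```

The row totals add up to the 600 recipients and 410 purchases stated above, and the rates make the gap between the Midwest and the other regions explicit.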
Methods to Visualize the Relationship Between Two
Categorical Variables
B. Stacked Column Chart
The information in a contingency table
can be shown graphically using a stacked
column chart. Highlight the cells of the
contingency table, then choose Insert, Insert
column or bar chart, stacked column.
It is designed to visualize more than one
categorical variable, and it allows the
comparison of composition within each
category.
(https://www.youtube.com/watch?v=0zjbF9rTHA4)
Methods to Visualize the Relationship Between Two
Numerical Variables
A. Scatterplot
A scatterplot is a graphical tool that
helps in determining whether or not two
numerical variables are related in some
systematic way. Each point in a scatterplot
represents a paired observation for
the two variables.
Figure 2.
Scatterplot between growth and mutual funds value
From Figure 2, we can infer that there is a positive
relationship between Growth and Mutual fund values,
that is, as the annual return for growth increases, the
annual return for mutual fund value tends to increase as
well.
Methods to Visualize the Relationship Between Three
Numerical Variables
B. Bubble plot
A bubble plot shows the pattern of
relationship among three numerical
variables. The third numerical variable is
represented by the size of the bubble.
For example, plot life expectancy against
birth rate and use the size of the bubble
to represent each country’s GNP to
understand the
relationship among the three variables.
(How to quickly create bubble chart in Excel? (extendoffice.com))
Methods to Visualize the Relationship Between Three
Numerical Variables
Figure 3. A bubble plot of life expectancy, birth rate and GNP
Interpretation
We observe from Figure 3 that a
country’s average life expectancy and
birth rate display a negative
relationship. We also see that
countries with low birth rates and
high life expectancies have higher
GNP.
OTHER GRAPHICAL DISPLAYS
SUMMARY MEASURES
Excel and R Function names
Outliers
The boxplots on the left show several outliers in each data set,
which may have a significant effect on the normality of the
data if not treated properly.

https://github.com/ropensci/plotly/issues/1114
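The slides do not state the rule a boxplot uses to flag outliers. A common convention, assumed here, marks any value more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The sketch below applies that convention to hypothetical data. Quartile definitions vary between software packages, so other tools may place the fences slightly differently.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data; 60 sits far from the rest of the observations.
data = [12, 14, 15, 15, 16, 17, 18, 19, 60]
flagged = iqr_outliers(data)
print(flagged)
```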
Using the z-scores to check for Outliers:
1. Determine whether 75 is an outlier in a given
distribution with a mean of 60 and a standard
deviation of 10.
Solution:
Find the z-score that corresponds to X = 75.
$$z = \frac{x - \mu}{\sigma} = \frac{75 - 60}{10} = 1.5$$
X = 75 is not an outlier since its z-score is within the
acceptable interval [-3, 3].
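The same check can be written as a small Python sketch. The function names are illustrative. The cut-off of 3 follows the acceptable interval [-3, 3] used above, and the value 95 in the last line is a hypothetical extra case.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def is_outlier(x, mean, sd, limit=3.0):
    """Flag x when its z-score falls outside [-limit, limit]."""
    return abs(z_score(x, mean, sd)) > limit

z1 = z_score(75, 60, 10)
print(z1, is_outlier(75, 60, 10))

# Hypothetical extra case: 95 lies 3.5 standard deviations above the mean.
print(z_score(95, 60, 10), is_outlier(95, 60, 10))
```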
Using the z-scores to check for Outliers:

2. Given that the average hourly rate of a certain stock is
4.5% with a standard deviation of 1.75%, and assuming that the
data are normally distributed, tell whether each of the
following hourly rates is an outlier:
a. 6.23% b. 3.89%
Solution:
Find the z-score that corresponds to:
a. X = 6.23% or 0.0623
$$z = \frac{x - \mu}{\sigma} = \frac{0.0623 - 0.045}{0.0175} \approx 0.99$$
6.23% is not an outlier since its z-score is within the
acceptable interval [-3, 3].
Using the z-scores to check for Outliers:
2. b. 3.89%
Solution:
Find the z-score that corresponds to:
b. X = 3.89% or 0.0389
$$z = \frac{x - \mu}{\sigma} = \frac{0.0389 - 0.045}{0.0175} \approx -0.35$$
3.89% is not an outlier since its z-score is within the
acceptable interval [-3, 3].
Using the z-scores to check for Outliers:
3. Soundbar Sales
Christian and Alex work in a multimedia store where they sell soundbars.
For proper inventory and monitoring they recorded the number of sales they
made each month. In the past 12 months, they sold the following numbers
of soundbars:
Alex : 34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37.
Christian: 51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13.
Using the z-scores method, find out if 57 is an outlier for the sales of
Alex and find out if 6 is an outlier for the sales of Christian (Hint: Use JASP
to solve for sample mean & SD for each first) then briefly interpret your
answers/solutions.
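The hint suggests JASP. As an alternative for checking your work, the sketch below computes the sample mean, the sample standard deviation, and the two z-scores with Python's standard library. The helper name `sample_z` is illustrative, and interpreting the printed values against the interval [-3, 3] is left to the learner.

```python
from statistics import mean, stdev

alex = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]
christian = [51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13]

def sample_z(x, data):
    """z-score of x using the sample mean and sample standard deviation of data."""
    return (x - mean(data)) / stdev(data)

z_alex = sample_z(57, alex)
z_christian = sample_z(6, christian)

print("Alex:      mean =", round(mean(alex), 2), "SD =", round(stdev(alex), 2), "z(57) =", round(z_alex, 2))
print("Christian: mean =", round(mean(christian), 2), "SD =", round(stdev(christian), 2), "z(6) =", round(z_christian, 2))
```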
Reference
Definitions, selected tables and images were lifted from:

Business Analytics: Communicating with Numbers by Jaggia, S., Kelly,
A., Lertwachara, K. and Chen, L.
Copyright 2021 by McGraw-Hill Education.

Business Analytics: Data Analysis and Decision Making by Albright, S.C.
and Winston, W.L.
Copyright 2020 by Cengage Learning.
