Data Wrangling and Visualization

Uploaded by

Ysa Antonio

DATA WRANGLING AND

DATA VISUALIZATION
Learning Objectives
At the end of this module, learners are expected to:
1. Visualize categorical and numerical variables.
2. Construct and interpret a summary table, bar chart, and pie chart
for categorical variables.
3. Construct and interpret a contingency table and a stacked bar chart
for two categorical variables.
4. Construct and interpret a scatterplot, a bubble plot, histogram, and
a line chart with numerical variables.
5. Calculate and interpret summary measures (Descriptive measures).
6. Use boxplots and z-scores to identify outliers.
The 3 Stages of Business Analytics (Jaggia et al, 2021)

-Optimization
-Simulation
-Regression
-Supervised data mining
-Forecasting

-Data wrangling
-Data visualization
-Unsupervised data mining
Unsupervised Data Mining (Jaggia et al, 2021)
• Unsupervised data mining techniques work without a predefined target variable; clustering is one such technique.
• They aim to find patterns and structure among all the variables.
• Clustering is probably the most common unsupervised method.
• Clustering is also known as “segmentation” in marketing circles.
• Clustering aims to group entities (customers, companies, cities, or whatever) into similar clusters based on the values of their variables.
Business Analytics by Albright and Winston, p.17
Unsupervised Data Mining (Jaggia et al, 2021)

• You group the data according to their common properties or characteristics.
• You don’t have any prior idea about the data you are trying to investigate.
• You don’t have a predefined objective function, and you do not predict any value.
• There is no expected outcome to classify; you simply investigate naturally occurring patterns or trends.
Supervised Data Mining (Jaggia et al, 2021)
• Supervised data mining is a data mining technique that orients or instructs the machine about the expected output.
• Here, analysts or programmers train the machine using labeled data on how to classify or predict possible outcomes based on stated conditions.
• Supervised data mining techniques include classification models, where the target variable is categorical, and prediction models, where the target variable is numerical.
Supervised Data Mining (Jaggia et al, 2021)
• Labeled data means the data is already tagged with the right answer.
• That’s why it is called supervised: there is a teacher or supervisor that orients the machine on how to classify or predict possible outcomes based on stated conditions.
• Examples of classification models for supervised data mining include predicting whether or not a consumer will make a purchase, a mortgage will be approved, a patient will have a certain illness, and an e-mail is spam.
• Similarly, examples of prediction models include predicting the sale price of a house, the salary of a business school graduate, the total sales of a firm, and the debt payment of a consumer.
The 3 Stages of Business Analytics (Jaggia & Kelly, 2021)
Descriptive Analytics – refers to gathering, organizing, tabulating, and visualizing data to summarize “what has happened”. Examples include financial reports, public health statistics, enrollment at universities, student report cards, and crime rates across regions and time.
The 3 stages of Business Analytics (Jaggia & Kelly, 2021)
Data wrangling is the process of retrieving, cleansing, integrating, transforming, and enriching data to support analytics.
The key tasks during the data wrangling process are data management, data inspection, data preparation, and data transformation.
Key tasks in Data Wrangling
1. Data Management – is a process that an organization uses to acquire, organize, store, manipulate, and distribute data.
• A database is a collection of data logically organized to enable easy retrieval, management, and distribution of data. The most common type used by organizations is the relational database. A relational database consists of one or more logically related data files, where each data file is a two-dimensional grid that consists of rows and columns.
• A Database Management System (DBMS) is a software application for defining, manipulating, and managing data in databases. Popular DBMS packages include Oracle, IBM DB2, SQL Server, MySQL, and Microsoft Access.
Key tasks in Data Wrangling
2. Data Inspection
Once the raw data are extracted from the database (data warehouse or
data mart), they must be reviewed and inspected to assess data quality and
relevant information for subsequent analysis.
In addition to visually reviewing data, counting and sorting are among the very first tasks most data analysts perform to gain a better understanding of and insights into the data. Counting helps us verify that the data set is complete or reveals that it has missing values, especially for important variables, while sorting allows us to review the range of values for each variable.
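The counting and sorting checks described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the records, the column names, and the helper functions `count_missing` and `value_range` are hypothetical and do not come from the module, which performs these steps in Excel.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"name": "A", "age": 34, "income": 52000},
    {"name": "B", "age": None, "income": 61000},
    {"name": "C", "age": 29, "income": None},
    {"name": "D", "age": 41, "income": 48000},
]

def count_missing(rows, column):
    """Count how many rows have no value for the given column."""
    return sum(1 for row in rows if row.get(column) is None)

def value_range(rows, column):
    """Sort the non-missing values and return (smallest, largest)."""
    values = sorted(row[column] for row in rows if row.get(column) is not None)
    return values[0], values[-1]

missing_age = count_missing(records, "age")
income_range = value_range(records, "income")
print("Missing ages:", missing_age)
print("Income range:", income_range)
```

Counting flags the incomplete variable, and the sorted range makes implausible minimum or maximum values easy to spot.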
The Excel Diagram
Welcome to Excel
Websites to visit:
Excel video training - Office Support (microsoft.com)
My Excel Power
Key tasks in Data Wrangling
2. Data Inspection:
A. Sorting using Excel
(https://www.excel-easy.com/data-analysis/sort.html)
B. Counting using Excel
(https://www.excel-easy.com/functions/count-sum-functions.html)
Key tasks in Data Wrangling
3. Data Preparation
A. Handling missing values. There are two common strategies to handle observations with missing values: omission and imputation.
The omission strategy, also called complete-case analysis, recommends that observations with missing values be excluded from the analysis. This is appropriate when there are only a few missing values in the data set.
The imputation strategy replaces missing values with reasonable imputed values, such as the mean value across relevant observations.
• For numerical variables, it is common to use mean imputation, provided there are no extreme values and the data set is not skewed.
Key tasks in Data Wrangling
3. Data Preparation (Con’t)
• If there are extreme values, either remove them or use median imputation.
• For categorical variables, it is common to impute the most predominant category, which is called mode imputation.
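The omission and imputation strategies can be sketched as follows. The data values and the helper names `omit` and `impute` are hypothetical, chosen only for illustration; the sketch assumes missing values are stored as None.

```python
from statistics import mean, median, mode

def omit(values):
    """Omission (complete-case analysis): drop the missing observations."""
    return [v for v in values if v is not None]

def impute(values, strategy):
    """Replace missing values with the mean, median, or mode of the observed ones."""
    observed = omit(values)
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical numerical variable with one extreme value (250).
incomes = [40, 50, None, 60, 250]
# Hypothetical categorical variable.
commute = ["bus", "walk", None, "bus"]

print(omit(incomes))               # the missing observation is excluded
print(impute(incomes, "mean"))     # mean is pulled up by the extreme value
print(impute(incomes, "median"))   # median is less affected by it
print(impute(commute, "mode"))     # most predominant category
```

In this made-up example the mean of the observed incomes is 100 while the median is 55, which shows why median imputation is preferred when extreme values are present.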
Key tasks in Data Wrangling
3. Data Preparation
B. Subsetting. It is the process of extracting the parts of a data set that are of interest to the analytics professional. It can also be performed as part of descriptive analytics to help reveal insights in the data. Subsetting is used:
• To pre-process the data prior to analysis
• To eliminate unwanted data in a set of observations
• To remove variables with an excessive amount of missing values
• To help reveal insights in the data
Subsetting can be easily performed in Excel using “Filter”.
(How to Filter in Excel - Easy Excel Tutorial)
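Outside Excel, the same filtering idea can be expressed as a condition applied to each row. The sketch below uses a hypothetical list of student records; the field names and values are assumptions made for illustration.

```python
# Hypothetical data set; None marks a missing value.
students = [
    {"id": 1, "program": "Accountancy", "gpa": 3.4},
    {"id": 2, "program": "Marketing", "gpa": None},
    {"id": 3, "program": "Accountancy", "gpa": 2.9},
    {"id": 4, "program": "Finance", "gpa": 3.8},
]

# Keep only the observations of interest (one program).
accountancy = [row for row in students if row["program"] == "Accountancy"]

# Eliminate unwanted observations (rows with a missing GPA).
complete = [row for row in students if row["gpa"] is not None]

accountancy_ids = [row["id"] for row in accountancy]
complete_ids = [row["id"] for row in complete]
print(accountancy_ids, complete_ids)
```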
Key tasks in Data Wrangling
4. Data Transformation
Data transformation is the process of converting data from one format or structure to another. It is performed to meet the requirements of the statistical and data mining techniques used for the analysis.
Examples of transforming numerical data include transforming an individual’s date of birth to age, combining height and weight to create body mass index, calculating percentages, or converting values to natural logarithms.
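The transformations listed above can be sketched in Python as follows. The function names and input values are hypothetical; the body mass index formula assumed here is the usual weight in kilograms divided by height in meters squared.

```python
import math
from datetime import date

def age_on(birth_date, as_of):
    """Convert a date of birth to age in completed years on a given date."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def body_mass_index(weight_kg, height_m):
    """Combine weight and height into body mass index (kg per square meter)."""
    return weight_kg / height_m ** 2

def percentage(part, whole):
    """Express part as a percentage of whole."""
    return 100 * part / whole

age = age_on(date(2000, 6, 15), date(2024, 6, 14))   # day before the birthday
bmi = body_mass_index(70, 1.75)
share = percentage(25, 200)
log_sales = math.log(1000.0)                         # natural logarithm

print(age, round(bmi, 2), share, round(log_sales, 4))
```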
Key tasks in Data Wrangling
4. Data Transformation
A. Binning – is a common process of transforming numerical variables into
categorical variables by grouping the numerical values into a smaller
number of groups or bins.
B. Mathematical transformations (some examples)
1. creation of new variables by applying mathematical transformation of
existing variables
2. natural logarithm transformation and square root transformation to
reduce skewness of data
3. data rescaling using standardization or normalization.
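Binning and rescaling can be sketched as below. The ages, bin edges, and labels are hypothetical. Standardization is taken here to mean subtracting the mean and dividing by the sample standard deviation, and normalization to mean min-max scaling to the range 0 to 1; both readings are assumptions about the intended definitions.

```python
from statistics import mean, stdev

def bin_label(value, upper_edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    labels has one more entry than upper_edges; the last label is the
    open-ended top bin.
    """
    for upper, label in zip(upper_edges, labels):
        if value <= upper:
            return label
    return labels[-1]

def standardize(values):
    """Rescale to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def min_max_normalize(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 50, 60]  # hypothetical numerical variable
groups = [bin_label(a, [30, 50], ["young", "middle", "senior"]) for a in ages]
z_values = standardize(ages)
scaled = min_max_normalize(ages)

print(groups)
print([round(z, 2) for z in z_values])
print(scaled)
```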
The 3 stages of Business Analytics (Jaggia & Kelly, 2021)
Data Visualization is the process of presenting data using tabular and graphical tools as well as summary measures that help us organize and present data.
Methods to visualize Categorical Variables
A. Summary tables for categorical
variables
A summary table for a categorical
variable groups the data into categories and
records the number of observations that fall
into each category. The relative frequency
of each category equals the proportion of
observations in each category.
(https://www.excel-easy.com/data-analysis/pivot-tables.html)
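A summary table with frequencies and relative frequencies can be sketched with Python's `collections.Counter`. The observations below are hypothetical and are not the data behind the module's tables.

```python
from collections import Counter

# Hypothetical categorical observations.
modes = ["Bus", "Walk", "Bus", "Bicycle", "Bus", "Walk", "Bus", "Bicycle"]

counts = Counter(modes)        # frequency of each category
total = sum(counts.values())   # number of observations

# Map each category to (frequency, relative frequency).
summary = {category: (n, n / total) for category, n in counts.most_common()}

for category, (n, share) in summary.items():
    print(f"{category:10s} {n:3d} {share:6.2%}")
```

The relative frequencies sum to 1, which is a quick check that no observation was dropped.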
Methods to visualize Categorical Variables
Table 1. Mode of transportation of students of an urban university
Interpretation:
Table 1 reveals that the most common commuting mode of students at a certain university is public transportation, n = 273. Walking and bicycling are the next most common commuting modes, with n = 141 and n = 111, respectively. Furthermore, 18 of the students utilize other modes of transportation.
Methods to visualize Categorical Variables

B. Bar Chart for categorical variables
A bar chart depicts the frequency or the relative frequency for each category of the variable as a series of horizontal or vertical bars, the lengths of which are proportional to the values that are to be depicted.
(https://www.excel-easy.com/data-analysis/charts.html)
Methods to visualize Categorical Variables

C. Pie Chart for Categorical Variables
A pie chart depicts the frequency or the relative frequency for each category of the variable as slices of a pie, the sectors of which are proportional to the values that are to be depicted.
(https://www.excel-easy.com/data-analysis/charts.html)
Methods to visualize Categorical Variables
Figure 1. Location of some Philippine-based SMEs
Interpretation
Figure 1 reveals that the Philippine-based SMEs in the sample are most commonly located in NCR, with 28%, and Region 3, also with 28%. The next most common location is Region 4A, with 23%.
Methods to visualize Numerical Variables
A. Frequency distribution table for
numerical variables
For a numerical variable, a frequency distribution groups data into intervals and records the number of observations that fall into each interval. The relative frequency for each interval equals the proportion of observations in each interval.
(https://www.excel-easy.com/examples/histogram.html)
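A frequency distribution for a numerical variable can be sketched as below. The return values are hypothetical, and the sketch assumes equal-width intervals that include their lower limit and exclude their upper limit; other tools may use the opposite convention.

```python
def frequency_distribution(values, start, width, n_bins):
    """Count observations per interval [lower, lower + width).

    Returns a list of (lower, upper, frequency, relative_frequency) tuples.
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    total = sum(counts)
    return [
        (start + i * width, start + (i + 1) * width, c, c / total)
        for i, c in enumerate(counts)
    ]

# Hypothetical annual returns in percent.
returns = [-10, 5, 12, 24, 30, 48, 80]
table = frequency_distribution(returns, start=-25, width=25, n_bins=5)
frequencies = [row[2] for row in table]

for lower, upper, freq, rel in table:
    print(f"{lower:4d} up to {upper:4d}: {freq}  ({rel:.2%})")
```

The same counts are the bar heights of the corresponding histogram.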
Methods to visualize Numerical Variables
Table 2. Frequency distribution for Growth
Interpretation:
As shown in Table 2, the data look more organized, although some detail is lost since we can no longer see the actual observations.
From the table, we can observe that the most likely return for the Growth variable is between 0% and 25%, with 48.57% of the observations. No observations fall between 50% and 75%, and only one observation falls between 75% and 100%.
Methods to visualize Numerical Variables
B. Histogram for numerical
variables
A histogram is a series of rectangles where
the width and height of each rectangle
represent the interval width and frequency
(or relative frequency) of the respective
interval.
(https://www.excel-easy.com/data-analysis/charts.html)
Methods to visualize Numerical Variables
C. Line chart for numerical
variables
Line charts are used to display trends
over time. Use a line chart if you have
text labels, dates or a few numeric labels
on the horizontal axis.
(https://www.excel-easy.com/examples/line-chart.html)
We observe from this figure that the population of bears has grown consistently from 2017 to 2022.
Methods to Visualize the Relationship Between Two
Categorical Variables
A. Contingency tables
A contingency table shows the frequencies for two categorical variables, x and y, where each cell represents a mutually exclusive combination of the pair of x and y.
(How to Create a Contingency Table in Excel - Statology)
Table 3. Contingency table for location and purchase example
From Table 3, it can be noted that of the 600 email recipients, 410 made a purchase using the promotional discount. However, there appear to be some differences depending on location: recipients residing in the South (130 out of 154) and West (101 out of 119) were a lot more likely to make a purchase than those in the Midwest (77 out of 184) and Northeast (102 out of 143).
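The counts quoted for Table 3 are enough to rebuild the contingency table and compare purchase rates by location. In this sketch the "No purchase" cells are derived by subtraction, and the variable names are assumptions made for illustration.

```python
# (purchases, recipients) per location, as quoted in the Table 3 interpretation.
quoted = {
    "South": (130, 154),
    "West": (101, 119),
    "Midwest": (77, 184),
    "Northeast": (102, 143),
}

# Each cell is a mutually exclusive (location, outcome) combination.
table = {
    location: {"Purchase": bought, "No purchase": recipients - bought}
    for location, (bought, recipients) in quoted.items()
}

total_purchases = sum(cells["Purchase"] for cells in table.values())
grand_total = sum(sum(cells.values()) for cells in table.values())

# Row proportions: share of recipients in each location who purchased.
purchase_rate = {
    location: cells["Purchase"] / sum(cells.values())
    for location, cells in table.items()
}

print(total_purchases, grand_total)
for location, rate in purchase_rate.items():
    print(f"{location:10s} {rate:.1%}")
```

The row proportions, rather than the raw counts, are what make the locations comparable, since the number of recipients differs by location.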
Methods to Visualize the Relationship Between Two
Categorical Variables
B. Stacked Column Chart
The information in a contingency table
can be shown graphically using a stacked
column chart. Highlight the cells of the
contingency table, choose Insert, Insert
column or bar chart, stacked column.
It is designed to visualize more than one categorical variable, and it allows the comparison of composition within each category.
https://www.youtube.com/watch?v=0zjbF9rTHA4
Methods to Visualize the Relationship Between Two
Numerical Variables
A. Scatterplot
A scatterplot is a graphical tool that helps in determining whether or not two numerical variables are related in some systematic way. Each point in a scatterplot represents a paired observation for the two variables.
Figure 2. Scatterplot between growth and mutual funds value
From Figure 2, we can infer that there is a positive relationship between Growth and Mutual fund values; that is, as the annual return for growth increases, the annual return for mutual fund value tends to increase as well.
Methods to Visualize the Relationship Between Three
Numerical Variables
B. Bubble plot
A bubble plot shows the pattern of
relationship between three numerical
variables. The third numerical variable is
represented by the size of the bubble.
For example, plot life expectancy against birth rate and use the size of the bubble to represent the countries’ GNP to enable one to understand the relationship between the 3 variables.
(How to quickly create bubble chart in Excel? - extendoffice.com)
Methods to Visualize the Relationship Between Three Numerical Variables
Figure 3. A bubble plot of life expectancy, birth rate and GNP
Interpretation
We observe from Figure 3 that a country’s average life expectancy and birth rate display a negative relationship. We also see that countries with low birth rates and high life expectancies have higher GNP.
OTHER GRAPHICAL DISPLAYS
SUMMARY MEASURES
Excel and R Function names
Outliers
The boxplots on the left show several to many outliers in each data set. If not treated well, these outliers may have a significant effect on the normality of the data.

https://github.com/ropensci/plotly/issues/1114
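A boxplot conventionally flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. That rule is not stated in the slide, so the sketch below treats it as an assumption; the data are hypothetical, and quartile values can differ slightly between software packages.

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Return the values lying outside the boxplot fences.

    Fences are Q1 - factor * IQR and Q3 + factor * IQR (assumed convention).
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical data set with one unusually large observation.
data = [10, 12, 13, 14, 15, 16, 18, 45]
outliers = iqr_outliers(data)
print(outliers)
```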
Using the z-scores to check for Outliers:
1. Determine whether 75 is an outlier in a given distribution with a mean of 60 and a standard deviation of 10.
Solution:
Find the z-score that corresponds to X = 75.
$$z = \frac{x - \mu}{\sigma} = \frac{75 - 60}{10} = 1.5$$
X = 75 is not an outlier since it is within the acceptable interval [-3, 3].
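The z-score check can be sketched as a small Python helper. The function names are hypothetical; the cutoff of 3 follows the acceptable interval [-3, 3] used in this module.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def is_outlier(x, mean, sd, limit=3.0):
    """Flag x as an outlier when its z-score falls outside [-limit, limit]."""
    return abs(z_score(x, mean, sd)) > limit

z = z_score(75, mean=60, sd=10)
flagged = is_outlier(75, mean=60, sd=10)
print(z, flagged)
```

For the example above, the helper gives a z-score of 1.5 and does not flag 75 as an outlier.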
Using the z-scores to check for Outliers:

2. Given that the average hourly rate in a certain stock is 4.5% with a standard deviation of 1.75%, and assuming that the data is normally distributed, tell whether each of the following hourly rates is an outlier:
a. 6.23% b. 3.89%
Solution:
Find the z-score that corresponds to:
a. X = 6.23% or 0.0623
$$z = \frac{x - \mu}{\sigma} = \frac{0.0623 - 0.045}{0.0175} \approx 0.99$$
6.23% is not an outlier since it is within the acceptable interval [-3, 3].
Using the z-scores to check for Outliers:
2. b. 3.89%
Solution:
Find the z-score that corresponds to:
b. X = 3.89% or 0.0389
$$z = \frac{x - \mu}{\sigma} = \frac{0.0389 - 0.045}{0.0175} \approx -0.35$$
3.89% is not an outlier since it is within the acceptable interval [-3, 3].
Using the z-scores to check for Outliers:
3. Soundbar Sales
Christian and Alex work in a multimedia store where they sell soundbars.
For proper inventory and monitoring they recorded the number of sales they
made each month. In the past 12 months, they sold the following numbers
of soundbars:
Alex : 34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37.
Christian: 51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13.
Using the z-scores method, find out if 57 is an outlier for the sales of
Alex and find out if 6 is an outlier for the sales of Christian (Hint: Use JASP
to solve for sample mean & SD for each first) then briefly interpret your
answers/solutions.
Reference
Definitions, selected tables and images were lifted from:

Business Analytics: Communicating with Numbers by Jaggia, S., Kelly,

Business Analytics: Data Analysis and Decision Making by Albright, S.C.
