
Introduction and ETL

This document summarizes a course on data analytics and visualization. The course aims to help students understand data analytics concepts, perform Extract-Transform-Load processes, conduct descriptive analytics for business intelligence, construct predictive models for different business applications, and develop dashboards for data visualization. The course structure covers topics such as introductions to data analytics, ETL processes, descriptive analytics, predictive modelling, and data visualization.


MEME19403 Data Analytics and Visualization
Master of Mathematics,
DMAS, UTAR

Instructor: Dr. Chang Yun Fah


Course Outcomes
 Understand the data analytics concepts;
 Perform the Extract-Transform-Load (ETL) process;
 Perform descriptive analytics for business intelligence;
 Construct predictive models for different business applications;
 Develop dashboards for data visualization.

Prepared by: Dr. Chang Yun Fah 2


Course Structure
Topic 1: Introduction to Data Analytics
 What is data analysis/analytics
 Analytics hierarchy & CRISP-DM model
Topic 2: ETL Processes
 Getting started with Excel, data sources
 Extract/integrate data, data cleansing/normalization
Topic 3: Descriptive Analytics
 Descriptive statistics & frequency table
 Relationship & PCA
Topic 4: Predictive Modelling
 Unsupervised/supervised learning
 Model validation & optimization
Topic 5: Data Visualization
 Good practices & chart types
 Building dashboard


Topic 1: Introduction to Data Analytics


Introduction to Data Analytics
 Business domains of data science.
 Data science process and skill sets.
 What is data analysis and data analytics?
 Analytics hierarchy and types.
 The CRISP-DM model



Basic Domains of Data Science



Data Science Process
[Figure: data science process flow between business/public domains and interface/users, in four stages.
 Store and Process: raw data is collected, data is processed, clean/integrate dataset.
 Analyse and Model: exploratory data analytics, models & algorithms.
 Understand and Decide: business insights.
 Communicate: data product, visualize, report.]

Skill Sets of a Data Scientist
[Figure: Venn diagram. Data Science sits at the intersection of Computer Science, Maths & Statistics, and Domain Knowledge. The overlaps are labelled Machine Learning, IoT & Data Processing, and Traditional Research.]


Analytics
 Definition: The discovery, interpretation and communication of meaningful patterns in data, using tabulation and visualization techniques to recommend actions or guide decision making based on business insights produced from data analysis.
 Analytics relies on the simultaneous application of computer programming and quantitative techniques such as statistics and operations research to quantify performance.


Differences between Analytics and Analysis

Analytics
Insights
Data Analysis: Looking at historical data uncovers insights into what worked, what did not, and what can be expected of a product or service.
Business Context: Helps organizations use the potential of their data to identify new opportunities, and helps the business outline the way forward.
• Continuous and iterative exploration and investigation
• Skills, technologies, applications and practices
• Gain insights and drive planning

Analytics Hierarchy Chart
Analytics
 Business Intelligence: OLAP (Queries), Reports & Dashboards, Data Discovery
 Advanced Analytics: Descriptive Modelling, Predictive Analytics, Optimization & Simulation, Text Analytics, Multimedia Analytics
Data Infrastructure: RDBMS, Hadoop, Text Indexing, NoSQL, Files
 Structured Data: tables
 Semi-Structured Data: XML, graphs, series
 Unstructured Data: texts, images, audio, video


Descriptive Analytics
• Provides insight into what has happened.
• Looks at past performance and understands that performance by mining historical data to look for the reasons behind past success or failure.
• Almost all management reporting, such as sales, marketing, operations and finance, uses this type of post-mortem analysis.

Predictive Analytics
• Helps model and forecast what might happen.
• Uses historical data to determine the probable future outcome of an event or the likelihood of a situation occurring.
• Encompasses a variety of statistical techniques from modeling, machine learning, data mining and game theory that analyze current and historical facts to make predictions about future events.

Prescriptive Analytics
• Determines the best solution or outcome among various choices, given the known parameters.
• Automatically synthesizes big data, mathematical sciences, business rules and machine learning to make predictions, and then suggests decision options to take advantage of the predictions.

CRISP-DM Model



Topic 2: ETL Processes with Excel


ETL Processes
 Data sources and types
 Extracting and integrating data
 Data cleansing: pivot & de-pivot table, missing value imputation, outlier detection
 Data preparation: normalization


Getting Started with Excel 2013
 Enabling the Developer tab and macros
 Downloading add-ins for Excel
 Introducing Visual Basic for Applications (VBA)


Developer and Macros
 MS Excel is one of the most powerful applications in the Office suite. It lets you add programmatic power to automate tasks.
 Two examples are Visual Basic for Applications (VBA) and macros, both enabled through the Developer tab.
 The Developer tab lets you add code to your Excel workbook and access various options.

To show the Developer tab:
1. Click File.
2. Click Options.
3. Select Customize Ribbon.
4. Go to the Main Tabs list.
5. Check Developer.

 The Developer tab contains useful options in 5 categories for developers.
1. Code
 Visual Basic: launches the VB Editor.
 Macros: displays the Macro dialog, where you can choose to run or edit a macro.
 Record Macro: begins the process of recording a macro.
 Use Relative References: toggles between relative and absolute recording. Relative: Excel records that you moved down 3 cells. Absolute: Excel records that you selected cell A4.
 Macro Security: sets the security options for macros.
2. Add-Ins
 Add-Ins: enable Excel add-ins such as the Solver.
 COM Add-Ins: enable add-ins built on the Component Object Model (COM), which is useful for developers.
3. Controls
 Adds user-interface controls such as ActiveX controls to your Excel spreadsheet. You can switch between Code and Design Mode, edit a control's Properties and run a dialog.
4. XML
 Contains options for working with XML in Excel, such as viewing the XML source, importing and exporting XML files, and mapping and refreshing XML data.
5. Modify
 Contains the Document Panel option only.


 In order to run macros without security warnings, you may enable all macros:
1. Click the Developer tab.
2. Click Macro Security.
3. Select Macro Settings.
4. Choose “Enable all macros”.
5. Check “Trust access to the VBA project object model”.


Enable Regular Add-Ins
 An add-in is software that adds new features to Excel, created by developers using VBA.
 Add-ins save you time, help to avoid errors, and do in minutes repetitive work that could take hours manually.
 Excel built-in add-ins: Analysis ToolPak, Analysis ToolPak-VBA and Solver.
 Many add-ins are available, and some are free of charge.

1. Click the Developer tab.
2. Click Excel Add-Ins.
3. Check Analysis ToolPak, Analysis ToolPak-VBA and Solver Add-In.
• Analysis ToolPak: provides data analysis tools for statistical and engineering analysis.
• Analysis ToolPak-VBA: VBA functions for Analysis ToolPak.
• Solver Add-In: tool for optimization and equation solving.


Download free Add-Ins
We recommend the following free add-ins:
 Statistical Analysis add-in
 Real Statistics Resource Pack
 http://www.real-statistics.com/free-download/real-statistics-resource-pack/
 Statistical tools: distributions, ANOVA, correlation, non-parametric tests, time series, survival analysis, missing values, reliability, regression models, multivariate analysis, etc.


1. Browse the Real Statistics Resource Pack webpage.
2. Click Free Download.
3. Save “RealStats.xlsm” in the folder C:\Users\user-name\AppData\Roaming\Microsoft\AddIns.
4. Click the Developer tab.
5. Click Add-Ins.


1. Click Browse.
2. Open “RealStats” from the folder.
3. Click OK and check “Realstats”.
4. The Add-Ins tab is created.
5. Click Real Statistics.
Note: You may need to enable the add-ins every time you start Excel. This is a common problem with Excel 2013.


75 Add-Ins for MS Excel
 https://www.powerusersoftwares.com/single-post/2016/08/22/75-of-the-best-add-ins-plugins-and-apps-for-Microsoft-Excel-free-or-not
 GIGRAPH-Network Visualization: turns a table into a network
 Geographic Heat Map
 Power Query (Excel 2010 & 2013 only): import, transform or combine multiple data sources.

ETL Processes
 Extracting and integrating various data into
single worksheet
 Data management and cleaning



Case Study 1:
You are the marketing manager of a supermarket. You want to understand your customers' characteristics by performing customer segmentation. However, the data on product pricing, customer information and customer distance from the supermarket are scattered across different sources. Thus, you need to extract the data, combine them into a single database and perform data cleansing before you can cluster your customers into different segments.

Data Source: Pennacchioli, D., Coscia, M., Rinzivillo, S., Pedreschi, D. and Giannotti, F., Explaining the Product Range Effect in Purchase Data. In BigData, 2013.

Understand the data
 File name: supermarket_prices
 Two columns. The first column is the product id. The second column is its unit price.
 The price is in Euro and is calculated as the average unit price over the time span of the dataset.


Excel: Importing Data from Text or CSV File
1. Click the Data tab.
2. Click From Text.
3. Select your text file.
4. Click Import.
5. Select the delimited file type.
6. Set the row at which the import starts.
7. Click Next.
8. Select the delimiter.
9. Click Next.
10. Choose the column data format.
11. Click Finish.
12. Define the data location.
13. Click OK.

The imported sheet contains the ID and price of 4657 products.
Save the file as “supermarket_prices.xlsx”.
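The import wizard above can also be mirrored in a script. The following is a minimal Python sketch, assuming a two-column delimited text file (product id, unit price). The sample rows, the comma delimiter and the function name `read_prices` are illustrative assumptions and are not taken from the course dataset.

```python
import csv
import io

def read_prices(text, delimiter=",", skip_rows=0):
    """Parse delimited text into a {product_id: unit_price} dictionary."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    prices = {}
    for row_number, row in enumerate(reader):
        if row_number < skip_rows or not row:
            continue  # plays the role of the wizard's "start import at row" setting
        product_id = row[0].strip()
        prices[product_id] = float(row[1])  # the column data format: numeric price
    return prices

# Hypothetical sample rows, used only to illustrate the parsing.
sample = "product_id,price\n101,2.50\n102,0.99\n"
prices = read_prices(sample, skip_rows=1)
print(prices)
```

The `delimiter` and `skip_rows` parameters correspond to the "select delimiter" and "set row to start" steps of the wizard.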


Understand the data
 File name: supermarket_distances
 Three columns. The first column is the customer id, the second is the shop id and the third is the distance between the customer’s house and the shop location.
 The distance is calculated in meters as a straight line, so it does not take the road graph into account.

Extract supermarket_distance.csv and save it in xlsx format.

Understand the data
 File name: supermarket_purchases
 Four columns. The first column is the customer id, the second is the product id, the third is the shop id and the fourth is the total quantity of the product that the customer bought in that particular shop.
 The data is recorded from January 2007 to December 2011.

Filename                    Worksheet label    # of observations
Supermarket_purchase_1      1                  15630
Supermarket_purchase_2      2                  14634
Supermarket_purchase_345    3                  1737
Supermarket_purchase_345    4                  2221
Supermarket_purchase_345    5                  3294
Total                                          37516


Combine Worksheets from Active Workbook
 Open the Excel file: Supermarket_purchase_345
 The following VBA code gathers the data from all worksheets of the active workbook into a single new worksheet.
 All worksheets must have the same field structure, the same column headings and the same column order.
 Your data must start from A1; otherwise the code will not work.


1. Press ALT & F11 to open the VBA window.
2. Click Insert.
3. Click Module.
4. Write the given code.
5. Press F5 to execute the code.
Note: “A1048576” in the code refers to column A and the last row (1048576) of the worksheet.

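Sub Combine()
    ' Copies the data of every worksheet in the active workbook
    ' into a new first worksheet named "Combined".
    Dim J As Long
    Dim Src As Range

    ' Insert the destination sheet in front of the existing sheets.
    Worksheets.Add Before:=Sheets(1)
    Sheets(1).Name = "Combined"

    ' Copy the header row once, from the first data sheet.
    Sheets(2).Range("A1").EntireRow.Copy Destination:=Sheets(1).Range("A1")

    For J = 2 To Sheets.Count
        ' CurrentRegion is the contiguous block of data that starts at A1.
        Set Src = Sheets(J).Range("A1").CurrentRegion
        If Src.Rows.Count > 1 Then
            ' Skip the header row and paste below the last used row of column A.
            Src.Offset(1, 0).Resize(Src.Rows.Count - 1).Copy _
                Destination:=Sheets(1).Range("A1048576").End(xlUp).Offset(1, 0)
        End If
    Next J
End Sub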


 A new worksheet “Combined” is created with 162635 observations, combining Supermarket_purchase_3, 4 and 5.
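The logic of the macro (copy the header once, then stack the data rows of every sheet) can be illustrated outside Excel. The Python sketch below is an illustration only: each worksheet is assumed to be a list of rows whose first row is the header, and the function name `combine_sheets` and the miniature sheets are hypothetical.

```python
def combine_sheets(sheets):
    """Stack several sheets (lists of rows) that share one header row."""
    header = next((sheet[0] for sheet in sheets if sheet), None)
    if header is None:
        return []  # no sheet contains any rows
    combined = [header]
    for sheet in sheets:
        if not sheet:
            continue  # completely empty sheet: nothing to copy
        if sheet[0] != header:
            raise ValueError("all sheets must share the same header row")
        combined.extend(sheet[1:])  # skip the header, keep the data rows
    return combined

# Hypothetical miniature worksheets with the four purchase columns.
header = ["customer_id", "product_id", "shop_id", "quantity"]
sheet3 = [header, [1, 10, 1, 2], [2, 11, 1, 1]]
sheet4 = [header, [3, 10, 2, 5]]
sheet5 = [header]  # header only, no data rows
combined = combine_sheets([sheet3, sheet4, sheet5])
print(len(combined) - 1, "data rows")
```

The header check reflects the requirement that all worksheets have the same column headings and column order.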


Queries (install Power Query for Excel 2013 & 2010)
 Append worksheets from the active workbook
 Append worksheets from different workbooks
 Merge worksheets using different join types (left outer, right outer, full outer, left anti, right anti)
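To make the five join types listed above concrete, here is a small Python sketch. It is not how Power Query is implemented. It assumes that each table is a list of dictionaries and that the key is unique within each table, and the function name `merge` and the sample tables are hypothetical.

```python
def merge(left, right, key, kind="left outer"):
    """Merge two lists of dict rows on `key` (assumed unique in each table)."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    if kind == "left outer":      # every row of the first table
        keys = list(left_by_key)
    elif kind == "right outer":   # every row of the second table
        keys = list(right_by_key)
    elif kind == "full outer":    # rows of both tables
        keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]
    elif kind == "left anti":     # rows found only in the first table
        keys = [k for k in left_by_key if k not in right_by_key]
    elif kind == "right anti":    # rows found only in the second table
        keys = [k for k in right_by_key if k not in left_by_key]
    else:
        raise ValueError(f"unknown join kind: {kind}")
    # Combine the matching rows; a missing side contributes no columns.
    return [{**left_by_key.get(k, {}), **right_by_key.get(k, {})} for k in keys]

# Hypothetical tables: prices for products 1-3, purchases for products 2-4.
prices = [{"product_id": 1, "price": 2.0}, {"product_id": 2, "price": 3.5},
          {"product_id": 3, "price": 1.0}]
bought = [{"product_id": 2, "qty": 4}, {"product_id": 3, "qty": 1},
          {"product_id": 4, "qty": 7}]
for kind in ("left outer", "right outer", "full outer", "left anti", "right anti"):
    print(kind, [row["product_id"] for row in merge(prices, bought, "product_id", kind)])
```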


Create a file “Queries_Practice” with 2 worksheets. The workbook must contain 2 or more worksheets.
1. Click the Data tab.
2. Click Get Data.
3. Select “Launch Query Editor”.
4. In the Query Editor, select “New Source”.
5. Select “Excel”.
6. Import the Excel data from “Queries_Practice” or from different files.
7. In the Navigator, select your first worksheet “Sheet1” and click OK.
8. Repeat New Source to get “Sheet2”.
9. Click the “Append Queries” tab.
10. In the Append dialog box, select the second worksheet and click OK.
11. Click “Close & Load”.


Note: when the original data sources are updated (e.g. new observations are added), the appended/merged table is also updated after refreshing.


Importing Data from Web
Record a macro:
1. Copy the URL https://www.google.com/finance
2. Click the Developer tab.
3. Click Record Macro.
4. Enter the macro name: Stock.
5. Click OK.

Import the web table:
1. Click the Data tab.
2. Click From Web.
3. Copy & paste the webpage address and press Go.
4. Click No if an error message pops up.
5. Click on the “->” to select the table; it changes to a tick.
6. Click Import.

Set the data properties:
1. Select the first cell to store the data.
2. Click Properties.
3. Set the refresh interval.
4. Select the row condition.
5. Click OK.

The data is refreshed after 1 minute.
Save the file as Stock.xlsm.

Extract Data from MS Access
Open the Access file: olympicmedals_Data Analytics.accdb
1. Click the Data tab.
2. Click From Access.
3. Select an Access file and click Open.
4. Select Table and click OK.
5. To import more than one, select multiple worksheets and click OK.

Collect Data using Google Form
 In any Google Sheet, click the Insert tab, followed by Form.
 A new sheet labeled “Form responses” is created. This sheet stores the data collected from the Google Form.
 To build the Google Form, click the Form tab and select Edit Form.
 An empty Google Form will appear; create your desired form in it.
 Refer to “The ultimate guide to google sheets” for details on creating a Google Form.

Syncing data between Google Sheet and Excel
 Create a Google Sheet file “Master 1” and place it in My Drive.
 In your Google Sheet, click File, then Publish to the web.


 In the Publish to the web dialog box:
   Link to Microsoft Excel.
   Check “Automatically republish when changes are made”.
   Click “Published” and copy the website link.

 Open a new workbook in Excel 2010 or 2016.
 Click the Data tab.
 Choose “Get the data from web”.
 Paste the copied website link into the URL bar and click OK.
 Select the level of the URL and click Connect.


 Select the worksheet and click Load.
 A table is generated in Excel.
 Click the Design tab, then choose Properties.
 Set the external data properties to open the Query Properties dialog box.
 Check all the boxes and set the refresh interval.
 Click OK.

The Excel data is updated automatically (based on the refresh interval given) when the Google Sheet data changes.


Share Google Sheet
 In the Google Sheet, click the Share button at the top right corner.
 In the Share dialog box, enter the email addresses of the people you want to share the Google Sheet with. You may also add a note.
 Ensure that the invited person can edit the Google Sheet.
 Click Send, then Done.

Limitations of Google Sheet
 Each Google Spreadsheet has a limit of 400,000 cells, with a
maximum of 256 columns per sheet. There are also other
limitations:
 Number of Formulas: 40,000 cells containing formulas
 Number of Tabs: 200 sheets per workbook
 GoogleFinance formulas: 1,000 GoogleFinance formulas
 GoogleLookup formulas: 1,000 GoogleLookup formulas
 ImportRange formulas: 50 cross-workbook reference formulas
 ImportData, ImportHtml, ImportFeed, or ImportXml formulas: 50
functions for external data



Data Cleansing and Management


VLookup: combine worksheets with different structure
Files used: Supermarket_purchase_All, Supermarket_price, Supermarket_distance


Combine Purchase data and Price data based on Product ID by adding new columns “unit price” and “total price” to the purchase data.
1. Open the purchase and price files.
2. Write the Unit Price and Total Price headers.
3. Write the VLookup function in E2:
=VLOOKUP(B2,[Supermarket_price.xlsx]Price!$A$2:$B$4568,2,FALSE)
4. In F2, write =E2*D2
5. Copy E2 & F2 and paste them to the remaining cells.
Syntax: =VLOOKUP(lookup_value,table_array,col_index_num,range_lookup)

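What the exact-match VLOOKUP (range_lookup = FALSE) and the =E2*D2 formula compute can be sketched in Python as follows. The helper `vlookup` is a hypothetical stand-in written for this illustration, not an Excel API, and the price and purchase rows are made-up sample values.

```python
def vlookup(lookup_value, table, col_index, default=None):
    """Exact-match lookup: return column `col_index` (1-based) of the first
    row whose first column equals `lookup_value`."""
    for row in table:
        if row[0] == lookup_value:
            return row[col_index - 1]
    return default  # Excel would show #N/A for a missing key

# Hypothetical sample rows.
price_table = [(101, 2.50), (102, 0.99)]  # product id, unit price
purchases = [(1, 101, 1, 3), (1, 102, 1, 2), (2, 103, 2, 1)]  # customer, product, shop, quantity

enriched = []
for customer_id, product_id, shop_id, quantity in purchases:
    unit_price = vlookup(product_id, price_table, 2)
    # Total price = unit price * quantity, left empty when the product is unknown.
    total_price = None if unit_price is None else unit_price * quantity
    enriched.append((customer_id, product_id, shop_id, quantity, unit_price, total_price))
print(enriched)
```

Product 103 has no entry in the price table, so its unit price and total price stay empty, which is the situation Excel reports as #N/A.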
Combine Purchase data and Distance data based on Customer ID and Shop ID.
1. Write Concat and Distance as headers.
2. Concatenate customer ID and shop ID in the Purchase file:
=TEXTJOIN("A",TRUE,A2,C2)
3. Concatenate customer ID and shop ID in the Distance file in the same way.
4. Write the VLookup function in H2:
=VLOOKUP(G2,[Supermarket_distances.xlsx]Sheet1!$C$2:$D$301831,2,FALSE)
5. Copy G2 & H2 and paste them to the remaining cells.
Syntax: =TEXTJOIN("delimiter",ignore_empty,text1,text2)

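The concatenated key used above can be illustrated with a short Python sketch. The function name `make_key` and all sample rows are assumptions made for this illustration.

```python
def make_key(customer_id, shop_id, delimiter="A"):
    """Join the two ids into one lookup key, in the spirit of TEXTJOIN."""
    return f"{customer_id}{delimiter}{shop_id}"

# Hypothetical rows: (customer_id, shop_id, distance in meters).
distances = [(1, 1, 350.0), (1, 2, 1200.0), (2, 1, 80.0)]
distance_by_key = {make_key(c, s): d for c, s, d in distances}

# Hypothetical purchase rows: (customer_id, product_id, shop_id, quantity).
purchases = [(1, 101, 2, 3), (2, 102, 1, 1), (3, 101, 1, 2)]
with_distance = [
    row + (distance_by_key.get(make_key(row[0], row[2])),)  # None when no match
    for row in purchases
]
print(with_distance)
```

The delimiter matters: without it, customer 1 with shop 12 and customer 11 with shop 2 would both produce the key "112" and could be matched to the wrong distance.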
Merge/Combine worksheets using Query
1. Click the Data tab.
2. Click Get Data.
3. Select Combine Queries if there are connections, or Launch Query Editor.
4. Click Merge.
5. Select the two files to merge, define the reference columns, and select the join kind.
6. Click on the arrows of the created column.
7. Select the columns from the second file to merge into the first file.
8. Select Expand or Aggregate for the second file.
9. Click OK.
10. Click “Close & Load”.

Merged Worksheet


Create Pivot Table
1. Click the Insert tab.
2. Click PivotTable.
3. Select range A1:C10.
4. Create the pivot table in a new worksheet.
5. Click OK.
6. Put your cursor in any cell within the pivot table.
7. Select Users as Rows.
8. Select Perms as Columns.
9. Select Value as Sum of Values.
10. Double click on the Grand Total cell “186” to get the database table.


De-pivot Table
1. Press Alt + D + P.
2. Select “Multiple consolidation ranges” and click Next.
3. Select “Create a single page field for me” and click Next.
4. Select range G1:J4 and click Add.
5. Click Finish, followed by No.
6. Double click on the Grand Total cell “186”.
7. Change the header names and delete the Page1 column.
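The relation between the long (database) layout and the pivoted layout can be sketched in Python. The column names Users, Perms and Value follow the slides, but the records below are hypothetical and the two helper functions are written only for this illustration.

```python
from collections import defaultdict

def pivot(records):
    """Turn (row_label, column_label, value) records into a nested dict
    {row_label: {column_label: sum of values}}."""
    table = defaultdict(lambda: defaultdict(float))
    for row_label, column_label, value in records:
        table[row_label][column_label] += value
    return {row: dict(columns) for row, columns in table.items()}

def depivot(table):
    """Flatten a pivoted table back into sorted (row, column, value) records."""
    return sorted((row, column, value)
                  for row, columns in table.items()
                  for column, value in columns.items())

# Hypothetical (Users, Perms, Value) records.
records = [("Ann", "read", 5), ("Ann", "write", 3), ("Bob", "read", 7), ("Ann", "read", 1)]
wide = pivot(records)
long_again = depivot(wide)
print(wide)
print(long_again)
```

De-pivoting recovers one record per (row, column) cell of the pivot table. Records that were summed into the same cell, such as the two ("Ann", "read") rows, come back as a single aggregated record.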


Exercise
Generate a PivotTable for each of the following files:
1) Supermarket_purchase_1
2) Supermarket_purchase_2
3) Supermarket_purchase_Combined345

For each customer (customer_id), summarize the total quantity purchased for all products, and the number of visits to the shop.


Consolidate data in multiple worksheets by position
 Use this method when the data in the source areas is arranged in the same order and uses the same labels, for example to consolidate data from a series of worksheets, such as departmental budget worksheets, that have been created from the same template.


1. Ensure all files have the same format: each range of data must be in list format with the same layout, so that each column has a label in the first row and contains similar data, and there are no blank rows or columns within the list.
2. Open a new worksheet.
3. Click the Data tab.
4. Click Consolidate.
5. Select the type of Function.
6. Browse and select the cell area from the file.
7. Click Add.
8. Repeat to add more worksheets.
9. Check the labels boxes.
10. Check “Create links to source data” if you want Excel to update the consolidated table automatically when the source data changes.
11. Click OK.


Group Columns
Worksheets with a lot of content can sometimes feel overwhelming and even become difficult to read. Fortunately, Excel can organize data in groups, allowing you to easily show and hide different sections of your worksheet.
1. Select the columns, E and F.
2. Click the Data tab.
3. Click Group.
4. Select Columns and click OK.
5. Click Ungroup to remove the grouping.


Subtotal
1. Sort and select the data.
2. Click the Data tab.
3. Click Subtotal.
4. Select the grouping criteria.
5. Select the function used and the variable.
6. Select the “replace” and “summary” checkboxes and click OK.

The subtotals are inserted as new rows below each observation group. E.g. 803.968 is the sum of total prices for customer 1.


Remove Duplicates
When 2 or more rows have the same values, this function removes the duplicated rows and keeps only the first one.
1. Locate the active cell.
2. Click the Data tab.
3. Click Remove Duplicates.
4. Select the features.
5. Click OK.
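The "keep the first row, drop later duplicates" rule can be sketched in Python as follows. The function name `remove_duplicates` and the sample rows are illustrative assumptions. The `key_columns` parameter plays the role of the feature selection in the dialog.

```python
def remove_duplicates(rows, key_columns=None):
    """Keep the first occurrence of each row and drop later duplicates.
    `key_columns` lists the column indexes compared (all columns if None)."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row) if key_columns is None else tuple(row[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows

# Hypothetical rows: (customer_id, product_id, quantity).
rows = [(1, 101, 3), (2, 102, 1), (1, 101, 3), (1, 101, 5)]
all_columns = remove_duplicates(rows)
id_columns_only = remove_duplicates(rows, key_columns=[0, 1])
print(all_columns)
print(id_columns_only)
```

Comparing all columns removes only the exact repeat, while comparing just the first two columns also drops the row whose quantity differs.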


Outlier Treatment
 An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
 Reasons for outliers: data errors, sampling error, standardisation failure, faulty distributional assumptions, human error, genuine outliers.
 What do we do about outliers? Learn as much as you can about the "story" behind the data and understand why there is an outlier. Is it an error? Is it something we should expect to see in this kind of data? It is NOT acceptable to drop an observation just because it is an outlier. Outliers can be legitimate observations and are sometimes the most interesting ones.


 If it is obvious that the outlier is due to incorrectly entered or measured data, you should rectify or drop the outlier.
 Otherwise, retain the outlier and:
   Try a transformation. Square root and log transformations both pull in high numbers. This can make assumptions work better if the outlier is in the dependent variable, and can reduce the impact of a single point if the outlier is in an independent variable.
   Try a different model. This should be done with caution, but it may be that a non-linear model fits better. For example, in example 3, perhaps an exponential curve fits the data with the outlier intact.


[Figure: outlier treatment flowchart.
Raw data -> identify potential outliers -> verify source of variation.
 Incorrect entry/errors: cannot be rectified, or can be rectified.
 Random variation/true observation: transform data, or use robust methods.
 Outcomes: remove the observations, or keep and study the observations.]


Methods to Identify Outliers
Graphical
 Histogram
 Boxplot
 Scatter plot
 Normal probability plot
Analytical
 IQR: limits (Q1-1.5xIQR, Q3+1.5xIQR) and (Q1-3.0xIQR, Q3+3.0xIQR)
 Various outlier detection tests.
 For categorical data, use the Filter tab in Excel.
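The IQR limits above can be computed with a few lines of Python. This is a sketch under stated assumptions: quartiles use the "inclusive" method of the standard library (the same interpolation as Excel's PERCENTILE.INC), values beyond the 1.5xIQR limits are labelled outliers and values beyond the 3.0xIQR limits are labelled extreme, and the data values are hypothetical.

```python
import statistics

def iqr_fences(values):
    """Return the quartiles, the IQR and the 1.5xIQR and 3.0xIQR limits."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1, "q3": q3, "iqr": iqr,
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def flag_outliers(values):
    """Map each suspicious value to 'outlier' or 'extreme'."""
    fences = iqr_fences(values)
    (inner_low, inner_high), (outer_low, outer_high) = fences["inner"], fences["outer"]
    flagged = {}
    for value in values:
        if value < outer_low or value > outer_high:
            flagged[value] = "extreme"
        elif value < inner_low or value > inner_high:
            flagged[value] = "outlier"
    return flagged

# Hypothetical measurements with two unusually large values.
data = [10, 12, 12, 13, 14, 15, 16, 17, 18, 26, 40]
fences = iqr_fences(data)
flagged = flag_outliers(data)
print(fences)
print(flagged)
```

Other quartile conventions (for example the "exclusive" method) can shift the limits slightly, so a borderline value may be flagged under one convention and not under another.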


Missing Value Imputation
 Missing value imputation is the process of replacing missing data with substituted values.
 Because missing data can create problems for analyzing data, imputation is seen as a way to avoid pitfalls.
 The alternative is listwise deletion: discarding any case that has a missing value, which may introduce bias or affect the representativeness of the result.


Missing Value Imputation
 Some of the missing value imputation techniques are:
   Mean (or zero, median or mode): the mean of the observed values for that variable.
   Substitution: the value from a new individual who was not selected to be in the sample.
   Hot deck: a randomly chosen value from an individual who has similar values on other variables.
   Cold deck: a systematically chosen value from an individual who has similar values on other variables.
   Regression: the predicted value obtained by regressing the missing variable on other variables.
   Stochastic regression: the predicted value from a regression plus a random residual value.
   Interpolation and extrapolation: an estimated value from other observations from the same individual.
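The simplest of these techniques (mean, median, mode or zero substitution) can be sketched in Python. Missing values are assumed to be marked as `None`. The function name `impute` and the sample values are hypothetical.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries by the mean, median, mode or zero
    computed from the observed (non-missing) values."""
    fillers = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
        "zero": lambda observed: 0,
    }
    if strategy not in fillers:
        raise ValueError(f"unknown strategy: {strategy}")
    observed = [v for v in values if v is not None]
    fill = fillers[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical variable with two missing observations.
ages = [23, None, 31, 23, None, 40]
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "mode"))
```

Every missing entry receives the same value, so the variance of the imputed variable is smaller than that of the observed values. This is one reason the regression-based techniques are sometimes preferred.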


Data Normalization
 What is data normalization?
Data normalization transforms data (or attributes) with various units of measurement into unit-free values.
Since the range of values of raw data varies widely, some predictive models will not give optimum results without normalization.

Predictive models that are affected by data normalization
 K-nearest neighbours with a Euclidean distance measure
 K-means
 Logistic regression, SVM, neural networks, etc.
 Linear discriminant analysis, PCA


Data Normalization



Example
 A sample data set of mobile user profiles is used to find the mobile users who have similar profiles, that is, similar use of the phone based on the call and SMS logs.


 The Euclidean distance matrix for all users is

 We conclude that user 1 and user 4 have the most similar profiles since the distance between them is about 2000, which is the smallest of all.

 However, if you look at the list again, you can see that the Euclidean distance values are very close to the differences in call duration. E.g. the difference between the call durations of users 1 and 4 is 2000, and the Euclidean distance is 2000.30303.
 So this approach seems to be flawed: the Euclidean distances are dominated by the call duration.


 Applying min-max normalization to the data set yields

 The new distance matrix is


 Compare the Euclidean distances before and after normalization. The distances are no longer dominated by the call duration attribute, and they make sense now.
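The effect described in this example can be reproduced with a short Python sketch. It assumes the usual min-max definition x' = (x - min) / (max - min), applied column by column. The four user profiles below are hypothetical values chosen for illustration and are not the data shown on the slides.

```python
import math

def min_max_normalize(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple((value - low) / (high - low) if high > low else 0.0
              for value, low, high in zip(row, lows, highs))
        for row in rows
    ]

def euclidean(a, b):
    """Euclidean distance between two equally long tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical profiles: (call duration, number of SMS).
users = [(25000, 24), (40000, 27), (55000, 32), (27000, 25)]
raw_distance = euclidean(users[0], users[3])
normalized = min_max_normalize(users)
normalized_distance = euclidean(normalized[0], normalized[3])
print(round(raw_distance, 5), round(normalized_distance, 5))
```

Before normalization, the distance between the first and the last user is almost exactly the difference in call duration (2000). After normalization, both attributes contribute on the same 0 to 1 scale.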


Topic 3: Descriptive Analytics


Descriptive Analytics
 Descriptive analysis
 Frequency table & cross-tabulation
 Relationship & correlation analysis
 Principal component analysis



Measure of Central Tendency
 Mean: average for a set of data.
 Mode: most frequent score in a set of data.
 Median: middle score for a set of data.

Example: find the mean, mode and median for the dataset 3, 6, 3, 2, 5, 4, 7, 6, 3, 9, 1
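One way to check the answer to this example is the Python standard library. This is an illustration outside Excel, where the corresponding worksheet functions would be used instead.

```python
import statistics

data = [3, 6, 3, 2, 5, 4, 7, 6, 3, 9, 1]

mean = statistics.mean(data)      # sum of the values divided by n: 49 / 11
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

print(round(mean, 4), median, mode)
```

The sorted data are 1, 2, 3, 3, 3, 4, 5, 6, 6, 7, 9, so the middle (6th) value is 4 and the most frequent value is 3.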


Measure of Dispersion
 Variance (or standard deviation)
 Range
 Interquartile range

Example: find the standard deviation, range and IQR for the dataset 3, 6, 3, 2, 5, 4, 7, 6, 3, 9, 1

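The dispersion measures for this dataset can be checked with the sketch below. It assumes the sample standard deviation (divisor n - 1) and quartiles computed with the "inclusive" method, which uses the same interpolation as Excel's PERCENTILE.INC. Other quartile conventions may give a different IQR.

```python
import statistics

data = [3, 6, 3, 2, 5, 4, 7, 6, 3, 9, 1]

std_dev = statistics.stdev(data)     # sample standard deviation, divisor n - 1
value_range = max(data) - min(data)  # maximum value minus minimum value
q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                        # interquartile range

print(round(std_dev, 4), value_range, iqr)
```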
Measure of Shape
 Skewness: measures the asymmetry of the shape (amount and direction of skew)
 Kurtosis: measures how tall and sharp the central peak is

 Negatively/left skewed: Mean < Median < Mode
 Symmetrical: Mean = Median = Mode
 Positively/right skewed: Mean > Median > Mode

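Generating statistic values
[Figure: annotated distributions marking +ve skewness, -ve skewness, kurtosis, minimum, mean, maximum, standard deviation and range.]
Minimum value: the smallest value
Maximum value: the largest value
Range = maximum value - minimum value
\[ \text{mean} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
\[ \text{variance} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2 \]
\[ \text{std\_dv} = \sqrt{\text{variance}} \]
\[ se(\text{mean}) = \frac{\text{std\_dv}}{\sqrt{n}} \]
\[ \text{skewness} = \left[\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^3\right]\left[\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2\right]^{-3/2} \]
\[ se(\text{skewness}) = \sqrt{\frac{6n(n-1)}{(n-2)(n+1)(n+3)}} \]
\[ \text{kurtosis} = \left[\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^4\right]\left[\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2\right]^{-2} - 3 \]
\[ se(\text{kurtosis}) = 2\, se(\text{skewness})\sqrt{\frac{n^2-1}{(n-3)(n+5)}} \]
A normal distribution has a skewness of 0 and a kurtosis of 3 (before the 3 is subtracted; the formula above therefore gives 0 for a normal distribution).
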
Add Analysis ToolPak to Excel



Descriptive Statistics in Excel
Use “=PERCENTILE.INC(data range, k)”, where 0 ≤ k ≤ 1 and the result is the (k×100)-th percentile.

[Figure: histogram of Salary. x-axis: Salary, with bins 5640, 14520, 23400, 32280, 41160, 50040, 58920, 67800, 76680, 85560 and More. y-axis: Frequency. The labelled frequencies include 169 and 32; the remaining bins hold 0 or 1 observations.]


Topic 5: Data Visualization


Data Visualization
 Some good practices in designing charts
 Selection of charts
 Creating basic charts
 Building dashboard



Making Graphs and Diagrams
 Histograms
 Boxplot
 Scatter plots
 Bar charts
 Pie charts
 Normal Probability Plot


Histograms
 A histogram shows the frequency distribution (pattern) of a set of continuous data based on different classes.
 The normal distribution curve is superimposed on the graph as a guide to determine whether the distribution of the data matches the characteristics of a normal curve.




[Figure: histogram of Salary with an outlier marked; x-axis: Salary; y-axis: Frequency]


Boxplot (Box & Whiskers Plot)
 It is a pictorial representation of the data distribution, showing the median, quartiles, outliers and extreme values.
 It makes it easy to compare more than one variable.

Line Graphs                               Boxplot
Messy and untidy                          Clear and tidy
Difficult to compare                      Easy to compare
Less information: only simple trends      More information: median, minimum,
and fluctuations of data                  maximum, inter-quartile range, etc.


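[Figure: anatomy of a boxplot, labelled from top to bottom]
 Extreme value: more than 3.0×IQR above the 3rd quartile
 Outlier: between 1.5×IQR and 3.0×IQR above the 3rd quartile
 Maximum value
 Whisker
 3rd quartile (75%)
 Inter-quartile range (IQR): from the 1st quartile to the 3rd quartile
 Median
 1st quartile (25%)
 Minimum value (including outlier & extreme values, if they exist)
 Outlier: between 1.5×IQR and 3.0×IQR below the 1st quartile
 Extreme value: more than 3.0×IQR below the 1st quartile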
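The 1.5×IQR and 3.0×IQR rules can be sketched as follows. This is a minimal illustration: the quartile method (`statistics.quantiles` with `method="inclusive"`) and the choice to end the whiskers at the most extreme values that are neither outliers nor extreme values are assumptions, and the data are made up.

```python
from statistics import quantiles

def boxplot_summary(data):
    """Quartiles, IQR, whisker ends, outliers and extreme values."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # assumes iqr > 0

    def box_distance(x):
        # Distance from the box, measured in multiples of the IQR
        if x > q3:
            return (x - q3) / iqr
        if x < q1:
            return (q1 - x) / iqr
        return 0.0

    inside = [x for x in data if box_distance(x) <= 1.5]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [x for x in data if 1.5 < box_distance(x) <= 3.0],
        "extremes": [x for x in data if box_distance(x) > 3.0],
    }

values = [10, 12, 13, 14, 15, 16, 17, 18, 30, 60]   # illustrative values only
summary = boxplot_summary(values)
print(summary)
```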
Excel:
 Create a 2-D Line chart with Markers.
 Add data series Q1 (25%) from N6, Median (50%) from N7, and Q3 (75%) from N8.
 Convert the Median series to a scatter plot with Markers.
 Create the box: select series Q3, go to the Layout menu and choose the Up/Down Bars tab. Delete the markers for Q3 and Q1.
 Create the whiskers: select series Q3 (and Q1), go to the Layout menu and choose the Error Bars tab. Click More Error Bars Options to open the dialog box. Set the vertical error bars to Plus direction with Cap end style. For the error amount, choose Custom and click the Specify Value button to open a small dialog box. Select the Positive Error Value from N11.
 Create the median line: select the Median series, go to the Layout menu, choose the Error Bars tab and then More Error Bars Options. In the Horizontal Error Bars dialog box, choose Both direction and No Cap end style, and fix the error amount at 0.2.
[Figure: boxplot with an outlier marked]


Simple and Overlay Scatter Plots
 A scatter plot displays the relationship between two variables.
 A simple scatter plot involves a dependent variable and an ungrouped independent variable.


[Figure: scatter plot with an outlier marked; x-axis from 0 to 70, y-axis from 0 to 100,000]


Bar Graphs: Simple
Use the “Filter” function in Excel to determine the number of categories.

[Figure: bar chart of Gender; Male: 148, Female: 54, Lelaki: 1]
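The same category check can be sketched in Python. The list below is rebuilt from the counts shown in the Gender chart rather than taken from the actual data file, and the helper name `category_counts` is hypothetical. "Lelaki" is Malay for "male"; treating it as a mislabelled "Male" entry is an assumption about what the chart is meant to illustrate.

```python
from collections import Counter

def category_counts(values, aliases=None):
    """Count each category, optionally mapping alternative spellings to one label."""
    aliases = aliases or {}
    cleaned = [aliases.get(v.strip(), v.strip()) for v in values]
    return Counter(cleaned)

# Rebuilt from the chart counts: Male 148, Female 54, Lelaki 1
gender = ["Male"] * 148 + ["Female"] * 54 + ["Lelaki"]

raw = category_counts(gender)
merged = category_counts(gender, aliases={"Lelaki": "Male"})

print(len(raw), dict(raw))        # three categories before cleaning
print(len(merged), dict(merged))  # two categories after merging the alias
```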


Pie Charts



Line Graphs

 A line graph should be limited to displaying time series data.
 X-axis: time in seconds, minutes, hours, days, weeks, etc.
 Y-axis: the values of the variable.


Checking Data Normality
 Many statistical methods assume the data have a normal or nearly normal distribution.
 Checking the assumption of normality:

Method                         Assessment
Histogram                      Bell-shaped
Stem-and-Leaf Diagram          Bell-shaped
Skewness, Kurtosis             Skewness = 0, Kurtosis = 3
Boxplot / Box & Whisker plot   No outliers, symmetric
Normal Q-Q Plot                Data close to the trend line
Detrended Normal Q-Q Plot      Data randomly located along the line of zero mean
Kolmogorov-Smirnov Test        Small K-S statistic, not significant at 0.05
Shapiro-Wilk Test              W statistic close to 1, not significant at 0.05
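As an illustration of the Kolmogorov-Smirnov row, the K-S statistic is the largest gap between the empirical cumulative distribution of the data and the normal cumulative distribution. The sketch below fits the normal with the sample mean and standard deviation and computes the statistic only. It does not compute a significance level, because the usual K-S tables do not apply when the parameters are estimated from the same data. The function name and both samples are made up.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic(data):
    """Largest distance between the empirical CDF and a fitted normal CDF."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Illustrative samples: one taken from normal quantiles, one strongly skewed
near_normal = [NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]

d_normal = ks_statistic(near_normal)
d_skewed = ks_statistic(skewed)
print(f"near-normal sample: D = {d_normal:.3f}")
print(f"skewed sample:      D = {d_skewed:.3f}")
```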
Normal Probability Plot
[Figure: normal probability plot, dependent variable: ringgit; x-axis: Observed Cum Prob; y-axis: Expected Cum Prob]
 Ideal (normally distributed): the points lie approximately along a straight line.
 Heavy/long-tailed distribution: the points show a sharp upward and downward curve at both extremes.
 Light/short-tailed distribution: the points flatten at the extremes.
 Negative/left skewed and positive/right skewed distributions [figures not reproduced]
Normal Probability (QQ) Plot



[Figure: Normal Probability Plot (Salary); x-axis from -4 to 4, y-axis from 0 to 1.2]

Salary has a light-tailed distribution.
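The coordinates of a normal Q-Q plot can be sketched without a plotting library. The example assumes plotting positions of (i - 0.5)/n, which is one of several conventions in use, and it uses the correlation between theoretical and observed quantiles as a rough measure of how straight the plot is. The function names and the data are made up and are not the Salary data from the slides.

```python
import math
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Pairs of (theoretical normal quantile, standardized observed value)."""
    xs = sorted(data)
    n = len(xs)
    m, s = mean(xs), stdev(xs)
    std_normal = NormalDist()
    return [
        (std_normal.inv_cdf((i - 0.5) / n), (x - m) / s)
        for i, x in enumerate(xs, start=1)
    ]

def straightness(points):
    """Pearson correlation of the Q-Q points; values near 1 suggest a straight line."""
    tx = [p[0] for p in points]
    ty = [p[1] for p in points]
    mx, my = mean(tx), mean(ty)
    sxy = sum((a - mx) * (b - my) for a, b in zip(tx, ty))
    sxx = sum((a - mx) ** 2 for a in tx)
    syy = sum((b - my) ** 2 for b in ty)
    return sxy / math.sqrt(sxx * syy)

roughly_normal = [48, 52, 55, 57, 60, 61, 63, 66, 70, 75]   # illustrative values
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]                   # illustrative values

r_normal = straightness(qq_points(roughly_normal))
r_skewed = straightness(qq_points(skewed))
print(f"roughly normal: r = {r_normal:.3f}")
print(f"skewed:         r = {r_skewed:.3f}")
```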
