Introduction and ETL
and Visualization
Master of Mathematics,
DMAS, UTAR
Business/ Interface/
Public Users
domains
Prepared by: Dr. Chang Yun Fah 7
Skill Sets of a Data Scientist
[Venn diagram: Data Science sits at the intersection of Computer Science, Maths & Statistics, and Domain Knowledge. The pairwise overlaps are labelled Machine Learning, IoT & Data Processing, and Traditional Research.]
Analytics
Data Analysis: Insights
Looking at historical data uncovers insights into what worked, what did not,
and what can be expected from a product or service.
Business Context:
Helps organizations use the potential of their data to identify new
opportunities and helps businesses outline the way forward.
• Skills
• Technologies
• Applications
• Practices
• Continuous and iterative exploration and investigation
• Gain insights and drive planning
Analytics Hierarchy Chart
Analytics
Business Advanced
Intelligence Analytics
Data Infrastructure
RDBMS, Hadoop, Text Indexing, NoSQL, Files
Enable the Developer tab: open Options, choose Customize Ribbon, and under Main Tabs check Developer.
The Developer tab groups its options into 5 categories.
1. Code
Visual Basic: launches the VB Editor.
Macros: displays the Macro dialog, where you can choose to run or edit a macro.
Record Macro: begins recording a macro.
Use Relative References: toggles between relative and absolute recording.
Relative: Excel records that you moved down 3 cells. Absolute: Excel records
that you selected cell A4.
Macro Security: configures the macro security settings.
2. Add-Ins
Add-Ins: enables Excel add-ins such as the Solver.
COM Add-Ins: enables COM (Component Object Model) add-ins, which are useful for
developers.
3. Controls
Adds user-interface controls, such as ActiveX controls, to your Excel
spreadsheet. You can switch between Code and Design Mode, edit a control's
Properties, and run a Dialog.
4. XML
Contains options for working with XML in Excel: write XML source code, import
and export XML files, and map and refresh your XML data.
5. Modify
Contains the Document Panel option only.
Macro Settings: select "Enable all macros" and check “Trust access to the VBA project object model”.
Excel Add-Ins
Check Analysis ToolPak, Analysis ToolPak - VBA, and Solver Add-In.
• Analysis ToolPak: provides data analysis tools for statistical and engineering analysis.
• Analysis ToolPak – VBA: VBA functions for Analysis ToolPak.
Install the Real Statistics Resource Pack:
1. Browse the Real Statistics Resource Pack webpage and choose Free Download.
2. Save “RealStats.xlsm” in the folder C:\Users\user-name\AppData\Roaming\Microsoft\AddIns
3. In the Developer tab, click Add-Ins and open “RealStats” from the folder.
4. Click OK and check “Realstats”.
5. The Add-Ins tab is created, with a Real Statistics entry.
Note: You may need to enable the add-ins every time you start Excel. This is a
common problem with Excel 2013.
Data Source: Pennacchioli, D., Coscia, M., Rinzivillo, S., Pedreschi, D. and Giannotti,
F., Explaining the Product Range Effect in Purchase Data. In BigData, 2013.
Understand the data
File name: supermarket_prices
Two columns. The first column is the product id. The second column is its unit
price. The price is in euros and is calculated as the average unit price over
the time span of the dataset.
Import a text file:
1. Click From Text.
2. Select your text file and click Import.
3. Set the row at which to start the import and click Next.
4. Select the delimiter and click Next.
5. Choose the column data format and click Finish.
6. Define the data location and click OK.
Extract supermarket_distance.csv to
xlsx format.
Understand the data
File name: supermarket_purchases
Four columns. The first column is the customer id, the second is the
product id, the third is the shop id and the fourth is the total quantity
of the product that the customer bought in that particular shop.
The data was recorded from January 2007 to December 2011.
1. Click Insert.
2. Click Module.
3. Write the given code.
4. Press F5 to execute the code.
Note: in the code, A1048576 refers to Column A and the last row (1048576) of the worksheet.
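Sub Combine()
    ' Stack the data of every worksheet onto a new first sheet named "Combined".
    Dim J As Long
    Dim Src As Range
    Dim Dest As Worksheet

    ' Add the destination sheet in front of all existing sheets.
    Set Dest = Worksheets.Add(Before:=Sheets(1))
    Dest.Name = "Combined"

    ' Copy the header row once, from the first data sheet.
    Sheets(2).Rows(1).Copy Destination:=Dest.Range("A1")

    For J = 2 To Sheets.Count
        Set Src = Sheets(J).Range("A1").CurrentRegion
        ' Skip sheets that hold only a header row: there is nothing to append.
        If Src.Rows.Count > 1 Then
            ' Append the data rows below the last used cell in column A.
            Src.Offset(1, 0).Resize(Src.Rows.Count - 1).Copy _
                Destination:=Dest.Range("A1048576").End(xlUp).Offset(1, 0)
        End If
    Next J
End Sub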
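The append logic of the macro above (one header row, then the data rows of every sheet) can be cross-checked outside Excel. The following is a minimal Python sketch, not part of the original workflow; the function name `combine` and the sample rows are hypothetical, and each sheet is assumed to be a list of rows whose first row is the header.

```python
def combine(sheets):
    """Stack the rows of several sheets under a single header row."""
    if not sheets:
        return []
    combined = [sheets[0][0]]      # header row, taken once from the first sheet
    for rows in sheets:
        combined.extend(rows[1:])  # data rows only; a header-only sheet adds nothing
    return combined


# Hypothetical sample data: two sheets that share the same three columns.
sheet1 = [["customer_id", "product_id", "quantity"], [1, 10, 2], [2, 11, 1]]
sheet2 = [["customer_id", "product_id", "quantity"], [3, 10, 5]]

combined = combine([sheet1, sheet2])
for row in combined:
    print(row)
```

As in the macro, a sheet that contains only its header contributes no rows.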
Append worksheets using Query:
1. Click Get Data.
2. Select “Launch Query Editor”.
3. Select “Excel”.
4. From the Navigator, select your first worksheet “Sheet1” and click OK.
5. Repeat New Source to get “Sheet2”.
6. Click the “Append Queries” tab.
7. From the Append dialog box, select the second worksheet and click OK.
8. Click “Close & Load”.
9. Click OK.
https://www.google.com/finance
Record a macro:
1. Open the Developer tab.
2. Click Record Macro.
3. Enter the Macro name: Stock
4. Click OK.
Import a table from the web:
1. Open the Data tab.
2. Click From Web.
3. Copy & paste the webpage address and press Go.
4. Click No if an error message pops up.
5. Click on the “->” to select the table; it changes to a tick.
6. Click Import.
7. Click Properties.
8. Set the refresh interval.
9. Select the row condition.
10. Click OK.
After 1 minute
Import from Access:
1. Click From Access.
2. Select an Access file and click Open.
3. Select multiple worksheets and click OK.
4. Select Table and click OK.
Collect Data using Google Form
In any Google Sheet, click on the Insert tab, followed by Form.
A new sheet labeled “Form responses” is created. This sheet stores the data
collected from the Google Form.
To build the Google Form, click on the Form tab and select Edit Form.
An empty Google Form appears; create your own form there.
Refer to “The ultimate guide to google sheets” for the details on creating a
Google Form.
Syncing data between Google Sheet and Excel
Create a Google Sheet file “Master 1” and place it in My Drive.
1. Write the Unit Price & Total Price headers.
2. Write the VLOOKUP function in E2:
=VLOOKUP(lookup_value,table_array,col_index_num,range_lookup)
=VLOOKUP(B2,[Supermarket_price.xlsx]Price!$A$2:$B$4568,2,FALSE)
3. In F2, write:
=E2*D2
4. Copy E2 & F2 and paste them to the remaining cells.
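The exact-match lookup (range_lookup = FALSE) followed by the Total Price multiplication can be sketched in Python as a dictionary lookup. This is only an illustration under assumptions: the ids, prices and quantities are hypothetical sample values, `add_prices` is a name chosen here, and `None` stands in for the #N/A that Excel shows when a product id is missing from the price table.

```python
# Hypothetical price table: product id -> unit price.
prices = {101: 2.5, 102: 0.75, 103: 4.0}

# Hypothetical purchase rows: (customer id, product id, shop id, quantity).
purchases = [(1, 101, 1, 3), (1, 103, 2, 1), (2, 999, 1, 2)]


def add_prices(rows, price_table):
    """Append unit price and total price to every purchase row."""
    out = []
    for customer, product, shop, quantity in rows:
        unit = price_table.get(product)  # exact match only, like range_lookup = FALSE
        total = unit * quantity if unit is not None else None
        out.append((customer, product, shop, quantity, unit, total))
    return out


priced = add_prices(purchases, prices)
for row in priced:
    print(row)
```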
Combine Purchase data and Distance data based on Customer ID and Shop ID.
1. Write Concat & Distance as headers.
2. Concatenate customer ID and shop ID in the Purchase file, and do the same in
the Distance file:
=TEXTJOIN("delimiter",ignore_empty,text1,text2)
=TEXTJOIN("A",TRUE,A2,C2)
3. Write the VLOOKUP function in H2:
=VLOOKUP(G2,[Supermarket_distances.xlsx]Sheet1!$C$2:$D$301831,2,FALSE)
4. Copy G2 & H2 and paste them to the remaining cells.
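The concatenated key serves as a single lookup column for a two-column match. A minimal Python sketch of the same idea follows; the rows are hypothetical, `make_key` is a name chosen here, and the delimiter "A" is taken from the TEXTJOIN call. The delimiter matters because without it customer 1 in shop 12 and customer 11 in shop 2 would both produce the key "112".

```python
def make_key(customer_id, shop_id, delimiter="A"):
    """Join customer id and shop id, as =TEXTJOIN("A",TRUE,...) does."""
    return f"{customer_id}{delimiter}{shop_id}"


# Hypothetical distance rows: (customer id, shop id, distance).
distances = [(1, 1, 1200.5), (1, 2, 300.0), (2, 1, 4500.0)]
distance_by_key = {make_key(c, s): d for c, s, d in distances}

# Hypothetical purchase rows: (customer id, product id, shop id, quantity).
purchases = [(1, 101, 1, 3), (1, 103, 2, 1), (3, 101, 1, 2)]

# Append the looked-up distance; None marks a key with no match.
merged = [row + (distance_by_key.get(make_key(row[0], row[2])),) for row in purchases]
for row in merged:
    print(row)
```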
Merge/Combine worksheets using Query
1. Open the Data tab.
2. Click Get Data.
3. Choose Combine Queries if there are connections, or Launch Query Editor.
4. Click Merge.
5. Select the two files to merge, define the reference columns, and select the join kind.
6. Click on the arrows of the created column.
7. Select Expand or Aggregate of the second file.
8. Select the columns from the second file to merge into the first file.
9. Click OK.
Create a pivot table:
1. Click Pivot Table.
2. Select range A1:C10.
3. Create the pivot table in a new worksheet.
4. Click OK.
1. Put your cursor in any cell within the pivot table background.
2. Select Users as Rows.
3. Select Perms as Column.
4. Select Value as Sum of Values.
5. Double click on the Grand Total cell “186” to get the database table.
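The pivot layout above (Users as rows, Perms as columns, Sum of Values in the cells) amounts to summing the values for every row and column pair. A small Python sketch with hypothetical records, not the data from the slide, illustrates this; `pivot_sum` is a name chosen here.

```python
from collections import defaultdict

# Hypothetical records: (user, perm, value).
records = [
    ("ann", "read", 5),
    ("ann", "write", 2),
    ("bob", "read", 7),
    ("ann", "read", 1),
]


def pivot_sum(rows):
    """Sum the values for every (user, perm) pair: users as rows, perms as columns."""
    table = defaultdict(lambda: defaultdict(int))
    for user, perm, value in rows:
        table[user][perm] += value
    return {user: dict(columns) for user, columns in table.items()}


pivot = pivot_sum(records)
grand_total = sum(v for columns in pivot.values() for v in columns.values())
print(pivot)
print("Grand Total:", grand_total)
```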
1. Select “Multiple consolidation ranges” and click Next.
2. Select “Create a single page field for me” and click Next.
3. Select range G1:J4 and click Add.
4. Click Finish, followed by No.
De-pivot Table
1. Double click on the Grand Total cell “186”.
2. Change the headers' names and delete the Page1 column.
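De-pivoting turns a cross-tab back into one row per (row label, column label, value). A minimal Python sketch with a hypothetical cross-tab is given below; skipping empty cells is an assumption made here for brevity, not a statement about what Excel returns.

```python
# Hypothetical cross-tab: the first column holds the row label.
header = ["User", "read", "write"]
crosstab = [["ann", 6, 2], ["bob", 7, None]]


def unpivot(header, rows):
    """Return one (row label, column label, value) tuple per filled cell."""
    long_rows = []
    for row in rows:
        for column_name, value in zip(header[1:], row[1:]):
            if value is not None:  # assumption: empty cells are dropped
                long_rows.append((row[0], column_name, value))
    return long_rows


long_table = unpivot(header, crosstab)
for row in long_table:
    print(row)
```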
Consolidate worksheets:
1. Open a new worksheet.
2. Click Consolidate.
3. Select the type of Function.
4. Repeat to add more worksheets.
5. Check the labels boxes.
6. Click OK.
Group and subtotal:
1. Select Column and click OK.
2. Click Ungroup to remove the grouping.
3. Click Subtotal.
4. Select the grouping criteria.
5. Select the function used and the variable.
6. Select the “replace” & “summary” checkboxes and click OK.
The subtotals are inserted as new rows below each observation group.
E.g., 803.968 is the sum of total prices for customer 1.
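The effect of Subtotal, one summary row inserted after each group, can be sketched in Python. The rows below are hypothetical and do not reproduce the 803.968 figure; the sketch assumes the rows are already sorted by customer id, which Subtotal also relies on, and `with_subtotals` is a name chosen here.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, sorted by customer id: (customer id, total price).
rows = [(1, 7.5), (1, 4.0), (2, 1.5), (2, 2.5), (2, 1.0)]


def with_subtotals(sorted_rows):
    """Insert a '<customer> Total' row after each customer's group of rows."""
    out = []
    for customer, group in groupby(sorted_rows, key=itemgetter(0)):
        group = list(group)
        out.extend(group)
        out.append((f"{customer} Total", sum(price for _, price in group)))
    return out


result = with_subtotals(rows)
for row in result:
    print(row)
```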
Remove duplicates:
1. Click the Data tab.
2. Click Remove Duplicates.
3. Select the features.
4. Click OK.
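Remove Duplicates keeps the first row for each combination of the selected features and drops later repeats. A minimal Python sketch with hypothetical rows follows; `remove_duplicates` and the column positions are illustrative choices.

```python
def remove_duplicates(rows, key_columns):
    """Keep the first row for each combination of values in key_columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: (customer id, product id, shop id, quantity).
rows = [(1, 101, 1, 3), (1, 101, 1, 3), (1, 101, 2, 3), (2, 101, 1, 1)]

all_columns = remove_duplicates(rows, [0, 1, 2, 3])  # compare whole rows
by_customer_product = remove_duplicates(rows, [0, 1])  # compare two features only
print(all_columns)
print(by_customer_product)
```

Selecting fewer features removes more rows, because rows only need to agree on the selected columns to count as duplicates.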
Identify potential
outliers
Verify source
of variation
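Descriptive statistics: minimum, maximum, mean, standard deviation (std_dv), and

\[ \text{variance} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad \text{se(mean)} = \frac{\text{std\_dv}}{\sqrt{n}} \]

\[ \text{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{3/2}}, \qquad \text{se(skewness)} = \sqrt{\frac{6n(n-1)}{(n-2)(n+1)(n+3)}} \]

\[ \text{kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^4}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{2}} - 3, \qquad \text{se(kurtosis)} = 2\,\text{se(skewness)}\sqrt{\frac{n^2-1}{(n-3)(n+5)}} \]

A normal distribution has skewness of 0 and kurtosis of 3 (that is, 0 once 3 is subtracted as in the kurtosis formula above).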
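To make these statistics concrete, here is a minimal Python sketch that evaluates them for a small hypothetical sample. It assumes the moment-ratio definitions of skewness and kurtosis with 3 subtracted from the kurtosis, so a normal distribution scores about 0 on both; `describe` is a name chosen here, and Excel's own SKEW and KURT functions apply small-sample adjustments, so their values can differ.

```python
import math


def describe(xs):
    """Descriptive statistics of a sample xs (needs n > 3 for the standard errors)."""
    n = len(xs)
    mean = sum(xs) / n
    dev = [x - mean for x in xs]
    m2 = sum(d ** 2 for d in dev) / n  # second central moment
    m3 = sum(d ** 3 for d in dev) / n  # third central moment
    m4 = sum(d ** 4 for d in dev) / n  # fourth central moment
    variance = sum(d ** 2 for d in dev) / (n - 1)
    std_dv = math.sqrt(variance)
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return {
        "minimum": min(xs),
        "maximum": max(xs),
        "mean": mean,
        "variance": variance,
        "std_dv": std_dv,
        "se_mean": std_dv / math.sqrt(n),
        "skewness": m3 / m2 ** 1.5,
        "se_skewness": se_skew,
        "kurtosis": m4 / m2 ** 2 - 3,  # 3 subtracted: about 0 for normal data
        "se_kurtosis": 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5))),
    }


stats = describe([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical sample
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```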
Add Analysis ToolPak to Excel
[Histogram of Salary. Bin limits on the horizontal axis: 5640, 14520, 23400, 32280, 41160, 50040, 58920, 67800, 76680, 85560, More. Frequency axis from 0 to 100.]
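The binning behind such a histogram can be sketched in Python. The bin limits are the ones on the Salary axis; the salary values are hypothetical, `histogram` is a name chosen here, and the rule that a value is counted in the first bin whose limit is greater than or equal to it, with larger values going to "More", is an assumption of this sketch.

```python
from bisect import bisect_left

# Bin limits from the Salary axis: 5640, 14520, ..., 85560 (step 8880).
bins = [5640 + 8880 * k for k in range(10)]


def histogram(values, limits):
    """Count values per bin; the extra last slot is the 'More' bin."""
    counts = [0] * (len(limits) + 1)
    for v in values:
        counts[bisect_left(limits, v)] += 1  # first limit that is >= v
    return counts


salaries = [5640, 9000, 14520, 30000, 90000]  # hypothetical values
counts = histogram(salaries, bins)
for label, count in zip(bins + ["More"], counts):
    print(label, count)
```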
[Box plot diagram, labels from top to bottom:]
• Outlier (1.5×IQR, 3.0×IQR)
• Whisker
• 3rd quartile (75%)
• Median
• Inter-quartile range (IQR)
• 1st quartile (25%)
• Minimum value (including outlier & extreme values, if exists)
• Outlier (−1.5×IQR, −3.0×IQR)
• Extreme value (beyond −3.0×IQR)
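The outlier and extreme-value bands in the diagram can be computed directly. The sketch below uses hypothetical values and assumes quartiles are interpolated inclusively, as Python's `statistics.quantiles(..., method="inclusive")` does; other quartile conventions shift the fences slightly. `classify` is a name chosen here.

```python
import statistics


def classify(values):
    """Split values into regular points, outliers and extreme values by IQR distance."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    result = {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
              "regular": [], "outliers": [], "extremes": []}
    for v in values:
        distance = max(q1 - v, v - q3)  # how far v lies outside the box
        if distance > 3.0 * iqr:
            result["extremes"].append(v)
        elif distance > 1.5 * iqr:
            result["outliers"].append(v)
        else:
            result["regular"].append(v)
    return result


summary = classify([1, 2, 3, 4, 5, 6, 7, 8, 16, 50])  # hypothetical data
print(summary)
```

For this sample the box runs from 3.25 to 7.75, so values more than 6.75 beyond the box are outliers and values more than 13.5 beyond it are extreme values.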
Excel:
Create a 2-D Line chart with Markers.
Add data series Q1 (25%) from N6, Median (50%) from N7, and Q3 (75%) from N8.
Convert the Median series to a scatter plot with Markers.
Create the box: select series Q3, go to the Layout menu and choose the Up/Down Bars tab.
Delete the markers for Q3 and Q1.
Create the whiskers: select series Q3 (and Q1) and go to the Layout menu to choose the
Error Bars tab. Click on More Error Bars Options to get the dialog box. Change the
vertical error bars to Plus direction, Cap end style. For the error amount, choose
Custom and click on the Specify Value button to get a small dialog box. Select the
Positive Error Value from N11.
Create the median line: select the Median series, choose the Layout menu and select the
Error Bars tab to choose More Error Bars Options. From the Horizontal Error Bars dialog
box, choose Both direction, No Cap end style, and fix the error amount at 0.2.
[Scatter plot with two points labelled Outlier; vertical axis from 0.00 to 90000.00, horizontal axis from 0.00 to 70.00.]
[Bar chart of Gender.]
[Probability plot; horizontal axis: Observed Cum Prob, both axes from 0.0 to 1.0.]
• The points lie approximately along a straight line.
• Heavy/Long-tailed distribution: the points show a sharp upward and downward curve at both extremes.
• Negative/Left skewed
• Positive/Right skewed
Normal Probability (QQ) Plot
[Plot: vertical axis from 0 to 0.8, horizontal axis from -4 to 4.]
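The points of a normal probability (QQ) plot pair each sorted observation with the standard normal quantile at its plotting position. The following Python sketch uses a hypothetical sample and assumes the plotting position (i − 0.5)/n; other conventions exist and give slightly different points. `qq_points` is a name chosen here.

```python
from statistics import NormalDist


def qq_points(values):
    """Return (theoretical normal quantile, observed value) pairs for a QQ plot."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (standard_normal.inv_cdf((i - 0.5) / n), x)  # assumed plotting position
        for i, x in enumerate(ordered, start=1)
    ]


points = qq_points([12.0, 9.5, 11.0])  # hypothetical sample
for theoretical, observed in points:
    print(f"{theoretical:8.4f}  {observed}")
```

If the sample is close to normal, these points lie approximately along a straight line when plotted.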