INF30036 Lecture4

1. Data preparation methods are used to clean, transform, and validate data before modeling to address issues like inconsistent formats, missing values, outliers, and other problems.
2. Common data preparation methods include handling inconsistent formats, imputing missing values, identifying and managing outliers, and other techniques like listwise deletion, mean/median imputation, and random sampling.
3. When performing predictive analytics, the original data set is typically divided into training, validation, and test partitions to assess how well models generalize to new data and avoid overfitting or underfitting issues.

Uploaded by

Yehan Abayasinghe


Data issues

Data exploration
Data visualisation
Part 1
Data visualisation
What is Visualization?

Visual representations of data that reinforce human cognition

Visual exploration

• Visualization:
  – “The use of computer-supported, interactive, visual representations of data to amplify cognition.”
  – Goal: discovery, decision making, explanation

Examples of Visualization

The iris dataset

Boxplots

Boxplots – Used to summarize quantitative/numeric data
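
As a minimal sketch of what a boxplot summarizes, the base R call below computes the box statistics for petal length by species in the built-in iris data. The choice of Petal.Length and the row labels are illustrative assumptions, not taken from the slides.

```r
# Petal length grouped by species, from R's built-in iris data set.
# plot = FALSE returns the numbers behind the boxes without drawing;
# drop that argument to draw the chart itself.
bp <- boxplot(Petal.Length ~ Species, data = iris, plot = FALSE)

# Rows of bp$stats: lower whisker, lower hinge, median,
# upper hinge, upper whisker (one column per group).
box_summary <- bp$stats
colnames(box_summary) <- bp$names
rownames(box_summary) <- c("lower_whisker", "lower_hinge", "median",
                           "upper_hinge", "upper_whisker")
print(box_summary)

# Points beyond the whiskers are reported separately as potential outliers.
print(bp$out)
```

Reading the columns side by side gives the same comparison the drawn boxes give: where each group is centred and how spread out it is.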

Box plots in detail

Common Graphical Parameters

Saving graphs

Pie chart

Plot

xyplot

In-class exercise (10 min)

• Compare 2D scatterplots for the iris dataset (Petal.Length and Petal.Width). Summarize your findings.

Histogram

Density plots

Multiple density plots

Scatterplot mix

Scatterplot matrix

Part 2
Ggplot2
Diamonds data set

ggplot Fundamentals

• ggplot() is the basic function
• geom_*() creates a graph layer
  – geom_histogram()
  – geom_point()
• aes() defines an “aesthetic” either globally or by layer

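
A minimal sketch of how these three pieces combine, assuming the ggplot2 package is installed and using its bundled diamonds data set. The variables chosen (price, carat, cut) and the bins and alpha values are illustrative assumptions.

```r
library(ggplot2)  # provides ggplot(), aes(), the geoms, and the diamonds data

# Global aesthetic: x is set once in ggplot() and inherited by every layer.
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 30)

# Layer-level aesthetic: color is mapped only inside geom_point().
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), alpha = 0.3)

# Printing a plot object renders it, e.g. print(p_scatter)
```

Putting an aesthetic in ggplot() shares it across all layers, while putting it in a geom limits it to that layer.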
ggplot2- Layering

Histogram

Density plots

Scatterplot

Scatterplot - Segmentation

Scatterplot - Segmentation

Separating segments - 1

Separating segments - 2

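library(ggplot2)  # provides ggplot() and the diamonds data set

# One panel per diamond color; points colored by clarity
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) +
  geom_point() +
  facet_wrap(~ color)
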
More segmentation

# Grid of panels: one row per clarity level, one column per color
ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point(aes(color=clarity)) +
  facet_grid(clarity ~ color)

Protocol for data exploration

1. Look at the first few rows of data with column names
using head(). Identify the predicted and predictor
columns. Ex: Survived is the predicted column.
2. Next, identify the data type of each column using
str(). Pay particular attention to numerical and
factor variables.
3. Explore relationships between predictor and predicted
columns and identify strong vs. weak predictors.
4. Factor variables can be used to segment the dataset
using colors.
5. Summarize findings.

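
The protocol can be sketched in base R as follows. This is an illustration on the built-in iris data, with Species treated as the predicted column by assumption; the slides' own example data set is not reproduced here.

```r
# Step 1: first rows and column names
print(head(iris))

# Step 2: data type of each column; note which are numeric and which are factors
str(iris)
is_factor <- sapply(iris, is.factor)
factor_cols <- names(iris)[is_factor]

# Step 3: relate predictors to the predicted column (assumed here: Species).
# Predictors whose group means differ widely are candidates for strong predictors.
group_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(group_means)

# Step 4: a factor can segment a scatterplot by color, for example:
# plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species)
```

Step 5 is then a written summary of what the group means and the plots show.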
Part 3
Data prep (structured)
Data preparation methods
Data in its raw, original form is typically not ready to be analysed
and modelled. Data sets are often merged and contain inconsistent formats,
missing data, miscoded data, incorrect data, and duplicate data.

The data needs to be analysed, “cleansed,” transformed, and validated
before model creation. This step can take a significant amount of time
but is vital to the process. Some common methods for handling these
problems are discussed below.

Data preparation methods – inconsistent formats

Data in a single column must have a consistent format. When data sets are
merged, the same data can arrive in different formats. Dates are a
common example.

A date column cannot mix formats such as mm/dd/yyyy and mm/dd/yy.

The data must be corrected so that every value uses the same format.

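
A minimal sketch of one way to reconcile the two date formats in base R. The sample values and the rule for telling the formats apart (a four-digit year at the end of the string) are assumptions made for illustration.

```r
# Hypothetical merged column mixing mm/dd/yyyy and mm/dd/yy
raw <- c("03/15/2021", "03/16/21", "12/01/2020", "01/02/19")

# Decide per value which format applies: does it end in a four-digit year?
four_digit <- grepl("/[0-9]{4}$", raw)

# Parse each group with its own format string into one Date vector
clean <- as.Date(rep(NA, length(raw)))
clean[four_digit]  <- as.Date(raw[four_digit],  format = "%m/%d/%Y")
clean[!four_digit] <- as.Date(raw[!four_digit], format = "%m/%d/%y")

# Every value now prints in the same ISO format (yyyy-mm-dd)
print(format(clean))
```

Note that %y has to guess the century for two-digit years, so the result is worth checking when the data may span more than one century.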
Data preparation methods – Missing values

Missing data is a data value that is not stored for a variable in the
observation of interest. There are many reasons that the value may be
missing.

The data may not have been available, or the value may have been
accidentally omitted. When analysing data, first determine the pattern of
the missing data.
There are three pattern types:

• missing completely at random (MCAR),
• missing at random (MAR), and
• missing not at random (MNAR).

Missing completely at random occurs when there is no pattern in the
missing data for any variable. Missing at random occurs when there is a
pattern in the missing data but not on the primary dependent variables.
Missing not at random occurs when the chance that a value is missing
depends on the missing value itself.

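
As a first step toward judging the missing-data pattern, the missing values can simply be counted. The sketch below uses a small made-up data frame; the column names and values are hypothetical.

```r
# Hypothetical data with some values not recorded (NA)
df <- data.frame(
  age    = c(23, NA, 31, 45, NA, 52),
  income = c(40, 52, NA, 61, 58, 75),
  gender = factor(c("F", "M", NA, "F", "M", "M"))
)

# Number and share of missing values in each column
missing_per_column <- colSums(is.na(df))
missing_share <- colMeans(is.na(df))
print(missing_per_column)
print(round(missing_share, 2))

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(df))
print(complete_rows)
```

Counts alone do not settle whether the data are MCAR, MAR, or MNAR, but they show which variables need a closer look.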
Data preparation methods – Outliers
An outlier is a data value that lies an abnormal distance from the other data
values in the data set. Outliers can be visually identified by constructing
histograms or box plots and looking for values that are too high or too low.
There are five common methods to manage outliers:

1. Remove the outliers from the modelling data.
2. Separate the outliers and create separate models.
3. Transform the outliers so that they are no longer outliers.
4. Bin the data.
5. Leave the outliers in the data.

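
A minimal base R sketch of identifying outliers with the boxplot rule and of methods 1, 3, and 4. The sample vector, the log transform, and the bin boundaries are illustrative assumptions.

```r
# Hypothetical measurements with one value far from the rest
x <- c(12, 15, 14, 13, 16, 15, 14, 95)

# Identify: values beyond the boxplot whiskers (1.5 times the box width)
flagged <- boxplot.stats(x)$out
print(flagged)

# Method 1: remove the flagged values from the modelling data
x_removed <- x[!x %in% flagged]

# Method 3: transform; a log shrinks the gap between large and small values,
# although it does not guarantee the value stops being flagged
x_logged <- log(x)

# Method 4: bin the data so extreme values fall into an open-ended top bin
x_binned <- cut(x, breaks = c(0, 13, 15, Inf), labels = c("low", "mid", "high"))
print(table(x_binned))
```

Which method is appropriate depends on whether the extreme value is an error or a genuine observation.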
Data preparation methods – other methods
In predictive modelling, depending on the type of model being used,
missing values may result in analysis problems. There are two strategies for
dealing with missing values: deletion (listwise or column) and
imputation. Listwise deletion involves deleting the row (or record) from the
data set.

If there are just a few missing values, this may be an appropriate approach.
However, a smaller data set can weaken the predictive power of the model.
Column deletion removes any variable that contains missing values.

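
Both kinds of deletion can be sketched in base R. The data frame below is hypothetical.

```r
# Hypothetical data: age and city each have one missing value
df <- data.frame(
  age    = c(23, NA, 31, 45),
  income = c(40, 52, 38, 61),
  city   = c("A", "B", NA, "A")
)

# Listwise deletion: drop every row that has at least one missing value
listwise <- na.omit(df)
print(nrow(listwise))

# Column deletion: keep only the variables that have no missing values
col_deleted <- df[, colSums(is.na(df)) == 0, drop = FALSE]
print(names(col_deleted))
```

Here listwise deletion halves the rows and column deletion discards two of three variables, which illustrates how quickly deletion can shrink a data set.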
Data preparation methods – other methods

Deleting a variable that contains just a few missing values is not
recommended. The second, and more advantageous, strategy is imputation.
Imputation replaces a missing data value with a value that is a
reasonable substitute. The common imputation methods are:

• Replace the missing values with a constant value. Typically, for a
numeric variable, 0 is the constant value entered. However, this can be
problematic: replacing a missing age with 0 does not make sense,
and for a categorical variable such as gender, replacing every missing value
with a constant such as F can misrepresent the data. This approach works
well when the values are missing completely at random (MCAR).

Data preparation methods – other methods
• Replace missing numeric values with the mean (average) or median
(middle value in the variable).

Replacing missing values with the mean of the variable is a common and
easy method, and it is unlikely to impair the model’s predictability
because, on average, values should approach the mean.

However, if there are many missing values for a particular variable,
replacing them with the mean can cause problems: the many values placed
at the mean cause a spike in the variable’s distribution and a smaller
standard deviation.

If this is the case, replacing with the median may be a better approach.

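
A short base R sketch of mean and median imputation, including the effect on the standard deviation described above. The sample vector is made up for illustration.

```r
# Hypothetical numeric variable with two missing values and one large value
x <- c(10, 12, NA, 14, 16, NA, 18, 60)

# Mean imputation: fill each NA with the mean of the observed values
x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Median imputation: fill each NA with the median of the observed values
x_median <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# Mean imputation keeps the mean unchanged but shrinks the spread,
# because the filled-in values sit exactly at the centre
sd_before <- sd(x, na.rm = TRUE)
sd_after  <- sd(x_mean)
print(c(before = sd_before, after = sd_after))
```

In this example the mean is pulled upward by the value 60, so the median gives a fill value closer to the bulk of the data.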
Data preparation methods – other methods

• Replace missing categorical values with the mode (the most frequent
value), as there is no mean or median. Missing numeric values could also be
replaced by the mode.

• Replace the missing value with a value randomly selected from the
variable’s own distribution. This is preferred over mean imputation;
however, it is not as simple.

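
The last two methods can be sketched in base R as below. The helper names impute_mode and impute_random are hypothetical, the sample data are made up, and drawing from the observed values with replacement is one simple reading of "the variable's own distribution".

```r
# Mode imputation for a categorical (character) variable
impute_mode <- function(x) {
  counts <- table(x)                              # table() skips NA by default
  mode_value <- names(counts)[which.max(counts)]  # most frequent observed value
  x[is.na(x)] <- mode_value
  x
}

# Random imputation: draw each fill value from the observed values
impute_random <- function(x) {
  missing <- is.na(x)
  observed <- x[!missing]
  picks <- sample.int(length(observed), sum(missing), replace = TRUE)
  x[missing] <- observed[picks]
  x
}

gender <- c("F", "M", NA, "F", "M", "F", NA)
gender_filled <- impute_mode(gender)
print(gender_filled)

set.seed(1)  # fixed seed so the random draw is repeatable
age <- c(23, NA, 31, 45, NA, 52)
age_filled <- impute_random(age)
print(age_filled)
```

Unlike mean imputation, the random draw does not pile values at a single point, so it tends to keep the shape of the distribution.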
Part 4
Data Partitioning
considerations
Data Sets and Partitioning

In predictive analytics, to assess how well your model behaves when
applied to new data, the original data set is divided into multiple
partitions: training, validation, and optionally test. Partitioning is
normally performed randomly to protect against bias, but stratified
sampling can also be used. The training partition is used to train or
build the model.

For example, in regression analysis, the training set is used to fit the
linear regression model. In neural networks, the training set is used to
obtain the model’s network weights.

Data Sets and Partitioning

After fitting the model on the training partition, the performance of the
model is tested on the validation partition. The best-fitting model is
most often chosen based on its accuracy with the validation data set.
After selecting the best-fit model, it is a good idea to check the
performance of the model against the test partition, which was not used
in either training or validation. Holding out a separate test partition
is most common when dealing with big data.

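
A minimal sketch of random partitioning followed by fitting and validating a model, using the built-in iris data. The 60/20/20 split, the seed, the choice of regression formula, and RMSE as the accuracy measure are all assumptions made for illustration.

```r
set.seed(42)                 # fixed seed so the random split is repeatable
n <- nrow(iris)
shuffled <- sample(n)        # random permutation of the row indices

# Assumed proportions: 60% training, 20% validation, 20% test
n_train <- floor(0.6 * n)
n_valid <- floor(0.2 * n)
train <- iris[shuffled[1:n_train], ]
valid <- iris[shuffled[(n_train + 1):(n_train + n_valid)], ]
test  <- iris[shuffled[(n_train + n_valid + 1):n], ]

# Fit on the training partition only
fit <- lm(Petal.Width ~ Petal.Length, data = train)

# Root mean squared error of the fitted model on any partition
rmse <- function(model, data) {
  sqrt(mean((data$Petal.Width - predict(model, newdata = data))^2))
}

valid_rmse <- rmse(fit, valid)  # used to compare candidate models
test_rmse  <- rmse(fit, test)   # final check on data not used so far
print(c(validation = valid_rmse, test = test_rmse))
```

With several candidate models, the one with the lowest validation error would be kept and only that model scored on the test partition.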
Data Sets and Partitioning –
overfitting and underfitting issues

It is important to find the correct level of model complexity. A model
that is not complex enough, referred to as underfit, may lack the
flexibility to accurately represent the data.

A model that is too complex, referred to as overfit, can be influenced
by random noise in the training data. Overfitting becomes more likely when
the training set is small relative to the model’s complexity. Often
analysts will partition the data set early in the data preparation process.

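
Model complexity can be explored by fitting models of increasing flexibility and comparing training and validation error. The sketch below uses simulated data and polynomial regression; the data-generating curve, the noise level, the split, and the degrees 1, 3, and 15 are all assumptions chosen for illustration.

```r
set.seed(7)
x <- seq(0, 1, length.out = 60)
y <- sin(2 * pi * x) + rnorm(60, sd = 0.3)   # smooth curve plus random noise

train_idx <- sample(60, 40)                  # 40 training rows, 20 validation rows
train <- data.frame(x = x[train_idx], y = y[train_idx])
valid <- data.frame(x = x[-train_idx], y = y[-train_idx])

# Training and validation RMSE for a polynomial of the given degree
rmse_for_degree <- function(degree) {
  fit <- lm(y ~ poly(x, degree), data = train)
  c(train = sqrt(mean(residuals(fit)^2)),
    valid = sqrt(mean((valid$y - predict(fit, newdata = valid))^2)))
}

results <- sapply(c(1, 3, 15), rmse_for_degree)
colnames(results) <- c("degree_1", "degree_3", "degree_15")
print(round(results, 3))
```

Training error can only fall as the degree rises. A validation error that stops falling, or rises again, is the usual sign that the extra flexibility is fitting noise.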
Thank You for your attention.

Q&A
