INF30036 Lecture4

1. Data preparation methods are used to clean, transform, and validate data before modeling to address issues like inconsistent formats, missing values, outliers, and other problems.
2. Common data preparation methods include handling inconsistent formats, imputing missing values, identifying and managing outliers, and other techniques like listwise deletion, mean/median imputation, and random sampling.
3. When performing predictive analytics, the original data set is typically divided into training, validation, and test partitions to assess how well models generalize to new data and avoid overfitting or underfitting issues.

Data issues

Data exploration
Data visualisation
Part 1
Data visualisation
What is Visualization?

Visual representations of data that reinforce human cognition

Visual exploration

• Visualization:
> “The use of computer-supported, interactive,
visual representations of data to amplify
cognition.”
> Goal: discovery, decision making, explanation

Examples of Visualization

The iris dataset

Boxplots

Boxplots – Used to summarize quantitative/numeric data
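
As a minimal sketch of the idea, the base R call below summarises one numeric column of the built-in iris data by species. The choice of Sepal.Length and the plot labels are assumptions for illustration, not taken from the slides.

```r
# Numbers behind a boxplot, computed without drawing (plot = FALSE)
stats <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)

# Each column of stats$stats holds, for one species:
# lower whisker, lower hinge, median, upper hinge, upper whisker
colnames(stats$stats) <- stats$names
print(stats$stats)

# Draw the boxplots themselves
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length by species",
        ylab = "Sepal length (cm)")
```

Points drawn beyond the whiskers are the values that the boxplot flags as potential outliers.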

Box plots in detail

Common Graphical Parameters

Saving graphs

Pie chart

Plot

xyplot

In class exercise ( 10 min)

• Compare 2D scatterplots for the iris dataset (Petal.Length and
  Petal.Width). Summarize your findings.

Histogram

Density plots

Multiple density plots

Scatterplot mix

Scatterplot matrix

Part 2
ggplot2
Diamonds data set

ggplot Fundamentals

• ggplot() is the basic function

• geom_*() creates a graph layer
   – geom_histogram()
   – geom_point()

• aes() defines an “aesthetic” either globally or by layer
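
A short sketch of how these three pieces fit together, assuming the ggplot2 package is installed and using its diamonds data set. The binwidth and alpha values are illustrative choices, not settings from the lecture.

```r
library(ggplot2)  # provides ggplot(), geom_*(), aes() and the diamonds data

# Global aesthetic: every layer added to this plot inherits x = carat
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

# Layer-level aesthetic: colour is mapped only inside geom_point()
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity), alpha = 0.3)

print(p_hist)
print(p_scatter)
```

Defining an aesthetic globally keeps layers short, while defining it per layer lets one layer use a mapping that the others do not share.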

ggplot2- Layering

Histogram

Density plots

Scatterplot

Scatterplot - Segmentation

Scatterplot - Segmentation

Separating segments - 1

Separating segments - 2

ggplot(diamonds, aes(x=carat, y=price, color=clarity)) +
  geom_point() +
  facet_wrap(~ color)
More segmentation

ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point(aes(color=clarity)) +
  facet_grid(clarity ~ color)
Protocol for data exploration

1. Look at the first few rows of data with column names using head().
   Identify the predicted and predictor columns. Ex: Survived is the
   predicted column.
2. Next, identify the data type of each column using str(). Pay
   particular attention to numerical and factor variables.
3. Explore relationships between predictor and predicted columns and
   identify strong vs. weak predictors.
4. Factor variables can be used to segment the dataset using colors.
5. Summarize findings.
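
The protocol above can be sketched in base R as follows. The built-in iris data stands in for the lecture's data set, with Species treated as the predicted column; that substitution is an assumption made so the example runs without extra files.

```r
# Step 1: first rows and column names
print(head(iris))

# Step 2: data types, separating numeric from factor columns
str(iris)
is_num <- sapply(iris, is.numeric)
is_fac <- sapply(iris, is.factor)

# Step 3: relationships between predictors and the predicted column.
# Group means that differ a lot across species hint at stronger predictors.
print(aggregate(. ~ Species, data = iris, FUN = mean))

# Step 4: use the factor variable to segment the plots by colour
pairs(iris[, is_num], col = as.integer(iris$Species))

# Step 5: numeric summary to support the written findings
print(summary(iris))
```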

Part 3
Data prep (structured)
Data preparation methods
Data in its raw, original form is typically not ready to be analysed
and modelled.

Data sets are often merged and contain inconsistent formats, missing
data, miscoded data, incorrect data, and duplicate data.

The data needs to be analysed, “cleansed,” transformed, and validated
before model creation.

This step can take a significant amount of time but is vital to the
process. Some common methods for handling these problems are discussed
below.
Data preparation methods – inconsistent formats

Data in a single column must have a consistent format. When data sets are
merged, the same data can arrive in different formats. Dates are a common
example.

A single date column cannot mix formats such as mm/dd/yyyy and mm/dd/yy.
The data must be corrected to use one consistent format.
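
One way to repair a mixed date column in base R is sketched below. The sample values are hypothetical, and the approach assumes only the two formats named above occur. Note that R's %y conversion maps two-digit years 00 to 68 onto 2000 to 2068, which may not suit every data set.

```r
# Hypothetical merged column mixing mm/dd/yyyy and mm/dd/yy
raw <- c("03/15/2021", "03/16/21", "12/01/2020", "1/5/19")

# Decide the format per value from the length of the year part
year_part <- sub(".*/", "", raw)
two_digit <- nchar(year_part) == 2

# Parse everything as four-digit years, then re-parse the two-digit ones
parsed <- as.Date(raw, format = "%m/%d/%Y")
parsed[two_digit] <- as.Date(raw[two_digit], format = "%m/%d/%y")

# Write the column back in a single consistent format
consistent <- format(parsed, "%m/%d/%Y")
print(consistent)
```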

Data preparation methods – Missing values

A missing value is a data value that is not stored for a variable in the
observation of interest. There are many reasons that a value may be
missing.

The data may not have been available, or the value may have been
accidentally omitted. When analysing data, first determine the pattern of
the missing data.
There are three pattern types:

missing completely at random (MCAR), missing at random (MAR), and missing not
at random (MNAR). Missing completely at random occurs when there is no pattern
in the missing data for any variable. Missing at random occurs when there is a
pattern in the missing data but not on the primary dependent variables. Missing
not at random occurs when the missingness depends on the unobserved values
themselves.
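
A first look at the missing-data pattern might be sketched as below, using R's built-in airquality data as a stand-in (an assumption; the lecture does not name a data set here). Tables like these can hint at a pattern, but they cannot by themselves establish whether the data are MCAR, MAR, or MNAR.

```r
# How many values are missing in each column?
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Does missingness in Ozone vary with another variable, here Month?
ozone_missing <- is.na(airquality$Ozone)
missing_by_month <- table(Month = airquality$Month,
                          OzoneMissing = ozone_missing)
print(missing_by_month)

# Share of missing Ozone values per month
print(round(tapply(ozone_missing, airquality$Month, mean), 2))
```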

Data preparation methods – Outliers
An outlier is a data value that is an abnormal distance from the other data
values in the data set. Outliers can be visually identified by constructing
histograms or box plots and looking for values that are too high or too low.
There are five common methods to manage the outliers:

1. Remove the outliers from the modelling data.
2. Separate the outliers and create separate models.
3. Transform the outliers so that they are no longer outliers.
4. Bin the data.
5. Leave the outliers in the data.
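
The sketch below flags outliers with the same rule a box plot uses (values beyond 1.5 times the hinge spread) and then illustrates three of the options above. The data vector and the bin boundaries are hypothetical.

```r
# Hypothetical measurements with one extreme value
x <- c(12, 15, 14, 13, 16, 15, 14, 95)

# Values a box plot would draw beyond its whiskers
out <- boxplot.stats(x)$out
is_out <- x %in% out

# Option 1: remove the outliers from the modelling data
x_removed <- x[!is_out]

# Option 3: transform, here with a log to pull in the long right tail
x_log <- log(x)

# Option 4: bin the data so the extreme value falls into the top bin
x_binned <- cut(x, breaks = c(-Inf, 13, 15, Inf),
                labels = c("low", "mid", "high"))

print(out)
print(table(x_binned))
```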

Data preparation methods – Missing values

There are three pattern types:

• missing completely at random (MCAR),
• missing at random (MAR), and
• missing not at random (MNAR)

Missing completely at random occurs when there is no pattern in the
missing data for any variable. Missing at random occurs when there is a
pattern in the missing data but not on the primary dependent variables.
Data preparation methods – other methods
In predictive modelling, depending on the type of model being used,
missing values may cause analysis problems. There are two strategies for
dealing with missing values: deletion (listwise or column) and
imputation. Listwise deletion involves deleting the row (or record) from
the data set.

If there are just a few missing values, this may be an appropriate approach.
However, a smaller data set can weaken the predictive power of the model.
Column deletion removes any variable that contains missing values.
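
Both deletion strategies can be sketched in a few lines of base R. The built-in airquality data is used as a stand-in, which is an assumption for illustration.

```r
# Listwise deletion: drop every row that has at least one missing value
rows_complete <- na.omit(airquality)

# Column deletion: drop every variable that has at least one missing value
has_na <- colSums(is.na(airquality)) > 0
cols_complete <- airquality[, !has_na]

# Compare what each strategy costs in rows and columns
print(dim(airquality))
print(dim(rows_complete))
print(dim(cols_complete))
```

Comparing the dimensions shows the trade-off: listwise deletion loses observations, while column deletion loses whole variables.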

Data preparation methods – other methods

Deleting a variable that contains just a few missing values is not
recommended. The second, more advantageous strategy is imputation.
Imputation replaces a missing data value with a reasonable substitute
value. The common imputation methods are:

• Replace the missing values with a constant value. Typically, for a
numeric variable, 0 is the constant value entered. However, this can be
problematic: replacing a missing age with 0 does not make sense, and
neither does replacing a missing value of a categorical variable such as
gender with a constant such as F. This works well when the missing value
is completely at random (MCAR).
Data preparation methods – other methods
Replace missing numeric values with the mean (average) or median
(middle value in the variable).

Replacing missing values with the mean of the variable is a common and
easy method, and it is unlikely to impair the model’s predictability, as
on average values should approach the mean.

However, if there are many missing values for a particular variable,
replacing them with the mean can cause problems: the extra values at the
mean cause a spike in the variable’s distribution and a smaller standard
deviation.

If this is the case, replacing with the median may be a better approach.
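
The effect on the standard deviation can be checked directly. This sketch imputes the Ozone column of the built-in airquality data (chosen here as an assumption, because about a quarter of its values are missing) with the mean and with the median.

```r
oz <- airquality$Ozone

# Mean and median imputation of the missing values
oz_mean   <- ifelse(is.na(oz), mean(oz, na.rm = TRUE), oz)
oz_median <- ifelse(is.na(oz), median(oz, na.rm = TRUE), oz)

# The mean is preserved by mean imputation, but the spread shrinks
print(c(observed_mean = mean(oz, na.rm = TRUE), imputed_mean = mean(oz_mean)))
print(c(observed_sd = sd(oz, na.rm = TRUE),
        mean_imputed_sd = sd(oz_mean),
        median_imputed_sd = sd(oz_median)))

# The spike at the imputed value shows up as one tall histogram bar
hist(oz_mean, breaks = 30, main = "Ozone after mean imputation")
```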

Data preparation methods – other methods

• Replace missing categorical values with the mode (the most frequent
value), as there is no mean or median. Missing numeric values could also
be replaced by the mode.

• Replace the missing value by randomly selecting a value from the
variable’s own observed distribution. This is preferred over mean
imputation, but it is not as simple.
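
Both methods can be sketched as small helper functions. The names impute_mode and impute_random are hypothetical, as is the gender vector; neither comes from the lecture. The random draw samples with replacement from the observed values of the same variable and assumes more than one observed value is available.

```r
# Mode imputation: fill NAs with the most frequent observed value
impute_mode <- function(x) {
  counts <- table(x)                         # table() ignores NA by default
  mode_value <- names(counts)[which.max(counts)]
  x[is.na(x)] <- mode_value
  x
}

# Random imputation: fill NAs with draws from the observed values
impute_random <- function(x) {
  missing <- is.na(x)
  x[missing] <- sample(x[!missing], size = sum(missing), replace = TRUE)
  x
}

gender <- c("F", "M", NA, "F", "M", "F", NA)  # hypothetical data
gender_filled <- impute_mode(gender)
print(gender_filled)

set.seed(42)                                  # make the draws repeatable
ozone_filled <- impute_random(airquality$Ozone)
print(sd(ozone_filled))
```

Unlike mean imputation, the random draw does not pile all the filled values onto a single point of the distribution.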

Part 4
Data Partitioning considerations
Data Sets and Partitioning

In predictive analytics, to assess how well your model behaves when
applied to new data, the original data set is divided into multiple
partitions: training, validation, and optionally test. Partitioning is
normally performed randomly to protect against any bias, but stratified
sampling can be performed. The training partition is used to train or
build the model.

For example, in regression analysis, the training set is used to fit the
linear regression model. In neural networks, the training set is used to
obtain the model’s network weights.
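
A random three-way partition might be sketched as follows. The 60/20/20 split, the seed, and the use of the built-in iris data are all assumptions for illustration; the lecture does not prescribe proportions.

```r
set.seed(123)  # make the random partition repeatable

n <- nrow(iris)
n_train <- round(0.6 * n)
n_valid <- round(0.2 * n)
n_test  <- n - n_train - n_valid  # remainder, so the sizes always sum to n

# Shuffle the partition labels and assign one to every row
partition <- sample(rep(c("train", "validation", "test"),
                        times = c(n_train, n_valid, n_test)))

train <- iris[partition == "train", ]
valid <- iris[partition == "validation", ]
test  <- iris[partition == "test", ]

# Only the training partition is used to fit the model
fit <- lm(Petal.Width ~ Petal.Length, data = train)
print(coef(fit))
```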

Data Sets and Partitioning

After fitting the model on the training partition, the performance of the
model is tested on the validation partition. The best-fitting model is
most often chosen based on its accuracy with the validation data set.
After selecting the best-fit model, it is a good idea to check its
performance against the test partition, which was not used in either
training or validation. This is often the case when dealing with big
data.
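
The workflow above can be sketched with two candidate regression models. The formulas, the split sizes, and the use of root mean squared error in place of classification accuracy are assumptions made for this example.

```r
set.seed(123)
partition <- sample(rep(c("train", "validation", "test"),
                        times = c(90, 30, 30)))
train <- iris[partition == "train", ]
valid <- iris[partition == "validation", ]
test  <- iris[partition == "test", ]

# Root mean squared error of a fitted model on a given partition
rmse <- function(model, data) {
  sqrt(mean((data$Petal.Width - predict(model, newdata = data))^2))
}

# Candidate models, all fitted on the training partition only
candidates <- list(
  simple = Petal.Width ~ Petal.Length,
  fuller = Petal.Width ~ Petal.Length + Sepal.Length + Species
)
fits <- lapply(candidates, lm, data = train)

# Choose the model with the lowest error on the validation partition
valid_rmse <- sapply(fits, rmse, data = valid)
best <- names(which.min(valid_rmse))

# Final check on the test partition, which played no part in the choice
test_rmse <- rmse(fits[[best]], test)
print(valid_rmse)
print(c(best_model = best))
print(test_rmse)
```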

Data Sets and Partitioning – overfitting and underfitting issues

It is important to find the correct level of model complexity. A model
that is not complex enough, referred to as underfit, may lack the
flexibility to accurately represent the data.

This can happen when the model is too simple for the patterns in the
data. When the model is too complex, or overfit, it can be influenced by
random noise. This is more likely when the training set is small relative
to the model's complexity. Often analysts will partition the data set
early in the data preparation process.
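
One way to see both problems is to fit polynomials of increasing degree to noisy data and compare training and validation error. Everything in this sketch (the sine-shaped signal, the noise level, the sample sizes, and the degrees tried) is a simulated assumption, not material from the lecture.

```r
set.seed(1)

# Simulated data: a smooth signal plus random noise
x <- runif(60, 0, 1)
y <- sin(2 * pi * x) + rnorm(60, sd = 0.3)
train <- data.frame(x = x[1:30],  y = y[1:30])
valid <- data.frame(x = x[31:60], y = y[31:60])

rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# Degree 1 is a straight line; degree 15 is very flexible
degrees <- c(1, 3, 15)
errors <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train = rmse(fit, train), validation = rmse(fit, valid))
}))
print(errors)
```

Training error can only fall as the degree rises, so it cannot reveal overfitting on its own. A validation error that is much larger than the training error is the usual warning sign.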

Thank You for your attention.

Q&A
