INF30036 Lecture4
Data exploration
Data visualisation
Part 1
Data visualisation
What is Visualization?
Visual exploration
• Visualization:
> “The use of computer-supported, interactive,
visual representations of data to amplify
cognition.”
> Goal: discovery, decision making, explanation
Examples of Visualization
The iris dataset
Boxplots
Box plots in detail
Common Graphical Parameters
Saving graphs
Pie chart
Plot
xyplot
In-class exercise (10 min)
Histogram
Density plots
Multiple density plots
Scatterplot mix
Scatterplot matrix
Part 2
ggplot2
Diamonds data set
ggplot Fundamentals
• geom_histogram()
• geom_point()
ggplot2 - Layering
Histogram
Density plots
Scatterplot
Scatterplot - Segmentation
Scatterplot - Segmentation
Separating segments - 1
Separating segments - 2
Part 3
Data prep (structured)
Data preparation methods
Data in its raw, original form is typically not ready to be analysed
and modelled.
Data in a single column must have consistent formats. When data sets are
merged together, this can result in the same data with different formats. For
example, dates can be problematic.
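As a minimal sketch of the date problem, the Python snippet below converts dates that arrive in several formats after a merge into one consistent format. The sample values and the list of candidate formats are assumptions made for illustration; they do not come from the lecture data.

```python
from datetime import datetime

# Hypothetical candidate formats; extend this list to match the merged sources.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y%m%d"]

def normalise_date(text):
    """Return the date as YYYY-MM-DD, or None if no candidate format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None

# Illustrative column holding the same date in three formats plus one bad value.
raw_dates = ["2023-03-14", "14/03/2023", "20230314", "not a date"]
clean_dates = [normalise_date(d) for d in raw_dates]
print(clean_dates)
```

Values that match no format come back as None, so they can be handled as missing values. Note that a day/month format such as 03/04/2023 is ambiguous without knowing its source, so the order of the candidate formats is itself an assumption.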
Data preparation methods – Missing values
Missing data is a data value that is not stored for a variable in the
observation of interest. There are many reasons that the value may be
missing.
The data may not have been available, or the value may have just been
accidentally omitted. When analysing data, first determine the pattern of
the missing data.
There are three pattern types:
missing completely at random (MCAR), missing at random (MAR), and missing not
at random (MNAR). Missing completely at random occurs when there is no pattern
in the missing data for any variable. Missing at random occurs when there is a pattern
in the missing data but not on the primary dependent variables. Missing not at
random occurs when there is a pattern in the missing data that does affect the
primary dependent variables.
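A first step towards determining the pattern is simply to count where values are missing. The sketch below assumes each observation is a Python dict and that a missing value is stored as None; the records and the function names are hypothetical.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "gender": "F"},
    {"age": None, "income": 61000, "gender": "M"},
    {"age": 45, "income": None, "gender": None},
    {"age": 29, "income": 48000, "gender": "F"},
    {"age": None, "income": None, "gender": "M"},
]

def missing_counts(records):
    """Count the missing values for each variable."""
    counts = {}
    for record in records:
        for name, value in record.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts

def missing_patterns(records):
    """Count how often each combination of missing variables occurs."""
    patterns = {}
    for record in records:
        key = tuple(sorted(n for n, v in record.items() if v is None))
        patterns[key] = patterns.get(key, 0) + 1
    return patterns

print(missing_counts(rows))
print(missing_patterns(rows))
```

Counts like these only describe where the gaps are. Deciding between MCAR, MAR and MNAR still requires judgement about why the values are missing.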
Data preparation methods – Outliers
An outlier is a data value that is an abnormal distance from the other data
values in the data set. Outliers can be visually identified by constructing
histograms or box plots and looking for values that are too high or too low.
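Alongside visual inspection, the rule that box plots conventionally use to draw their whiskers can be applied numerically. The sketch below flags values more than 1.5 times the interquartile range beyond the quartiles. The 1.5 multiplier is the usual convention rather than something stated in the lecture, and the sample values are made up.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample with one value far from the rest.
sample = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
flagged = iqr_outliers(sample)
print(flagged)
```

A flagged value is only a candidate outlier. Whether it is an error or a genuine extreme observation still needs to be checked against the data source.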
There are five common methods to manage the outliers:
Data preparation methods – Missing values
Data preparation methods – other methods
In predictive modelling, depending on the type of model being used,
missing values may result in analysis problems. There are two strategies for
dealing with missing values: deletion (listwise or column) and
imputation. Listwise deletion involves deleting the row (or record) from the
data set.
If there are just a few missing values, this may be an appropriate approach.
However, a smaller data set can weaken the predictive power of the model. Column
deletion removes any variable that contains missing values.
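The two deletion strategies can be sketched as follows, assuming records stored as Python dicts with None marking a missing value. The data and the function names are hypothetical.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "gender": "F"},
    {"age": None, "income": 61000, "gender": "M"},
    {"age": 45, "income": 58000, "gender": "F"},
    {"age": 29, "income": 48000, "gender": "M"},
]

def listwise_delete(records):
    """Drop every row (record) that contains at least one missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def column_delete(records):
    """Drop every variable that has a missing value in any row."""
    incomplete = {n for r in records for n, v in r.items() if v is None}
    return [{k: v for k, v in r.items() if k not in incomplete} for r in records]

complete_rows = listwise_delete(rows)  # loses one record, keeps all variables
complete_cols = column_delete(rows)    # keeps all records, loses "age"
print(len(complete_rows), len(complete_cols), list(complete_cols[0]))
```

Here a single missing age costs either one record or the whole age variable, which is the trade-off to weigh before choosing deletion over imputation.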
Data preparation methods – other methods
• Replace the missing values with another constant value. Typically, for a
numeric variable, 0 is the constant value entered. However, this can be
problematic: replacing a missing age with 0 does not make sense, and
replacing a missing value of a categorical variable such as gender with a
constant such as F is equally hard to justify. Constant replacement works
best when the data are missing completely at random (MCAR).
Data preparation methods – other methods
Replace missing numeric values with the mean (average) or median
(middle value in the variable).
Replacing missing values with the mean of the variable is a common and
easy method, and it is unlikely to impair the model’s predictability, as on
average values should approach the mean.
However, the mean is sensitive to extreme values. If the variable is skewed
or contains outliers, replacing with the median may be a better approach.
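The difference between the two replacements shows up clearly when the variable contains an extreme value. This is a minimal sketch with made-up income figures; the function name and the use of None for missing values are assumptions.

```python
import statistics

# Hypothetical incomes with two missing values and one extreme value.
incomes = [48000, 52000, 50000, None, 51000, 400000, None]

def impute(values, strategy="mean"):
    """Replace None with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

mean_filled = impute(incomes, "mean")
median_filled = impute(incomes, "median")
print(mean_filled[3], median_filled[3])
```

With these numbers the mean replacement is 120200, pulled upward by the single extreme income, while the median replacement is 51000 and stays close to the typical values.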
Data preparation methods – other methods
• Replace categorical values with the mode (the most frequent value) as
there is no mean or median. Numeric missing values could also be
replaced by the mode.
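Mode replacement for a categorical variable might look like the sketch below. The gender values are made up, and None is assumed to mark a missing value.

```python
from collections import Counter

def impute_mode(values):
    """Replace None with the most frequent observed value (the mode)."""
    observed = [v for v in values if v is not None]
    mode_value, _ = Counter(observed).most_common(1)[0]
    return [mode_value if v is None else v for v in values]

# Hypothetical categorical variable with two missing values.
genders = ["F", "M", None, "F", "F", None, "M"]
filled = impute_mode(genders)
print(filled)
```

If two categories tie for most frequent, `Counter.most_common` returns the one seen first, so in that case the choice of replacement is arbitrary.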
Part 4
Data Partitioning considerations
Data Sets and Partitioning
For example, in regression analysis, the training set is used to fit the
linear regression model. In neural networks, the training set is used to
obtain the model’s network weights.
Data Sets and Partitioning
After fitting the model on the training partition, the performance of the
model is tested on the validation partition. The best-fitting model is
most often chosen based on its accuracy with the validation data set.
After selecting the best-fit model, it is a good idea to check the
performance of the model against the test partition which was not used
in either training or in validation. This is often the case when dealing
with big data.
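A three-way partition can be sketched as below. The 60/20/20 split, the fixed seed and the function name are assumptions for illustration; the lecture does not prescribe particular proportions.

```python
import random

def partition(records, train=0.6, validation=0.2, seed=42):
    """Shuffle the records and split them into training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # the remainder becomes the test set
    )

data = list(range(100))  # stand-in for 100 observations
train_set, validation_set, test_set = partition(data)
print(len(train_set), len(validation_set), len(test_set))
```

Every observation lands in exactly one partition, so the test set stays unseen during both fitting and model selection.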
Data Sets and Partitioning – overfitting and underfitting issues
Underfitting, where the model is too simple, can be caused by a training set
that is too small. When the model is too complex or overfit, it can be
influenced by random noise. This can be caused by a training set that is too
large. Often analysts will partition the data set early in the data
preparation process.
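One way to see both issues is to compare a model's error on the training partition with its error on the validation partition. The sketch below uses two deliberately extreme stand-in models, which are assumptions for illustration and not the lecture's models: one that ignores the input entirely (too simple) and one that memorises the training set (too complex). The data are simulated from a hypothetical relationship y = 2x plus noise.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Simulate n noisy observations of the hypothetical relationship y = 2x."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train, validation = make_data(30), make_data(30)

# Too simple: always predicts the mean of y in the training partition.
mean_y = sum(y for _, y in train) / len(train)
underfit = lambda x: mean_y

# Too complex: returns the y of the nearest training point (memorisation).
overfit = lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

for name, model in [("underfit", underfit), ("overfit", overfit)]:
    print(name, round(mse(model, train), 2), round(mse(model, validation), 2))
```

The memorising model has zero error on the training partition because it has also fitted the noise, and that perfect score does not carry over to the validation partition. The too-simple model has a large error on the training data itself.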
Thank You for your attention.
Q&A