Improve Model Accuracy With Data Pre-Processing
Data preparation can make or break the predictive ability of your model.
In Chapter 3 of their book Applied Predictive Modeling, Kuhn and Johnson introduce the process of data
preparation. They refer to it as the addition, deletion or transformation of training set data.
In this post you will discover the data pre-processing steps that you can use to improve the predictive ability of your models.
Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials
and the Python source code files for all examples.
I Love Spreadsheets. Photo by Craig Chew-Moulding, some rights reserved.
Data Preparation
You must pre-process your raw data before you model your problem. The specific preparation may depend on the
data that you have available and the machine learning algorithms you want to use.
Sometimes, pre-processing of data can lead to unexpected improvements in model accuracy. This may be because
a relationship in the data has been simplified or unobscured.
Data preparation is an important step and you should experiment with data pre-processing steps that are
appropriate for your data to see if you can get that desirable boost in model accuracy.
There are three types of pre-processing you can consider for your data:
Add Data Attributes
Remove Data Attributes
Transform Data Attributes
We will dive into each of these three types of pre-processing and review some specific examples of operations that you can perform.
Add Data Attributes
Dummy Attributes: Categorical attributes can be converted into n binary attributes, where n is the number of categories (or levels) that the attribute has. These denormalized or decomposed attributes are known as dummy attributes or dummy variables (a code sketch covering the operations in this group follows the list).
Transformed Attribute: A transformed variation of an attribute can be added to the dataset in order to allow a
linear method to exploit possible linear and non-linear relationships between attributes. Simple transforms like
log, square and square root can be used.
Missing Data: Attributes with missing data can have that missing data imputed using a reliable method, such
as k-nearest neighbors.
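Since the post promises Python examples, here is a minimal sketch of the three "add" operations using pandas and scikit-learn. The dataset and column names (color, income, age) are made up for illustration, and KNNImputer is one reasonable choice for k-nearest neighbor imputation.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "income": [40000.0, 52000.0, 61000.0, 33000.0],
    "age": [23.0, np.nan, 45.0, 31.0],  # one missing value to impute
})

# Dummy attributes: one binary column per category level of "color".
df = pd.get_dummies(df, columns=["color"])

# Transformed attribute: add a log variation so a linear method can
# exploit a possible non-linear relationship.
df["log_income"] = np.log(df["income"])

# Missing data: fill the missing age from the k nearest neighbors.
imputer = KNNImputer(n_neighbors=2)
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df)

Each operation adds or fills attributes without discarding the originals, so you can compare models trained with and without them.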
Remove Data Attributes
Projection: Training data can be projected into lower dimensional spaces that still characterize the inherent relationships in the data. A popular approach is Principal Component Analysis (PCA), where the principal components found by the method can be taken as a reduced set of input attributes (a code sketch for this group follows the list).
Spatial Sign: A spatial sign projection of the data will transform data onto the surface of a multidimensional
sphere. The results can be used to highlight the existence of outliers that can be modified or removed from the
data.
Correlated Attributes: Some algorithms degrade in importance with the existence of highly correlated
attributes. Pairwise attributes with high correlation can be identified and the most correlated attributes can be
removed from the data.
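A minimal sketch of the three "remove" operations on a synthetic numeric dataset. The 0.9 correlation threshold and the two PCA components are illustrative assumptions, and the spatial sign is approximated here by centering and scaling, then projecting each row onto the unit sphere.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"x{i}" for i in range(5)])
X["x5"] = 0.95 * X["x0"] + rng.normal(scale=0.05, size=100)  # near-duplicate attribute

# Correlated attributes: identify highly correlated pairs and drop one of each.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Projection: take the principal components as a reduced set of input attributes.
X_pca = PCA(n_components=2).fit_transform(X_reduced)

# Spatial sign: center and scale, then project each row onto the unit sphere;
# points that were far from the bulk are candidates to modify or remove.
Z = (X_reduced - X_reduced.mean()) / X_reduced.std()
spatial_sign = Z.div(np.linalg.norm(Z, axis=1), axis=0)
print(to_drop, X_pca.shape, spatial_sign.shape)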
Transform Data Attributes
Centering: Transform the data so that it has a mean of zero and a standard deviation of one. This is typically called data standardization (strictly, centering shifts the mean to zero; scaling to unit variance completes the standardization). A code sketch for this group follows the list.
Scaling: A standard scaling transformation is to map the data from the original scale to a scale between zero
and one. This is typically called data normalization.
Remove Skew: Skewed data is data that has a distribution that is pushed to one side or the other (larger or
smaller values) rather than being normally distributed. Some methods assume normally distributed data and
can perform better if the skew is removed. Try replacing the attribute with the log, square root or inverse of the
values.
Box-Cox: A Box-Cox transform or family of transforms can be used to reliably adjust data to remove skew.
Binning: Numeric data can be made discrete by grouping values into bins. This is typically called data discretization. This process can be performed manually, although it is more reliable if performed systematically and automatically using a heuristic that makes sense in the domain.
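A minimal sketch of the transform operations with scikit-learn. The right-skewed exponential data is a made-up example; note that Box-Cox requires strictly positive values.

import numpy as np
from sklearn.preprocessing import (KBinsDiscretizer, MinMaxScaler,
                                   PowerTransformer, StandardScaler)

rng = np.random.default_rng(7)
X = rng.exponential(scale=2.0, size=(100, 1))  # right-skewed, strictly positive

# Centering/standardization: mean of zero, standard deviation of one.
X_std = StandardScaler().fit_transform(X)

# Scaling/normalization: map onto the range [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

# Remove skew: a simple log transform, or the Box-Cox family of power
# transforms, which selects its exponent automatically.
X_log = np.log(X)
X_boxcox = PowerTransformer(method="box-cox").fit_transform(X)

# Binning: discretize the numeric values into 5 ordinal bins.
X_bins = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile").fit_transform(X)
print(float(X_std.mean()), float(X_norm.min()), float(X_norm.max()))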
Summary
Data pre-processing is an important step that may be required to prepare raw data for modeling, to meet the expectations of specific machine learning algorithms, and it can give unexpected boosts in model accuracy.
In this post you discovered three groups of data pre-processing techniques:
Adding Attributes
Removing Attributes
Transforming Attributes
The next time you are looking for a boost in model accuracy, consider what new perspectives you can engineer on
your data for your models to explore and exploit.
Juhyoung Lee July 19, 2016 at 3:54 pm #
Can you explain the concepts of an attribute and a dummy attribute in more detail?
Jason Brownlee July 20, 2016 at 5:17 am #
You can take a categorical attribute like “color” with the values “red” and “blue” and turn it into two binary attributes, has_red and has_blue.
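A minimal illustration of that reply in pandas (pd.get_dummies names the new columns color_red and color_blue, playing the role of has_red and has_blue):

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
# One binary attribute per category level.
print(pd.get_dummies(df, columns=["color"]))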
kay December 12, 2016 at 8:23 am #
It was a great article, although I had a question: suppose there is a dataset consisting of means, modes, mins, maxes, etc. How can we represent all those values on a common scale, or generalize the values? For example, let's say the mean of heights in a group of people is x, the mode is y, and the min value is z, and there is a group 2 with the same data; can the values be represented on a common scale?
Jason Brownlee December 13, 2016 at 8:04 am #
Hi Kay,
You can scale each column (data type or feature) separately to a range of 0-1 using:
value = (x - min) / (max - min)
where x is a given value and min and max are the limits of values on the column.
Jan July 20, 2017 at 8:25 pm #
Regards!
Jason Brownlee July 21, 2017 at 9:33 am #
Absolutely.
Efatathios Chatzikyriakidis April 20, 2020 at 7:28 am #
Hi!
You can also add to the list aggregation features, or statistical features in general (5-number summaries), and outlier removal.
Jason Brownlee April 20, 2020 at 7:37 am #
Great tip.
Shabnam December 1, 2017 at 6:31 am #
I have a question about scaling. If we have a binary classification (0,1), is it better to keep it as it is or to change it to (-1,1), for example? Or does it depend on the data?
When I read posts about machine learning, I am not always sure if the notes are true or not. I noticed that in many cases, the response depends on the type of data. How can we say if a note/rule/point is dependent on the data type?
When you say “Correlated Attributes: Some algorithms degrade in importance with the existence of highly
correlated attributes. Pairwise attributes with high correlation can be identified and the most correlated attributes can
be removed from the data.”
I mean: I usually remove those attributes which are not correlated and keep those that are highly correlated… this can explain why I couldn't optimise my models much.
Thanks!
Jason Brownlee December 11, 2017 at 5:33 am #
Multiple correlated input features do mess up most models, at worst resulting in worse skill, at best creating a lot of redundancy in the model.
Mohammad Ehtasham Billah February 5, 2018 at 10:12 am #
Among all of these steps, which are the most important ones? E.g., if I apply PCA for dimensionality reduction and create just two new features, can I expect that the other problems in the data (e.g. outliers, multicollinearity, skewed distributions) will no longer exist?
For the transformed attribute in the add data attributes section, how can I choose which transformation is best among log, square and square root? Can I apply each of those transformations and keep all of them in the dataset? I feel like it may cause redundancy and multicollinearity.
Jason Brownlee February 5, 2018 at 2:52 pm #
PCA with outliers might cause problems; it is better to remove them first.
Try maintaining multiple “views” of the data and try models on each to explore the best combination of view/model/config.
Mohammad Ehtasham Billah February 8, 2018 at 6:18 am #
Hi Jason,
For pairwise attributes with high correlation, what is the accepted level? The correlation can be -1 to +1.
Jason Brownlee February 8, 2018 at 8:32 am #
Perhaps below 0.5. I'd recommend testing to see what works best for your data and algorithms.
maunish September 19, 2019 at 2:04 am #
Hi Jason,
I have participated in a Kaggle competition in which I have to classify forest cover type.
I tried everything: stacking various models, feature engineering, feature extraction, but my model accuracy is not increasing above 80%.
I also found that 2 of the cover types are really hard to separate, so I tried to build a model to separate those 2 cover types.
Nothing is working, so I am frustrated; it feels like I am missing something that others know.
It would be really helpful if you could give an insight. I have been working on this for 2 weeks and am very frustrated now.
And sorry for asking such a stupid and long question.
Jason Brownlee September 19, 2019 at 6:04 am #
And here:
https://machinelearningmastery.com/start-here/#better
rahul June 8, 2020 at 8:34 am #
Hi Jason,
I have a .csv file with 10 columns and roughly 6000 rows. My data is represented in the form of only 0s and 1s. Each row represents a timeframe of a video.
Let's say I want to bring the number of rows down from 6000 to 1000 without losing information. What method is reliable in my case? And how can it be done?
Priya April 30, 2021 at 7:31 pm #
Hi Sir,
As kNN is a distance-based algorithm, data normalization may have an impact on it. But is it possible that normalization negatively impacts model accuracy with kNN?
(I am getting RMSE=50 without normalization and RMSE=70 with normalization with the kNN algorithm.)
Is that possible, or am I making some logical mistake?
Jason Brownlee May 2, 2021 at 5:27 am #
You’re welcome.
John Rustam August 13, 2021 at 2:09 am #
I like very much what you write, it's very clear. Would you mind sending me a PDF version please?
Adrian Tam August 13, 2021 at 5:34 am #
I believe you can print the web page into PDF using your browser.