Improve Model Accuracy with Data Pre-Processing


by Jason Brownlee on August 15, 2020 in Data Preparation


Data preparation can make or break the predictive ability of your model.

In Chapter 3 of their book Applied Predictive Modeling, Kuhn and Johnson introduce the process of data
preparation. They refer to it as the addition, deletion or transformation of training set data.

In this post, you will discover the data pre-processing steps that you can use to improve the predictive ability of your
models.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials
and the Python source code files for all examples.

Let’s get started.

Photo: "I Love Spreadsheets" by Craig Chew-Moulding, some rights reserved

Data Preparation
You must pre-process your raw data before you model your problem. The specific preparation may depend on the
data that you have available and the machine learning algorithms you want to use.

Sometimes, pre-processing of data can lead to unexpected improvements in model accuracy. This may be because
a relationship in the data has been simplified or made plain.
Data preparation is an important step and you should experiment with data pre-processing steps that are
appropriate for your data to see if you can get that desirable boost in model accuracy.

There are three types of pre-processing you can consider for your data:

Add attributes to your data
Remove attributes from your data
Transform attributes in your data

We will dive into each of these three types of pre-processing and review some specific examples of operations that
you can perform.


Add Data Attributes


Advanced models can extract relationships from complex attributes, although some models require those
relationships to be spelled out plainly. Deriving new attributes from your training data to include in the modeling
process can give you a boost in model performance. A short code sketch follows the list below.

Dummy Attributes: Categorical attributes can be converted into n binary attributes, where n is the number of
categories (or levels) that the attribute has. These denormalized or decomposed attributes are known as
dummy attributes or dummy variables.
Transformed Attribute: A transformed variation of an attribute can be added to the dataset in order to allow a
linear method to exploit possible linear and non-linear relationships between attributes. Simple transforms like
log, square and square root can be used.
Missing Data: Attributes with missing data can have that missing data imputed using a reliable method, such
as k-nearest neighbors.
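
A minimal sketch of these three operations in Python, assuming pandas and scikit-learn are available; the dataset and column names are hypothetical:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical dataset: one categorical attribute, two numeric attributes,
# one of which has a missing value
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "width": [2.0, 4.0, 3.0, 9.0],
    "height": [1.0, 4.0, np.nan, 8.0],
})

# Missing data: impute the gap in "height" with k-nearest neighbors
# over the numeric attributes
imputer = KNNImputer(n_neighbors=2)
df[["width", "height"]] = imputer.fit_transform(df[["width", "height"]])

# Dummy attributes: one binary column per level of "color"
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)

# Transformed attribute: add a log copy so a linear method can
# exploit a possible non-linear relationship
df["width_log"] = np.log1p(df["width"])

print(df)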

Remove Data Attributes


Some methods perform poorly with redundant or duplicate attributes. You can get a boost in model accuracy by
removing attributes from your data. A code sketch follows the list below.

Projection: Training data can be projected into lower dimensional spaces while still characterizing the inherent
relationships in the data. A popular approach is Principal Component Analysis (PCA), where the principal
components found by the method can be taken as a reduced set of input attributes.
Spatial Sign: A spatial sign projection of the data will transform data onto the surface of a multidimensional
sphere. The results can be used to highlight the existence of outliers that can be modified or removed from the
data.
Correlated Attributes: Some algorithms degrade in performance when highly correlated attributes are present.
Pairs of attributes with high correlation can be identified, and the most correlated attributes can be
removed from the data.
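
A hedged sketch of the projection and correlated-attribute ideas with scikit-learn and pandas; the synthetic data and the 0.9 correlation cutoff are illustrative assumptions, not recommendations:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
X["e"] = X["a"] * 0.95 + rng.normal(scale=0.05, size=100)  # near-duplicate of "a"

# Projection: keep enough principal components to explain 95% of the variance
reduced = PCA(n_components=0.95).fit_transform(X)
print("PCA shape:", reduced.shape)

# Correlated attributes: drop one attribute from each highly correlated pair
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_pruned = X.drop(columns=to_drop)
print("Dropped:", to_drop)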

Transform Data Attributes


Transformations of training data can reduce the skewness of the data as well as the prominence of outliers.
Many models expect data to be transformed before you can apply the algorithm. A code sketch follows the list below.

Centering: Transform the data so that it has a mean of zero and a standard deviation of one (strictly, centering
subtracts the mean; dividing by the standard deviation is a scaling step). The combination is typically called data standardization.
Scaling: A standard scaling transformation is to map the data from the original scale to a scale between zero
and one. This is typically called data normalization.
Remove Skew: Skewed data has a distribution that is pushed to one side or the other (toward larger or
smaller values) rather than being normally distributed. Some methods assume normally distributed data and
can perform better if the skew is removed. Try replacing the attribute with the log, square root or inverse of the
values.
Box-Cox: A Box-Cox transform or family of transforms can be used to reliably adjust data to remove skew.
Binning: Numeric data can be made discrete by grouping values into bins. This is typically called data
discretization. This process can be performed manually, although it is more reliable if performed systematically
and automatically using a heuristic that makes sense in the domain.
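
A minimal scikit-learn sketch of these transforms on synthetic skewed data; note that the Box-Cox transform requires strictly positive values:

import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                   PowerTransformer, KBinsDiscretizer)

rng = np.random.default_rng(1)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 1))  # skewed, strictly positive

X_std = StandardScaler().fit_transform(X)       # centering + scaling: mean 0, std 1 (standardization)
X_norm = MinMaxScaler().fit_transform(X)        # scaling: map to [0, 1] (normalization)
X_unskewed = PowerTransformer(method="box-cox").fit_transform(X)  # remove skew
X_binned = KBinsDiscretizer(n_bins=5, encode="ordinal",
                            strategy="quantile").fit_transform(X)  # discretization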

Summary
Data pre-processing is an important step that may be required to prepare raw data for modeling, to meet the
expectations of a specific machine learning algorithm, and it can give unexpected boosts in model accuracy.

In this post we discovered three groups of data pre-processing methods:

Adding Attributes
Removing Attributes
Transforming Attributes

The next time you are looking for a boost in model accuracy, consider what new perspectives you can engineer on
your data for your models to explore and exploit.



28 Responses to Improve Model Accuracy with Data Pre-Processing

Juhyoung Lee July 19, 2016 at 3:54 pm #

Can you explain more detail about concept of attribute and dummy attribute?

REPLY 
Jason Brownlee July 20, 2016 at 5:17 am #

Sure Juhyoung Lee,

You can take a categorical attribute like "color" with the values "red" and "blue" and turn it into two binary attributes
has_red and has_blue.

These new binary variables are dummy variables.

You can learn more here:


https://en.wikipedia.org/wiki/Dummy_variable_(statistics)
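
For instance, that example is a one-liner with pandas (a hedged illustration; any encoding utility would do):

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
print(pd.get_dummies(df["color"], prefix="has"))  # columns has_blue and has_red are the dummy variables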

kay December 12, 2016 at 8:23 am #

It was a great article, although I had a question. Suppose there is a data set consisting of means, modes, mins,
maxes, etc. How can we represent all those values on a common scale, or generalize the values? For example, let's say
the mean of heights in a group of people is x, the mode is y, and the min value is z, and there is a group 2 with the same data.
Can the values be represented on a common scale?

Jason Brownlee December 13, 2016 at 8:04 am #

Hi Kay,

You can scale each column (data type or feature) separately to a range of 0-1.

You can use the formula:

y = (x - min) / (max - min)

Where x is a given value and min and max are the limits of values on the column.

I hope that helps.
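
As a quick worked illustration of that formula on a made-up column of values:

values = [2.0, 5.0, 11.0]
lo, hi = min(values), max(values)
scaled = [(x - lo) / (hi - lo) for x in values]
print(scaled)  # [0.0, 0.333..., 1.0]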

Jan July 20, 2017 at 8:25 pm #

Thanks for the article, Jason!


If I have some normally distributed features and some skewed features, can I just transform the skewed data and
leave the normally distributed data untouched? Can I, for example, log transform some features and leave others?

Regards!

Jason Brownlee July 21, 2017 at 9:33 am #

Absolutely.

Efatathios Chatzikyriakidis April 20, 2020 at 7:28 am #

Hi!

You can also add to the list aggregation features, or statistical features in general (5-number summaries),
and outlier removal.

Jason Brownlee April 20, 2020 at 7:37 am #

Great tip.

Shabnam December 1, 2017 at 6:26 am #

About “Correlated Attributes” that you mentioned in this post:


I was wondering if you have any post on using it (for example in sklearn), so I can read and understand more.

Jason Brownlee December 1, 2017 at 7:44 am #

I may have an example in the R book. Perhaps search the blog?

Shabnam December 1, 2017 at 6:31 am #

I have a question about scaling. If we have a binary classification (0,1), is it better to keep it as it is or
change it to (-1,1), for example? Or does it depend on the data?

When I read posts about machine learning, I am not sure if the notes are always true or not.
I noticed that in many cases, the response depends on the type of data. How can we say if a note/rule/point is
dependent on a data type?

Jason Brownlee December 1, 2017 at 7:44 am #

Really depends on the data and the algorithms being used.


Andrea Grandi December 11, 2017 at 3:58 am #

When you say "Correlated Attributes: Some algorithms degrade in performance when highly correlated
attributes are present. Pairs of attributes with high correlation can be identified, and the most correlated attributes can
be removed from the data."

I was totally convinced of the opposite :O

I mean: I usually remove those attributes which are not correlated and keep those highly correlated… this can
explain why I couldn’t optimise my models too much.

Do you have more documentation about this particular subject?

Thanks!

Jason Brownlee December 11, 2017 at 5:33 am #

Multiple correlated input features do mess up most models, at worst resulting in worse skill and, at best,
creating a lot of redundancy in the model.

Formally, the problem is often referred to as multicollinearity:


https://en.wikipedia.org/wiki/Multicollinearity
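
One common way to quantify multicollinearity is the variance inflation factor (VIF); here is a small sketch with statsmodels on synthetic data (illustrative only):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
X["x4"] = X["x1"] + rng.normal(scale=0.01, size=100)  # nearly collinear with x1

# A VIF well above roughly 5-10 is a common red flag for multicollinearity
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
print(vifs)  # x1 and x4 should show very large values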

Mohammad Ehtasham Billah February 5, 2018 at 10:12 am #

Among all of these steps, which are the most important ones? E.g., if I apply PCA for dimensionality reduction
and create just two new features, can I expect that other problems in the data (e.g. outliers, multicollinearity, skewed
distributions) will no longer exist?

For the transformed attribute in the Add Data Attributes section, how can I choose which transformation is best among log, square
and square root? Can I apply each of those transformations and keep all of them in the dataset? I feel like it may
cause redundancy and multicollinearity.

Thank you for your great posts!!

Jason Brownlee February 5, 2018 at 2:52 pm #

The framing of the problem offers the biggest leverage.

PCA with outliers might cause problems, better to remove them first.

Try maintaining multiple "views" of the data and try models on each to explore the best combination of
view/model/config.

Mohammad Ehtasham Billah February 8, 2018 at 6:18 am #

Hi Jason,
For pairwise attributes with high correlation, what is the accepted level? The correlation can be -1 to +1.
Jason Brownlee February 8, 2018 at 8:32 am #

Perhaps below 0.5. I’d recommend testing and see what works best for your data and
algorithms.

maunish September 19, 2019 at 2:04 am #

Hi Jason,

I have participated in a Kaggle competition in which I have to classify forest cover type.
I tried everything: stacking various models, feature engineering, feature extraction.
But my model accuracy is not increasing above 80%.

I also found that two of the cover types are really hard to separate, so I tried to build a model to separate those two cover
types.

Nothing is working, so I am frustrated. It feels like I am missing something that others know.

It would be really helpful if you could give an insight. I have been working on this for two weeks and am very frustrated now.

And sorry for asking such a stupid and long question.

Jason Brownlee September 19, 2019 at 6:04 am #

I have some suggestions here that may help:


https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/

And here:
https://machinelearningmastery.com/start-here/#better

rahul June 8, 2020 at 8:34 am #

Hi Jason,
I have a .csv file with 10 columns and roughly 6000 rows. My data is represented in the form of only 0 and 1. Each
row represents a timeframe of a video.
Let's say I want to bring the number of rows down from 6000 to 1000 without losing information. What method
is reliable in my case? And how can it be done?

Jason Brownlee June 8, 2020 at 1:19 pm #

Without losing information? Not sure I can help sorry.

Priya April 30, 2021 at 7:31 pm #

Hi Sir,
As kNN is a distance-based algorithm, data normalization may have an impact on it. But is it possible
that normalization negatively impacts model accuracy with kNN?

(I am getting RMSE=50 without normalization and RMSE=70 with normalization in the kNN algorithm.)
Is that possible, or am I making some logical mistake?

Jason Brownlee May 1, 2021 at 6:04 am #

It is possible that data scaling does not help your model.
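
One way to check on your own data is to compare cross-validated error with and without scaling inside a pipeline; a sketch on synthetic data (an illustration under assumptions, not a verdict on any particular dataset):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=3)

models = {
    "raw": KNeighborsRegressor(),
    "scaled": make_pipeline(MinMaxScaler(), KNeighborsRegressor()),
}
for name, model in models.items():
    # Report cross-validated RMSE for each variant
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(name, round(rmse, 2))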

Priya May 1, 2021 at 2:40 pm #

thanks for your reply

Jason Brownlee May 2, 2021 at 5:27 am #

You’re welcome.

John Rustam August 13, 2021 at 2:09 am #

I like very much what you write, it's very clear. Would you mind sending me a PDF version, please?

[email protected]

Adrian Tam August 13, 2021 at 5:34 am #

I believe you can print the web page into PDF using your browser.

