How to Prepare Data For Machine Learning


by Jason Brownlee on August 16, 2020 in Data Preparation


Machine learning algorithms learn from data. It is critical that you feed them the right data for the problem you want
to solve.

Even if you have good data, you need to make sure that it is at a useful scale, in a useful format, and that meaningful
features are included.

In this post you will learn how to prepare data for a machine learning algorithm. This is a big topic, so we will
cover the essentials.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials
and the Python source code files for all examples.

Let’s get started.


Lots of Data
Photo attributed to cibomahto, some rights reserved

Data Preparation Process


The more disciplined you are in your handling of data, the more consistent and better the results you are likely to
achieve. The process for getting data ready for a machine learning algorithm can be summarized in three steps:

Step 1: Select Data


Step 2: Preprocess Data
Step 3: Transform Data

You can follow this process in a linear manner, but it is very likely to be iterative with many loops.


Step 1: Select Data


This step is concerned with selecting the subset of all available data that you will be working with. There is always a
strong desire to include all available data, on the assumption that the maxim “more is better” will hold. This may or may not
be true.

You need to consider what data you actually need to address the question or problem you are working on. Make
some assumptions about the data you require and be careful to record those assumptions so that you can test
them later if needed.

Below are some questions to help you think through this process:

What is the extent of the data you have available? For example, through time, across database tables, and from connected
systems. Ensure you have a clear picture of everything that you can use.
What data is not available that you wish you had? For example, data that is not recorded or cannot be
recorded. You may be able to derive or simulate this data.
What data don’t you need to address the problem? Excluding data is almost always easier than including data.
Note down which data you excluded and why.

It is only in small problems, like competition or toy datasets, that the data has already been selected for you.
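
For example, with tabular data in Python, selection can be as simple as keeping the columns you believe are relevant and writing down why the rest were excluded. A minimal pandas sketch (the data and column names here are hypothetical):

import pandas as pd

# Stand-in for all available data (hypothetical columns).
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "email": ["a@x.com", "b@x.com", "c@x.com"],
    "age": [34, 51, 29],
    "income": [52000, 87000, 43000],
    "churned": [0, 1, 0],
})

# Keep only the columns assumed relevant to the problem.
selected = df[["age", "income", "churned"]]

# Record what was excluded and why, so the assumptions can be tested later.
excluded = {
    "customer_id": "identifier only, no predictive value",
    "email": "sensitive, not needed for the question",
}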

Step 2: Preprocess Data


After you have selected the data, you need to consider how you are going to use it. This preprocessing step
is about getting the selected data into a form that you can work with.

Three common data preprocessing steps are formatting, cleaning and sampling:

Formatting: The data you have selected may not be in a format that is suitable for you to work with. The data
may be in a relational database and you would like it in a flat file, or the data may be in a proprietary file format
and you would like it in a relational database or a text file.
Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances that are
incomplete and do not carry the data you believe you need to address the problem. These instances may need
to be removed. Additionally, there may be sensitive information in some of the attributes and these attributes
may need to be anonymized or removed from the data entirely.
Sampling: There may be far more selected data available than you need to work with. More data can result in
much longer running times for algorithms and larger computational and memory requirements. You can take a
smaller representative sample of the selected data that may be much faster for exploring and prototyping
solutions before considering the whole dataset.

It is very likely that the machine learning tools you use on the data will influence the preprocessing you will be
required to perform. You will likely revisit this step.
So much data
Photo attributed to Marc_Smith, some rights reserved
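
As a rough illustration of the cleaning and sampling steps described above, here is a minimal pandas sketch (the data and column names are hypothetical):

import pandas as pd

# Stand-in for the selected data (hypothetical columns).
df = pd.DataFrame({
    "age": [34, None, 29, 51],
    "income": [52000, 87000, None, 61000],
    "email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com"],
})

# Cleaning: remove a sensitive attribute entirely and drop incomplete instances.
df = df.drop(columns=["email"]).dropna(subset=["age", "income"])

# Sampling: work with a smaller representative sample for fast prototyping.
sample = df.sample(frac=0.5, random_state=1)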

Step 3: Transform Data


The final step is to transform the preprocessed data. The specific algorithm you are working with and your knowledge of
the problem domain will influence this step and you will very likely have to revisit different transformations of your
preprocessed data as you work on your problem.

Three common data transformations are scaling, attribute decompositions and attribute aggregations. This step is
also referred to as feature engineering.

Scaling: The preprocessed data may contain attributes with a mixture of scales for various quantities such as
dollars, kilograms and sales volume. Many machine learning methods prefer data attributes to have the same
scale, such as between 0 and 1 for the smallest and largest value of a given feature. Consider any feature
scaling you may need to perform.
Decomposition: There may be features that represent a complex concept that may be more useful to a
machine learning method when split into their constituent parts. An example is a date that may have day and
time components that in turn could be split out further. Perhaps only the hour of day is relevant to the problem
being solved. Consider what feature decompositions you can perform.
Aggregation: There may be features that can be aggregated into a single feature that would be more
meaningful to the problem you are trying to solve. For example, there may be data instances for each time a
customer logged into a system that could be aggregated into a count of the number of logins, allowing the
additional instances to be discarded. Consider what feature aggregations you could perform.

You can spend a lot of time engineering features from your data and it can be very beneficial to the performance of
an algorithm. Start small and build on the skills you learn.
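
As a minimal sketch of these three transforms (assuming pandas and scikit-learn; the data and column names are hypothetical):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in preprocessed data (hypothetical columns).
df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "login_time": pd.to_datetime(["2020-01-05 09:30", "2020-01-06 14:00", "2020-01-05 10:15"]),
    "dollars": [10.0, 250.0, 40.0],
    "kilograms": [1.5, 80.0, 12.0],
})

# Scaling: rescale numeric attributes to the range [0, 1].
df[["dollars", "kilograms"]] = MinMaxScaler().fit_transform(df[["dollars", "kilograms"]])

# Decomposition: split a timestamp into parts; perhaps only the hour matters.
df["hour"] = df["login_time"].dt.hour

# Aggregation: collapse per-login instances into a login count per customer.
logins = df.groupby("customer_id").size().rename("num_logins").reset_index()

Keeping the fitted scaler around matters: the same transform will need to be applied to any new data later.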

Summary
In this post you learned the essence of data preparation for machine learning. You discovered a three step
framework for data preparation and tactics in each step:

Step 1: Data Selection Consider what data is available, what data is missing and what data can be removed.
Step 2: Data Preprocessing Organize your selected data by formatting, cleaning and sampling from it.
Step 3: Data Transformation Transform preprocessed data ready for machine learning by engineering
features using scaling, attribute decomposition and attribute aggregation.

Data preparation is a large subject that can involve a lot of iterations, exploration and analysis. Getting good at data
preparation will make you a master at machine learning. For now, just consider the questions raised in this post
when preparing data and always be looking for clearer ways of representing the problem you are trying to solve.

Resources
If you are looking to dive deeper into this subject, you can learn more in the resources below.

From Data Mining to Knowledge Discovery in Databases, 1996


Data Analysis with Open Source Tools, Part 1
Machine Learning for Hackers, Chapter 2: Data Exploration
Data Mining: Practical Machine Learning Tools and Techniques, Chapter 7: Transformations: Engineering the
input and output
Do you have some data preparation tips and tricks? Please leave a comment and share your experiences.



165 Responses to How to Prepare Data For Machine Learning

Fraser March 31, 2014 at 4:30 am #

I enjoyed your concise overview, Jason.

Perhaps you can delve a little into the dangers/opportunities in your Step 2: Cleaning stage.

It has been my experience that those data you may want to remove contain the more interesting data to the client
(perhaps only after the requested client questions are addressed).

Fraser

jasonb March 31, 2014 at 5:36 am #

Hi Fraser, good question.


Indeed, it can be difficult to know if data is bad, and you may not always have a domain expert at hand to comment.
Sometimes it is obvious though, like values of 0 that are impossible in the domain, such as a blood pressure of 0. I’ve also
seen -999 used to signal “not provided”. In these cases we can mark attributes as missing and think about
possible rules for imputing if we so desire.
Where do you draw the line though? Should severe outliers be marked as missing? Sometimes. I like to try a lot
of stuff: for example, I would try removing instances with large outliers in one dimension and see what that did to
my models; I’d also try removing instances with missing values and try models on variations of the data with
imputed values. Almost always, modeling ground truth is not the goal; there are performance metrics like
classification accuracy or AUC that are being optimized.
You’re right though, sometimes the broken data can represent something very interesting – anomalies that signal
something useful in and of themselves in the domain.

Fraser March 31, 2014 at 6:05 am #

Yes, indeed. Is it an outlier, or a poorly encoded result, or a result with atypical calibration, or does it
represent a distinct and real combination of natural conditions …

I work a lot with chemical concentration data in water and sediment and I run into censored data routinely. Mostly
of the type <.01 pg/L, but occasionally the other side, say >1000 mg/L. Censored data of this particular type is handled
differently by different people and, as you suggest, values need to be imputed (with an appropriate sampling distribution)
if the rest of a multi-parameter time-sample result is to remain in the analysis.

For me this is what makes data analysis fun.

I just arrived at your site, and I see so many articles of interest. Thank you for making this available.

Fraser

Fraser March 31, 2014 at 6:07 am #

The use of the angle brackets got lost in my post above.

“Mostly of the type “less than” .01 pg/L but occasionally the other side, say “greater than” 1000 mg/L.”

jasonb March 31, 2014 at 7:42 am #

Insightful comments Fraser, thanks. Reach out any time if you want kick around some ideas on a tough
problem.

Fraser March 31, 2014 at 11:50 am #

Thanks, Jason. I will do that. Fraser

Surajit August 25, 2015 at 10:23 pm #

I like “Getting good at data preparation will make you a master at machine learning”. This is indeed a good
post.

Thanks Dr Jason.

Rohita Gupta February 10, 2017 at 5:43 pm #

Can you please share the link of this article

Jason Brownlee February 11, 2017 at 4:55 am #

I believe Surajit was quoting from this article.

Kiran Garimella November 6, 2015 at 9:20 am #

Great set of articles!

One issue that I run into is that the data sometimes lacks semantic integrity. This is not an issue of missing values,
but just having improper values. When values are of different data types within a column, it is easy to detect and fix.
However, when the data type is the same but the meaning changes, then it’s much more difficult. For example, I’ve
seen sales data where a column named ‘marketing plan code’ would have string data type denoting marketing plan
codes, except in a few cases where the users put in vendor codes because they didn’t have any other field to record
that information.

Any insights and anecdotes about this issue?

Lokesh December 2, 2019 at 4:24 pm #

Hey, can you send me data for CFST columns under axial load in machine learning?

KLeyn May 4, 2016 at 10:14 pm #

Jason, does it affect an algorithm if, during the preparation process, I transform the list of rows (like tables,
where the key columns repeat) into a pivot table, where the key columns show once and a lot of columns (say
hundreds) hold partial sums or counts for different conditions (let’s say sales of January in one column, sales of
February in a second column and so on)?
Would it create multicollinearity, as some columns could be aggregated into one?

mokhtar May 12, 2016 at 4:44 am #

Thank you for your valuable information on this important area of machine learning, starting with data structure
and going further to build it up completely.

ali October 21, 2016 at 9:50 am #

How can I make one attribute the decision attribute in the data set, so that the classification model depends on
the selected attribute?

Jason Brownlee October 22, 2016 at 6:54 am #

Hi Ali,

Different algorithms will choose which variables to use and how to use them. You can force a model to use one
variable by deleting all of the other variables.

Avin October 31, 2016 at 11:21 am #

Hi Jason,

Appreciate the effort you put into the great article.

I am currently working on a project on a government data set to find if an entity (person or individual) was
involved in a positive or a negative way. I took a flat file containing some test data and prepared the code to
perform sentiment analysis using the Naive Bayes algorithm and the NLTK Python modules.
– In most cases we have a defined trained data set tagged as ‘positive’ or ‘negative’ (e.g movie reviews, twitter data
set). In my case there is no existing trained government data set.
– The training data is available but I need to categorize the training data set as ‘positive’ or ‘negative’.
– My question here is, how do we go about classifying my government data as ‘positive’ or ‘negative’.

I’m looking forward to your advice on how to categorize my government training data as positive or negative. This is
very important for me to get my sentiment analysis with the best possible accuracy.

Jason Brownlee November 1, 2016 at 7:58 am #

Hi Avin, I would advise you to locate a subject matter expert to prepare a high-quality training dataset for
you (manual classifications).

Mayur November 4, 2016 at 6:01 am #

What is the best way to process large amount of data for machine learning?

Jason Brownlee November 4, 2016 at 11:14 am #

Hi Mayur,

That depends on the problem and how the data is currently represented and stored. No silver bullets, sorry.

Ivan November 9, 2016 at 12:27 am #

My current and first ML project has natural language as its input and I spent a huge chunk of time on
preparing it.

I stopped once the data reached a “reasonable” level so that I could continue with the project, i.e. I’m dropping the
hard to parse cases and might return to them later once the whole pipeline is ready for testing.

Keeping the 80/20 rule in mind.

Jason Brownlee November 9, 2016 at 9:52 am #

Nice work Ivan.

ted January 3, 2017 at 3:14 am #

Thank you for your valuable posts. My question is: how do I apply machine learning to a Cancer Registry data
set?

I have two datasets:

1. Dataset1 :
About 18K observations and 22 variables; the five-year data set includes
demographics, diagnoses, and treatments.

2. Dataset2:
aggregate vitals based on race grouping on: regions, stages, vitals

Thank you for your help

Ted

Jason Brownlee January 3, 2017 at 7:39 am #

Great question ted, see this process to help you work through your problem:
https://machinelearningmastery.com/start-here/#process

José Alberto Ramos Silva March 11, 2017 at 11:46 pm #

Hi Jason, thank you for the great effort and knowledge put into all these posts!
My question will probably be silly, but since I’m a complete n00b I’ll do it just the same.
Data prep, feature analysis and engineering will get you a set of data in a format completely different from original
data. These data transformation steps may be very hard to do automatically. My problem is related to classification, I
am using NN which may not be the best choice, but hey, humor me 😉
So, cutting short. Originally, I get raw data, I prep and transform it. The transformed data will train and test “my” NN.
Now, the “real world” will challenge my model with raw data, presumably with the same format as my original training
set, minus the classification ( of course…). Now, I suppose I’ll have to go through the same data transformation of the
data before the trained model can be fed with it. Right? Doesn’t this mean extra care must be taken to make the data
transformation process (at least ideally) automatic itself?
Sorry for the long question, hope to hear your thoughts on these points. And thank you once again!

Jason Brownlee March 12, 2017 at 8:27 am #

Very good question José!

Yes. Any data transformation performed on data used to fit your model must be performed on data when making
predictions.

This means we need a very clear recipe for this transform, ideally automatic, and in the case of regression
problems it must be reversible so that we can convert predictions back into their original scale for use.
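
A minimal sketch of this idea with scikit-learn (the numbers are arbitrary placeholders):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [5.0], [10.0]])
x_scaler = MinMaxScaler().fit(X_train)      # learn the transform from training data only
X_train_scaled = x_scaler.transform(X_train)

X_new = np.array([[7.0]])                   # raw real-world input at prediction time
X_new_scaled = x_scaler.transform(X_new)    # apply the exact same transform

# For regression targets the transform must be reversible:
y_train = np.array([[100.0], [500.0], [900.0]])
y_scaler = MinMaxScaler().fit(y_train)
y_pred_scaled = np.array([[0.5]])           # model output in the scaled space
y_pred = y_scaler.inverse_transform(y_pred_scaled)  # back to the original scale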

José Alberto Ramos Silva March 16, 2017 at 9:47 am #

Thank you very much Jason. And keep up the excellent job you are doing!

Jason Brownlee March 17, 2017 at 8:23 am #

Thanks José.
dhanpal singh April 29, 2017 at 7:51 am #

What is the best book to learn how to prepare datasets for machine learning models?

Eric Kraemer August 12, 2017 at 7:19 am #

I would like to suggest that within your topic of “Select Data” you offer a bit more explicit guidance on the topic
of assessing and characterizing data quality. It’s cliché, but garbage-in-garbage-out is a fundamental concept. I so
often come across advanced analytic initiatives that have started out with assumptions about the quality of “selected” data
and moved on – only to find out months later that everything has to be reset to basic principles of data acquisition and
management.

What transforms have been applied to the source data by systems that precede the database you are selecting from?

If sensor data is involved, what formatting, precision, transformations, signal processing, etc. have been applied?

If data is being acquired from multiple, disparate systems what formatting, scale, and precision differences are being
masked by the database system you are selecting from?

Just a few examples.

Jason Brownlee August 13, 2017 at 9:43 am #

Really good points Eric.

It’s hard to give general advice on data prep because of all of the detail in specific data matters.

It’s not like algorithms where you can say “try everything and see what works on your data”.

Rahul Shukla September 28, 2017 at 2:37 pm #

How do we save our preprocessed data into a database, and how do we train a model using this data?

Jason Brownlee September 28, 2017 at 4:47 pm #

You could save it to CSV file. Pandas and Python standard libraries offer functions to do this.

rahul shukla September 28, 2017 at 6:03 pm #

Is CSV the better way to save sentiment data, or should we use some NoSQL database to store it?

Jason Brownlee September 29, 2017 at 5:02 am #

There is no best way. Perhaps choose an approach that will work best for your project.
Rahul Shukla September 29, 2017 at 2:21 pm #

Sir, please suggest an approach for Twitter sentiment analysis using deep learning.

Jason Brownlee September 30, 2017 at 7:36 am #

Perhaps this tutorial will get you started:


https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/

Aniket Saxena October 20, 2017 at 3:17 pm #

Hi Jason,

when I go through the UCI Machine Learning Repository, the following doubts occur:

1. In the bike sharing dataset, I saw two .csv files (one is day.csv and another is hour.csv). So I can’t understand how to
make this dataset suitable for applying a machine learning algorithm to build a predictive model by splitting the
whole dataset into train and test sets.

2. In this repository, I saw dataset characteristics listed as multivariate and univariate; what does this mean?

3. In this repository, whenever I explore any of the datasets, there is no statement present to mention which
feature is to be predicted by applying machine learning algorithms.

4. What if both numeric types (float as well as integer) exist among the values of a feature in a dataset? Should we scale
the feature values (integer) to float in order to get a good predictive model?

Please help…….

Jason Brownlee October 21, 2017 at 5:25 am #

Each dataset is different. You will need to take care and discover how to prepare each one.

Univariate means one variable, feature or column (all the same thing), multivariate means many.

You might have to check the data or read the associated paper.

Depending on the algorithm used, you might need to convert all features to numeric.

Aniket Saxena October 21, 2017 at 1:41 pm #

So, this means that we have to convert the integer values of every feature in a dataset into float values in
order to increase the accuracy of our model? Correct me if I am wrong.

What do you think if there are two .csv files in a dataset: how should we prepare this type of dataset? Please
recommend a way to do this.

Thanks…
Jason Brownlee October 22, 2017 at 5:15 am #

Perhaps, it depends on the algorithms being used. I would recommend trying it.

If there are two files, I would recommend combining the data into one file.

Aniket Saxena October 22, 2017 at 2:37 pm #

Hello Jason,

When I saw the bike sharing dataset in the UCI Machine Learning Repository, UCI mentions its dataset characteristics as
univariate despite it having a total of 16 features (columns). Why is this so? Shouldn’t it be multivariate instead?

Secondly, as you recommended joining the two .csv files into one: when I use this dataset, I noticed that both files have the
same features (except hr (hour), available only in hour.csv, not in day.csv) with different values in each of the
shared features. In this situation, if I join both files, the values become redundant, and even the features as well.
So what do you recommend? How should I prepare my dataset in this situation?
Thanks for your quick response to the previous question….

Jason Brownlee October 23, 2017 at 5:39 am #

Perhaps they define univariate in terms of output variable only.

Sorry, perhaps I don’t have enough information to give you good advice on how to prepare your data.

Aniket Saxena October 25, 2017 at 3:55 am #

Thanks for your help on this topic, but please, whenever you come to know how to prepare this
type of dataset, do tell me or make a recommendation at that time.

Thank you so much for guiding me on how to prepare any dataset by creating this amazing post.

Gene November 4, 2017 at 7:59 pm #

What happens if I use data that does not have a normal distribution?
Are some ML algorithms only suitable for data that are assumed to be normal?
How can I identify whether an algorithm works with normal and non-normal data, or just normal data?

Jason Brownlee November 5, 2017 at 5:16 am #

In practice, you can often get good results by breaking these rules.

I would recommend testing a suite of algorithms on your data and double down on what seems to be working.
Krish December 12, 2017 at 1:48 am #

Thanks for your help. Can you please suggest the best way to deal with a dataset that contains a
lot of text columns? Also, the values of these columns have a huge set of different values.

Jason Brownlee December 12, 2017 at 5:35 am #

I have some ideas here that may help:


https://machinelearningmastery.com/start-here/#nlp

Everton January 10, 2018 at 12:20 am #

Hi Dr. Jason,

Thank you for your work, I really appreciate your efforts in helping us.
I am a BIG fan.

First of all, I’m planning to use an LSTM-RNN on a multivariate time series problem.

I’m beginning my studies in machine learning and probably my question is very silly, but to me it is a big issue.

I have a time series database with 221 features, not yet framed as supervised learning, which I would like to transform
into an input with 6 up to 10 features. After this, I would like to supervise the output up to 10 time-steps with 1 feature.

I preprocessed my database by cleaning, detrending, normalizing, correlating and clustering by affinity. I got 27
clusters from my 221 features.

My question is:

Now I think I can choose my input data, but how? Should I pick from the same cluster that has affinity
with my output, or should I pick from other clusters that don’t have affinity with my output?

Thanks for your time; sorry about the long text.

Jason Brownlee January 10, 2018 at 5:28 am #

Perhaps try a few methods and see which is easier to model.

Everton January 11, 2018 at 1:54 am #

Sorry, but I didn’t get the answer.

Jesús Martínez February 8, 2018 at 4:51 am #

Good article, Jason.

Another data processing technique that is commonly used today, particularly in computer vision, is data
augmentation where basically we introduce small changes such as rotations, coloring, and translations to images in
order to emulate different conditions.

Jason Brownlee February 8, 2018 at 8:31 am #

Here are some examples:


https://machinelearningmastery.com/image-augmentation-deep-learning-keras/

Surya Gupta February 11, 2018 at 5:05 pm #

hello,
Actually, I am new to ML. I want to know: when we apply data preprocessing to a dataset, do we have to
change the existing dataset or create a new dataset for the modified data? That is, after preprocessing is
done, will we have two datasets, the actual dataset and the preprocessed dataset, or will there be only
one dataset with the preprocessed data?

Jason Brownlee February 12, 2018 at 8:27 am #

Create and save a new dataset or views on your raw data.

chini February 15, 2018 at 5:43 pm #

Hi sir, actually I want to prepare a data set for a speaker recognition project; for that I would like to prepare
audio recorded data. Will you please mention the best procedure for that?

Jason Brownlee February 16, 2018 at 8:33 am #

Sorry, I don’t have material on preparing audio data. I hope to cover it in the future.

vikash February 19, 2018 at 4:41 pm #

Hi,
I am Vikash. I want to know about the assumptions, meaning the pre-validation and post-validation of data. For
example, for linear regression we have pre-validation or diagnosis like:
1. Normal distribution of data
2. No multicollinearity
3. Linear relationship
4. Missing values
For post-validation or diagnosis after creating the linear regression model there are:
1. Normality of errors
2. Homoscedasticity
3. Outliers and leverages
4. Autocorrelation
These are the assumptions for linear regression. What about the assumptions of the rest of the algorithms? Can you guide me
on the assumptions for other algorithms?

Thank you.

Jason Brownlee February 21, 2018 at 6:23 am #

Often you can get good results, or even better results, if you ignore these types of assumptions. The
reason is that in predictive modeling, model skill is more important than theoretical correctness.

gayathri April 6, 2018 at 2:08 pm #

I have a CSV file with timestamp, hostname, metric (CPU, MEM, PAGESCAN) and metric value (e.g. 0.7). I need to find
the increase in metric value due to CPU or MEM or PAGESCAN.

If CPU increased, then which host is utilizing the CPU the most; finding the root cause like that.

The data set contains both categorical values and numerical values. Do I need to convert the categorical data like
hostname and metric name to numerical?

Do I need to do data transformation?

What machine learning techniques will best predict the root cause? Which algorithm?

I am trying to use spark ML.

Any suggestions.

Thanks

Jason Brownlee April 6, 2018 at 3:51 pm #

Yes, I would recommend converting categorical data to integer or even one hot encoding prior to
modeling.

I would recommend testing a suite of methods on your data to see what works best. Then double down on that.

Sorry, I don’t have examples for Spark.

Ernest April 16, 2018 at 2:33 am #

Hello,
I’m new to Machine Learning so I have a question: does input data have to be the same size?
I mean, I have 10 matrices with data, but the matrices have sizes of, for example, [60, 120], [60, 460], [60, 340] and so on. I
want to use the TensorFlow engine.
I would be grateful if you could answer my question.
Regards!

Jason Brownlee April 16, 2018 at 6:11 am #
Yes, generally data must have the same shape.

Adama April 17, 2018 at 1:47 pm #

Hello Jason,
I am seeking documentary support for my project, “Techniques for preparing data for data science
projects”. I have read the various comments, but my project requires more on the advantages of the
different methods. I am asked to:
Produce a state of the art of the techniques and tools for data preparation, and group the approaches
according to the methods and techniques used.
Summarize the advantages and disadvantages of the most relevant state-of-the-art methods, and
propose a process for data preparation.
I am really in need of guidance and documentation. Thank you.

Jason Brownlee April 17, 2018 at 2:57 pm #

Hi Adama, I think if you’re having trouble with your homework project, my best advice is to
talk to your professor and teaching staff. You are paying them to teach you.

Data preparation is really specific to a given type of data and predictive modeling problem. Perhaps you
can focus your attention more narrowly to make the project easier.

Cheyne Ravenscroft April 18, 2018 at 11:53 pm #

Hi Jason,

I really like the site and there are a lot of really useful things here. I’m presented at the moment with a problem.

I’m attempting to classify a number of scanned PDFs based on the machine-read text within them. I’ve got to the
point where I have a relatively large test set.

The documents themselves have extremely predictable sentences which tie in very closely with the classification;
however, all I’ve managed to really find on this is using the BoW model.

Would using a neural net to achieve this be a viable option? Also, I’m having some problems with the pre-processing
of the data. I’m not 100% sure of the best way to remove ‘\n’ characters and other punctuation from the large text strings.

any help or pointers would be greatly appreciated.

Many thanks,

Cheyne

Jason Brownlee April 19, 2018 at 6:34 am #

It is hard for me to tell. I would generally recommend testing a suite of methods to see what works best
for your specific data.

Let me know how you go.


zino April 26, 2018 at 7:43 am #

hi Jason , i like a lot your way to explain machine learning.


I am working on combining machine learning techniques, and my question: are there ML problems where there are
enough datasets to validate my work?

Jason Brownlee April 26, 2018 at 3:01 pm #

Here are some places to get datasets:


https://machinelearningmastery.com/faq/single-faq/where-can-i-get-a-dataset-on-___

Nil May 3, 2018 at 3:01 am #

Hi DR Jason,

It is a very good guide.

I have a question. I am writing a neural network from scratch (back propagation algorithm) using the sigmoid function, so I
have scaled my data to a range between -1 and 1, i.e. ]-1,1[, but the sigmoid function gives results between 0 and 1. So I
would like to know if I must scale my data to a range between 0 and 1, i.e. [0,1], for the sigmoid function. Or would Dr Jason
please clarify whether there is a recommended scale of data when using a sigmoid function? What is the
recommended scale for the sigmoid function?

Best Regards.

Jason Brownlee May 3, 2018 at 6:37 am #

Perhaps this post will help:


https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/

Nil May 4, 2018 at 4:11 am #

Thank you.

Best regards.

Nil May 9, 2018 at 5:45 pm #

Hi DR Jason,
I’ve read the recommended post was really helpful thank you so much.

Best regards.

Jason Brownlee May 10, 2018 at 6:26 am #
I’m glad it helped.

Jeremy May 19, 2018 at 10:49 am #

Hi Dear Jason,
Thanks for this overview. I would like to know in which format I should prepare my data for Non-dominated Sorting
Genetic Algorithm 2 in MATLAB. Thanks!

Jason Brownlee May 20, 2018 at 6:34 am #

I believe that is an optimization algorithm, not a supervised learning algorithm. I don’t know what you
mean exactly?

Ekaterina June 1, 2018 at 2:41 am #

Hi Jason, I am working with a dataset that has a lot of similar data items (e.g. mobile phone data). So, I
would like to do the diversity-based sampling. What is the best way and tools to do it?

Jason Brownlee June 1, 2018 at 8:25 am #

Perhaps clustering, then filtering based on distance to cluster centroids?

Perhaps check the literature?
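
A minimal sketch of that clustering idea with scikit-learn (the data and cluster count are placeholders):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 5)  # stand-in for many near-duplicate items

km = KMeans(n_clusters=10, n_init=10, random_state=1).fit(X)

# Distance of each item to its assigned cluster centroid.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# One option: keep a fixed number of items per cluster for diversity.
keep = np.concatenate([np.where(km.labels_ == k)[0][:10] for k in range(10)])
diverse_sample = X[keep]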

Ansh July 23, 2018 at 4:34 am #

Hi Jason,

I’m trying to create a classification LSTM model. I have three categorical variables apart from my predictor variable. I
have label encoded all three variables. Do I need to scale the variables, or can I use them as is?

Jason Brownlee July 23, 2018 at 6:15 am #

Try both and see what works best.

thanuja November 26, 2018 at 9:41 pm #

Hi jason

I need to fetch questions from question-and-answer datasets using one of the ML algorithms. Can you tell me
which algorithm is best, and the procedure?
Jason Brownlee November 27, 2018 at 6:34 am #

This is a common question that I answer here:


https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use

nouf December 3, 2018 at 10:27 pm #

Dear Jason,
Thank you first of all for this amazing website.

I am working on a Sentiment Analysis application for my MSc and I am quite a beginner in this field.
I have collected the data from Twitter, but I want to know: shall I clean the data before or after annotation? Will the
order make a difference?

Jason Brownlee December 4, 2018 at 6:02 am #

A good place to get started is here:


https://machinelearningmastery.com/start-here/#nlp

kehinde January 29, 2019 at 12:50 am #

Dear Jason,
I have some unstructured JSON files that I need to preprocess as input to my machine learning algorithm. Please, any
help with how to create feature vectors from unstructured JSON files?

Jason Brownlee January 29, 2019 at 6:14 am #

You might have to write some custom code.

Toufik February 8, 2019 at 10:18 pm #

Hello Jason, thanks for your tutorial. I have a question about the softmax function: the derivation of this function.

Jason Brownlee February 9, 2019 at 5:56 am #

Perhaps this will help:


https://en.wikipedia.org/wiki/Softmax_function

VS February 14, 2019 at 5:34 am #

Hi Jason,
Thanks again for a good read. In cases when we don’t have an inherent category/class backed up by literature, do
you think its okay to use the mean value as a cut-off for classes?. For example, say we’re trying to separate between
high performers and low performers in a workplace based on a survey outcome. Now that survey doesn’t have an
exact cut-off saying anybody who gets above 10 out of 20 is high performer and below is low-performer. One thing
that I guess we could do is just use clustering first to divide the dataset into two clusters and use that as classes.
Other than that, would it be okay to calculate the mean score among all the participants and then use that as a cut-off
to divide the sample to high and low class. And then use that for train/test? Does it make sense, you think? This is
assuming that the data is normally distributed. If not, a percentile based approach might be good. Anyway, do you
think it is okay to create classes based on the average score? If not, what might be some other ways to divide the
classes based on a numerical value if there’s no inherent category? The reason this comes up is because I’m trying
to convert a regression problem to a classification problem but I am not sure if classifying based on mean is a good
idea.

Thanks!

Jason Brownlee February 14, 2019 at 8:50 am #

Your approach sounds like a good start.

I recommend testing a few different framings of the problem and discovering what works well/best for your specific
dataset.

Jennifer Watling February 16, 2019 at 8:40 am #

Hi Jason,
How do you deal with missing data in your data set? Do you just make them NA? I am using movie data and the
variable that has missing data is the actor name. I put NA in this variable and it made 2829 NAs out of 14800 records.
I believe this could be a problem but wasn’t sure how to address it.

Thanks
Jennifer

Jason Brownlee February 17, 2019 at 6:28 am #

You have many options, such as removing records with missing data (columns or rows) and imputing
values via mean/median or via a model, or mark with a special value and ignore during modeling.

Perhaps try modeling with a few variations and see what works best for your specific data.
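
A minimal sketch of some of these options (assuming pandas and scikit-learn; the toy data is made up):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"rating": [4.0, np.nan, 3.5, np.nan],
                   "actor": ["Tom", None, "Ann", None]})

# Option 1: remove rows with missing values.
complete_only = df.dropna()

# Option 2: impute a numeric column with its median.
df["rating"] = SimpleImputer(strategy="median").fit_transform(df[["rating"]])

# Option 3: mark a missing category with a special value and let the model see it.
df["actor"] = df["actor"].fillna("missing")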

Halima February 24, 2019 at 8:49 am #

Hi Jason, thank you very much for your amazing website; it was so useful for me.
I have a training dataset (labeled) that includes many instances, which I’ll use in classification methods. Now I want to
know: can I use a test dataset that includes just one instance (of course unlabeled)?

thanks

Halima
Jason Brownlee February 24, 2019 at 9:16 am #

A test set with one instance does not make sense.

Alternately, if you just want to make a prediction on new data, then perhaps this will help you:
https://machinelearningmastery.com/make-predictions-scikit-learn/

Halima February 26, 2019 at 3:52 am #

Thanks a lot. Exactly, I want to make a prediction on new data, but in Weka, not Python.
How can I do this in Weka?

Jason Brownlee February 26, 2019 at 6:29 am #

Here is an example:
https://machinelearningmastery.com/save-machine-learning-model-make-predictions-weka/

Prem Alphonse March 25, 2019 at 9:19 am #

Hi Jason,

1. I have the application dataset in JSON format. I convert it to a flat CSV to work in R because it has nested fields. Is that the
right approach?

2. In the data, some records have more than one applicant, but at most two or three; few have more
than 3. In that case, in the flat file, the variable columns for the additional applicants are mostly NULL,
as only a few records have them. May I know how to handle that, please? (Example record information is: Applicant
Names, Phone, Company, Salary, Asset Amount, Age, Gender, ….)

Thanks in Advance

Jason Brownlee March 25, 2019 at 2:15 pm #

Sounds like the right approach.

Perhaps you can mark the values as missing:


https://machinelearningmastery.com/faq/single-faq/how-do-i-handle-missing-data

Prem Alphonse March 25, 2019 at 4:58 pm #

I am in the right path then, Thanks Jason

chesang sumukwo May 27, 2019 at 4:56 pm #
Hi Dr.,
I am an MSc. student working on machine learning, specifically a convolutional neural network to predict phenotypes using
genomic data. My data set is coded {0 1 2}. My questions are:
1. Is it possible to use the same coding to make predictions, or am I supposed to transform it to {0 1}?
2. If I am to transform it, how do I go about it?

Jason Brownlee May 28, 2019 at 8:09 am #

Yes, you can model a problem as multi-class classification, perhaps this will help:
https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

Simon Downes June 7, 2019 at 1:53 am #

On the point of Data Cleaning:


“Additionally, there may be sensitive information in some of the attributes and these attributes may need to be
anonymized or removed from the data entirely.”

We have ML running on a data lake containing raw system data.


How common is the issue of needing to provide anonymised data sets for building data models dependent on
business function accessing and their identified ‘Legal Basis for Processing’, for example under GDPR

Jason Brownlee June 7, 2019 at 8:06 am #

Case by case really, based on the type of consent users gave.

The world is very different now compared to the 90s/2000s when data was scarce and it was a free for all.

Prem Alphonse June 7, 2019 at 5:36 pm #

Hi Jason, how do I preprocess a single test record so that it follows the same procedure used while training
the model?

Jason Brownlee June 8, 2019 at 6:49 am #

If you have a pipeline, it will perform the preparation for you.

Anamika June 11, 2019 at 6:16 pm #

Hi Dr.Jason

Can you mention how to learn the decomposition and aggregation transformations? I am not getting a clear insight into them.

Thanks
Jason Brownlee June 12, 2019 at 7:53 am #

Sure.

Decomposition: a date can be split into date/month/year.

Aggregation: customer transactions can be aggregated to give sums and averages.
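
A tiny pandas sketch of both ideas (the columns are hypothetical):

import pandas as pd

df = pd.DataFrame({"customer": ["a", "a", "b"],
                   "date": pd.to_datetime(["2020-01-05", "2020-02-10", "2020-01-07"]),
                   "amount": [10.0, 20.0, 5.0]})

# Decomposition: split the date into day/month/year components.
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

# Aggregation: sums and averages of transactions per customer.
agg = df.groupby("customer")["amount"].agg(["sum", "mean"])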

Youssef Mellah August 5, 2019 at 8:23 am #

Hy Jason,

Thank you for this helpful post.

I have a question about data preparation. I work on a Text-to-Code task and I have a CSV file that contains my dataset; it
contains 2 columns: Text and Code.

In order to use the data in training, I encode X-train into integers and pad it to a max length. But how can I
process Y-train? In the same way as X-train?

Jason Brownlee August 5, 2019 at 2:00 pm #

Same idea: encode symbols as integers and zero-pad to a fixed length. Then map the integers back to symbols
to give the final output.
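
A minimal sketch in plain Python (the vocabulary is hypothetical; in practice it would be built from the training data):

# Hypothetical token-to-integer vocabulary, with 0 reserved for padding.
vocab = {"<pad>": 0, "print": 1, "(": 2, ")": 3, "x": 4}

def encode(tokens, max_len):
    ids = [vocab[t] for t in tokens]
    return ids + [0] * (max_len - len(ids))  # zero-pad to a fixed length

ids = encode(["print", "(", "x", ")"], max_len=8)

# Map integers back to symbols to recover the final output.
inverse = {i: t for t, i in vocab.items()}
decoded = [inverse[i] for i in ids if i != 0]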

Youssef MELLAH August 15, 2019 at 12:42 am #

ah okay, thanks again for helping !!

Jason Brownlee August 15, 2019 at 8:12 am #

No problem.

Serhat Inceer October 26, 2019 at 1:47 am #

Hello Jason, first of all I am really thankful for the information which you share with us. I want to ask you
something; I am a beginner in machine learning.

How can we convert genomes into features in order to feed a machine learning algorithm?

And in the case of a heavily unbalanced training set, how would it affect my results? How can I solve it?

Thank you very much

Jason Brownlee October 26, 2019 at 4:41 am #
I don’t know about representing genomes, sorry. Perhaps check the literature to see how this is done?

There are many approaches to addressing class imbalance, perhaps start here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

Serhat Inceer October 26, 2019 at 11:42 pm #

Thank you very much I will keep searching….

Michael Gibson November 12, 2019 at 1:23 am #

Hi Jason, I’m really enjoying your books and emails about machine learning. I’ve started a project for my
final year at university on face recognition but really struggling to source a large database of photos to train a network
from scratch. How many photos of each person would you recommend for accuracy? And do you know a source that
I can access the face images?

Kind regards

Michael

Jason Brownlee November 12, 2019 at 6:43 am #

Thanks!

Perhaps test the sensitivity of the model to the number of faces for each person?

Also, for faces, use FaceNet or VGGFace2 to get the embedding, then another model to do the actual
classification.

Senthilkumar Radhakrishnan November 26, 2019 at 8:46 pm #

hi Jason,

I have a doubt: if I preprocess my training data, how can I preprocess my test data, given that, for
example, I have different labels for encoding in the test data? What should I do?

Jason Brownlee November 27, 2019 at 6:04 am #

Test data must be prepared using the same methods as were used to prepare the training data.

Senthilkumar Radhakrishnan December 2, 2019 at 6:54 pm #

Thank you for your reply Jason,

If both train and test have different labels for a common column,
e.g. col1 in train has unique values of a, b, c
and col1 in test has unique values of a, b, d,
how can this be encoded? I receive an unknown-label error.

Jason Brownlee December 3, 2019 at 4:50 am #

If you know the extent of the categorical values beforehand, you can specify them to the
OneHotEncoder so that it can handle all possible cases.

Or, if you don’t, you can set “handle_unknown” to “ignore”:


https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
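
A minimal sketch of both options (assuming scikit-learn; the toy categories mirror the question):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["a"], ["b"], ["c"]])
test = np.array([["a"], ["b"], ["d"]])  # "d" never appears in training

# Option 1: declare the full set of categories up front.
enc = OneHotEncoder(categories=[["a", "b", "c", "d"]])
enc.fit(train)

# Option 2: ignore unseen categories (they encode as an all-zero row).
enc2 = OneHotEncoder(handle_unknown="ignore")
enc2.fit(train)
print(enc2.transform(test).toarray())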

Senthilkumar Radhakrishnan December 5, 2019 at 9:12 pm #

Thanks for your reply Jason, it was informative!!!

Jason Brownlee December 6, 2019 at 5:14 am #

You’re welcome.

Surya December 5, 2019 at 9:19 pm #

What if I have a high-cardinality (many unique values) categorical variable that needs to be encoded?
Which is the best encoding method to use? Can you help with this?

Jason Brownlee December 6, 2019 at 5:15 am #

I recommend this tutorial as a first step:


https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/

Then this:
https://machinelearningmastery.com/faq/single-faq/how-do-i-handle-a-large-number-of-categories

Surya December 6, 2019 at 9:39 pm #

But those hashing and other techniques are applicable only to text data. What if I have data for
numerical analysis, for classification or clustering?

Jason Brownlee December 7, 2019 at 5:38 am #

The same methods can be used. Words are just many categories.
Surya December 9, 2019 at 4:56 pm #

Thank you !!!

Jason Brownlee December 10, 2019 at 7:25 am #

You’re welcome.

Sudipta Hazarika January 16, 2020 at 12:23 am #

Hello, Dr. Jason,


I have a question: can sampling also be referred to as feature selection/extraction? I have seen some papers where
they have taken some characteristics of a feature (like AUC, maximum value, curve-fitting coefficients) instead of the
original data.
I have been working on classification of the sensor responses of 9 sensors taken over 240 seconds, i.e. for each sample
of my experiment, I have a data matrix of 240 observations and 9 features (240*9). Thereafter, I selected some
representative points (like the maximum value and the 75th, 50th, and 25th percentile values) to make the system fast,
while keeping the performance of the classification on par with that obtained using the entire dataset.
How do I present this work as a section in a paper (feature extraction or sampling)?
Regards

Jason Brownlee January 16, 2020 at 6:20 am #

This sounds like feature engineering:


https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

Sudipta Hazarika January 17, 2020 at 7:56 am #

Thank you so much; your explanations are really helpful.


Regards

Jason Brownlee January 17, 2020 at 1:48 pm #

You’re welcome.

Sudipta Hazarika January 22, 2020 at 10:06 pm #

Sir, I still have a doubt, with my earlier question in this thread. The original dataset had a
dimension of 240*9 for each sample and for 46 samples the size of the entire data set was 46*
(240*9), ie approx. 11000*9. After doing the said feature engineering, my dataset is reduced to
188*9. I have applied a classifier algorithm for classifying the new dataset and achieved good
accuracy.

My question is whether this size difference has to be accounted for. The size of the training set is
small now and so is the testing set. Since I did use bagging KNN and AdaBoost Decision tree
classifier (decision stumps) for the original dataset, the same has been applied here also.
Sir, I have to say that you are the only accessible expert in the field and I am indebted for your
guidance.
Regards

Jason Brownlee January 23, 2020 at 6:32 am #

Not sure I can help. Sounds like you need to debug your data preparation procedures to
understand what they are doing.

Sudipta Hazarika January 23, 2020 at 8:30 pm #

Thankyou

Jason Brownlee January 24, 2020 at 7:46 am #

You’re welcome.

Mike Kelly January 22, 2020 at 9:57 am #

Hi Jason. Do you have an article that describes how to prepare normalized data for machine learning? If a
parent observation has many child values for a given feature, how do we represent that in a single row? If we just
assign each value to a different column, the model would consider each column as a different feature and each
observation may have their values in a different order than the other observations. Thanks for your insight.

Jason Brownlee January 22, 2020 at 1:55 pm #

Yes many, perhaps start here:


https://machinelearningmastery.com/faq/single-faq/when-should-i-standardize-and-normalize-data

Yes, that might be a good first thing to try, other ideas are here:
https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/

Mike Kelly January 22, 2020 at 4:01 pm #

Thanks for your response, Jason. I think I was using the wrong terminology in my question. By
normalized, I meant in terms of relational database structure. In other words, a categorical feature where an
observation can have multiple instances of the different category values. It seems that the solution would still
be to use dummy encoding but in this case an observation could have a 1 in more than one column. I just
need to find the Pandas method that can take multiple tables into account during dummy encoding.

Jason Brownlee January 23, 2020 at 6:25 am #

Perhaps de-normalize the data to one row per example prior to data prep?

Karthick January 27, 2020 at 5:31 pm #

Sir, when we have to perform outlier detection, please upload a post on how to remove outliers in multivariate
classification and regression using Python.

Jason Brownlee January 28, 2020 at 7:50 am #

Perhaps start here:


https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/

Mau February 12, 2020 at 5:07 pm #

My dataset contains a derived attribute which is calculated by subtracting two attributes. Is it good to keep, or
should I remove this attribute while building the model?

Jason Brownlee February 13, 2020 at 5:38 am #

Compare model performance with and without the variable.

Samaneh February 27, 2020 at 2:17 am #

Hello, I still wonder how I can work with my data. I prepared a data file in Excel to use in a deep learning
model. I don’t know how I can label all the features as ‘data’ and the class feature as ‘target’!

please help me with this issue.

Best

Jason Brownlee February 27, 2020 at 5:56 am #

This will help:


https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

Samaneh February 27, 2020 at 9:10 pm #
Thank you for your response, but it didn’t have any notes about my question! I’ve prepared a
database regarding the defined problem.

Jason Brownlee February 28, 2020 at 6:06 am #

The linked article will help you to identify the input and output parts of your predictive modeling
problem.

Samaneh February 29, 2020 at 1:08 am #

Thank you for all your articles, very helpful 🙂

Jason Brownlee February 29, 2020 at 7:16 am #

You’re welcome.

Sverre February 27, 2020 at 2:39 am #

Hi Jason,

I don’t know if this question has been posed, but when in this process
do you recommend splitting the dataset into training and validation sets,
in terms of avoiding data leakage?

Jason Brownlee February 27, 2020 at 5:57 am #

Good question, split first.

Also, this will help:


https://machinelearningmastery.com/difference-test-validation-datasets/

Piyush Pandey March 28, 2020 at 11:32 pm #

Hi Jason

I found your article very interesting. I was making a model for the famous ‘Titanic’ dataset. I found that even after
using XGBoost I wasn’t getting accuracy above 77%. I realised that this may be because of poorly prepared data, so I
began searching for the best ways to prepare data. As I am new to this field I didn’t know about decomposition and
aggregation.

My question is: how do I tell when to decompose/aggregate a feature? Also, do you have any links to articles or tutorials
about data preparation? It would be a great help.

Piyush
Jason Brownlee March 29, 2020 at 5:56 am #

Good question – try it and see if it improves performance. If it does keep it and continue to try other
things.

Skylar May 8, 2020 at 4:28 am #

Hi Jason,

Very clear and nice post! I have two confused points:


1. what is the relationship between data preprocessing and feature engineering? From your post, if I understand
correctly, it seems that data preprocessing has bigger scope than feature engineering, and feature engineering is
included in the data preprocessing, am I right?
2. I found from some websites that some people first do data splitting to the training and test datasets, and then do
data preprocessing (e.g. scaling and centering) separately. I wonder whether it is a correct order to do like this?
Because I understand we should first do data preprocessing (including feature engineering) for the entire dataset,
and then do data splitting, right?

Thank you in advance and look forward to your answers and suggestions!

Jason Brownlee May 8, 2020 at 6:42 am #

Thanks!

Great questions!

Some refer to data prep as “feature engineering”. Some refer to feature engineering as a subset of data prep
focused on creating new inputs from existing inputs. I like the latter definition.

It is correct to fit data prep methods on the training set, then apply them to train, test, validation and any other
datasets. This is to avoid data leakage.
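
A minimal sketch of that order of operations (assuming scikit-learn; the data is a random placeholder):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)
y = np.random.randint(0, 2, 100)

# Split first...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# ...then fit the preparation on the training set only and apply it to both.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)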

Skylar May 8, 2020 at 7:38 am #

Thank you Jason for your reply! So to avoid data leakage, we should first split data to training and
test dataset; then do data preprocessing (e.g. scaling, centering) on training, and pass this preprocessing
method and apply on test dataset, right?

Jason Brownlee May 8, 2020 at 8:03 am #

Yes, or if using cross-validation, do data prep on the train folds and apply to train/test folds.

Skylar May 8, 2020 at 8:13 am #

Got it, many thanks!


Jason Brownlee May 8, 2020 at 11:19 am #

You’re welcome.

Mer June 3, 2020 at 8:15 am #

Hi Jason!

Very good article. I am working on a project measuring pressure from a device. If I get the raw data from that device
in binary form (a lot of values for one measurement), do you think I should pre-process this data? I mean, derive a
number for the total pressure or so, or can I use that raw data as one feature (column) to store in the database to be
used as training data?

thank you in advance! 🙂

Jason Brownlee June 3, 2020 at 1:14 pm #

Thanks!

Perhaps try fitting a model on the raw data and compare results with different data preparation methods to see if
you can lift the performance of the model.

Ravi Malvia July 3, 2020 at 8:42 pm #

Thank you, sir, for such nice content.

I always wait for your help emails, and your content gives me a step-by-step process for machine learning.

Jason Brownlee July 4, 2020 at 5:59 am #

Thank you!

Jean-Christophe Chouinard April 16, 2021 at 10:31 am #

I do relate to running machine learning algorithms on subsets of data while building models. I remember my
first Machine learning project. I was following a tutorial, and I used a dataset with 100M rows. Took me forever to
complete the tutorial 😛

Jason Brownlee April 17, 2021 at 6:04 am #

Thanks.