Practice Machine Learning With Datasets From The UCI Machine Learning Repository
Good practice datasets are real-world, so that they are interesting and relevant, yet small enough for you to review in
Excel and work through on your desktop.
In this post you will discover a database of high-quality, real-world, and well understood machine learning datasets
that you can use to practice applied machine learning.
This database is called the UCI machine learning repository and you can use it to structure a self-study program
and build a solid foundation in machine learning.
I teach a top-down approach to machine learning where I encourage you to learn a process for working a problem
end-to-end, map that process onto a tool and practice the process on data in a targeted way. For more information
see my post “Machine Learning for Programmers: Leap from developer to machine learning practitioner”.
I recommend you select traits that you will encounter and need to address when you start working on problems of
your own such as:
You can create a program of traits to study, and of the algorithms you need to address them, by
designing a program of test problem datasets to work through.
Real-World: The datasets should be drawn from the real world (rather than being contrived). This will keep
them interesting and introduce the challenges that come with real data.
Small: The datasets need to be small so that you can inspect and understand them and so that you can run many
models quickly to accelerate your learning cycle.
Well-Understood: There should be a clear idea of what the data contains, why it was collected, what the
problem is that needs to be solved so that you can frame your investigation.
Baseline: It is also important to have an idea of what algorithms are known to perform well and the scores they
achieved so that you have a useful point of comparison. This is important when you are getting started and
learning because you need quick feedback as to how well you are performing (close to state-of-the-art or
something is broken).
Plentiful: You need many datasets to choose from, both to satisfy the traits you would like to investigate and (if
possible) your natural curiosity and interests.
For beginners, you can get everything you need and more in terms of datasets to practice on from the UCI Machine
Learning Repository.
For more than 25 years it has been the go-to place for machine learning researchers and machine learning
practitioners that need a dataset.
Each dataset gets its own webpage that lists all the details known about it including any relevant publications that
investigate it. The datasets themselves can be downloaded as ASCII files, often in the useful CSV format.
For example, here is the webpage for the Abalone Data Set that requires the prediction of the age of abalone from
their physical measurements.
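As a minimal sketch of what working with one of these files can look like, the snippet below reads header-less, comma-separated text into Python dictionaries. The column names and the three sample rows are assumptions made up for illustration (the real attribute names and values are documented on each dataset's webpage), and an in-memory string stands in for the downloaded file so the example runs offline.

```python
import csv
import io

# Assumed column names for illustration; the data file itself has no header,
# so check the dataset's webpage for the documented attribute names.
COLUMNS = ["sex", "length", "diameter", "height", "whole_weight",
           "shucked_weight", "viscera_weight", "shell_weight", "rings"]

# Made-up rows in a comma-separated layout, standing in for a downloaded file.
SAMPLE = """M,0.45,0.36,0.09,0.51,0.22,0.10,0.15,15
F,0.53,0.42,0.13,0.68,0.26,0.14,0.21,9
I,0.33,0.25,0.08,0.20,0.09,0.04,0.05,7
"""


def load_rows(handle, columns):
    """Read a header-less CSV file into a list of dicts, converting numbers."""
    rows = []
    for record in csv.reader(handle):
        if not record:  # skip blank lines
            continue
        row = {}
        for name, value in zip(columns, record):
            try:
                row[name] = float(value)
            except ValueError:
                row[name] = value  # keep categorical values as strings
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE), COLUMNS)
print(len(rows), "rows loaded; first target value:", rows[0]["rings"])
```

To use a real download instead, pass an open file handle, for example `open("some_file.data", newline="")` (a hypothetical filename), in place of the `io.StringIO` object.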
Almost all datasets are drawn from the domain (as opposed to being synthetic), meaning that they have real-
world qualities.
Datasets cover a wide range of subject matter from biology to particle physics.
The details of datasets are summarized by aspects like attribute types, number of instances, number of
attributes and year published that can be sorted and searched.
Datasets are well studied which means that they are well known in terms of interesting properties and
expected “good” results. This can provide a useful baseline for comparison.
Most datasets are small (hundreds to thousands of instances), meaning that you can readily load them in a text
editor or MS Excel and review them. You can also model them quickly on your workstation.
Browse the 300+ datasets using this handy table that supports sorting and searching.
The datasets are cleaned, meaning that the researchers that prepared them have often already performed
some pre-processing in terms of the selection of attributes and instances.
The datasets are small, which is not helpful if you are interested in investigating larger scale problems and
techniques.
There are so many to choose from that you can be frozen by indecision and over-analysis. It can be hard to
just pick a dataset and get started when you are unsure if it is a “good dataset” for what you’re investigating.
Datasets are limited to tabular data, primarily for classification (although clustering and regression datasets are
listed). This is limiting for those interested in natural language, computer vision, recommender and other data.
Take a look at the repository homepage as it shows featured datasets, the newest datasets as well as which
datasets are currently the most popular.
A Self-Study Program
So, how can you make the best use of the UCI machine learning repository?
I would advise you to think about the traits in problem datasets that you would like to learn about.
These may be traits that you would like to model (like regression), or algorithms that model these traits that you
would like to get more skillful at using (like random forest for multi-class classification).
This is just a list of traits; you can pick and choose your own traits to investigate.
I have listed one dataset for each trait, but you could pick 2-3 different datasets and complete a few small projects
to improve your understanding and put in more practice.
For each problem, I would advise that you work it systematically from end-to-end, for example, go through the
following steps in the applied machine learning process:
For more on the process of working through a machine learning problem systematically, see my post titled “Process
for working through Machine Learning Problems”.
It allows you to build up a portfolio of projects that you can refer back to as a reference on future projects and get a
jump-start, as well as use as a public resume of your growing skills and capabilities in applied machine learning.
For more on building a portfolio of projects, see my post “Build a Machine Learning Portfolio: Complete Small
Focused Projects and Demonstrate Your Skills”.
You choose the level of detail to investigate and it is a good idea to keep it light and simple when just starting out.
Action Step
Select a dataset and get started.
If you are serious about your self-study, consider designing a modest list of traits and corresponding datasets to
investigate.
You will learn a lot and build a valuable foundation for diving into more complex and interesting problems.
Did you find this post useful? Leave a comment and let me know.
119 Responses to Practice Machine Learning with Datasets from the UCI Machine
Learning Repository
REPLY
hossein September 11, 2015 at 3:22 pm #
dear Jason,
You are the best teacher, because you make things simple.
REPLY
Jason Brownlee September 11, 2015 at 5:22 pm #
Thanks hossein!
REPLY
Satyam January 9, 2020 at 4:00 am #
Awesome post for any newbies in Data Science, really appreciate the work.
REPLY
Jason Brownlee January 9, 2020 at 7:32 am #
Thanks!
REPLY
Khin myo myat April 10, 2020 at 9:23 pm #
Thank you so much Sir Jason. I am surely looking forward to practising like you suggest
REPLY
Jason Brownlee April 11, 2020 at 6:18 am #
You’re welcome.
REPLY
Rash December 2, 2015 at 8:12 pm #
Hi Jason,
Thank you for this great post. I just began my study of data analysis and was totally confused about when to begin doing
projects. This post is truly enlightening.
REPLY
Justin December 25, 2015 at 8:05 pm #
You mention something that is confusing… “For example, here is the webpage for the Abalone Data Set that
requires the prediction of the age of abalone from their physical measurements.”
Why do you use the word “requires”? The webpage requires… Or the dataset requires? No. Confuses.
Wouldn’t this make more sense…”The dataset provides content to the learning machine to predict the age of an
Abalone from physical measurements.”
REPLY
abhijit kamune April 5, 2016 at 1:59 am #
REPLY
Jason Brownlee April 8, 2016 at 1:37 pm #
Hi, could you recommend me one or a few data sets on computer system resources usage just for the
purpose of machine learning ?
Thanks in advance.
REPLY
Adam June 25, 2016 at 6:55 am #
Hi Jason,
Regards,
Adam
REPLY
Jason Brownlee June 26, 2016 at 5:59 am #
Hi Adam, take a look at this process for working through an applied machine learning problem:
https://machinelearningmastery.com/process-for-working-through-machine-learning-problems/
REPLY
Elton June 30, 2016 at 10:00 am #
REPLY
Jason Brownlee June 30, 2016 at 10:31 am #
REPLY
Liem Le July 12, 2016 at 10:41 pm #
As a naive programmer, recently graduated from college, your posts are what I am looking for.
Thanks.
REPLY
Jason Brownlee July 13, 2016 at 5:00 am #
Hey Jason, this is really nicely broken down into steps. I always felt that I get too involved into the problems
that I miss the big picture but I think keeping a process and working through it is a good way to approach learning.
And I am definitely looking forward to practising like you suggest. Thanks!
REPLY
Jason Brownlee August 3, 2016 at 5:53 am #
REPLY
kay August 14, 2016 at 6:49 am #
REPLY
Jason Brownlee August 15, 2016 at 12:35 pm #
REPLY
Kartik August 18, 2016 at 4:08 pm #
Hey Jason,
Got a nice link flow is nice in simple words and detailed explanation.
Thanks
REPLY
Jason Brownlee August 19, 2016 at 5:23 am #
REPLY
Lumilog September 9, 2016 at 9:08 am #
Hi Jason!
How do you handle the datasets not seeming to have any benchmarks for what a poor, fair, or good accuracy is for
prediction? If I pick some binary classification dataset to practice on and get say, an ROC = 0.6, how am I to know if
that’s a fantastic result or there’s still a lot of improving I could do with respect to how others have done?
Thanks
REPLY
Jason Brownlee September 10, 2016 at 7:10 am #
Great question!
The answer is to use ZeroR or similar to baseline the problem and determine the point from which all other
results can be compared. After you run through a suite of good standard algorithms you will get a feel for what
result is “easy” to achieve, providing a new baseline from which to improve.
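A ZeroR-style baseline can be sketched in a few lines of plain Python: always predict the most frequent class seen in the training labels, then score that prediction on held-out labels. The labels below are hypothetical and exist only to make the sketch runnable.

```python
from collections import Counter


def zero_r_fit(labels):
    """Return the most frequent class label in the training data."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels for illustration only.
train_labels = ["no", "no", "yes", "no", "yes", "no"]
test_labels = ["no", "yes", "no", "no"]

majority = zero_r_fit(train_labels)
predictions = [majority] * len(test_labels)  # the same guess for every row
baseline = accuracy(test_labels, predictions)
print("majority class:", majority, "baseline accuracy:", baseline)
```

Any model that cannot beat this score on the same test data has not learned anything useful from the input attributes.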
REPLY
almas November 16, 2016 at 9:25 pm #
REPLY
Jason Brownlee November 17, 2016 at 9:53 am #
REPLY
LinboLee November 21, 2016 at 12:42 am #
Hello Jason,
Thanks for your post, it is very helpful. But I have one question, which is how to validate your results or your
implemented algorithms? How to compare our results with a better one?
Thanks.
REPLY
Jason Brownlee November 22, 2016 at 6:50 am #
You can evaluate the performance of your models by estimating their performance on unseen data. You can do
this with resampling methods like k-fold cross validation.
You can then compare the skill of multiple algorithms on the problem.
You can compare to previously published results by re-creating their test setup.
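A rough sketch of k-fold cross validation in plain Python follows. The helper names (`k_fold_indices`, `cross_validate`), the majority-class stand-in model, and the toy data are all assumptions for illustration; in practice a library routine would normally be used instead.

```python
import random
from collections import Counter


def k_fold_indices(n_rows, k, seed=1):
    """Split row indices 0..n_rows-1 into k shuffled, near-equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, k, fit, predict, seed=1):
    """Train on k-1 folds, test on the held-out fold, return k accuracies."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = fit([rows[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        correct = sum(1 for i in held_out
                      if predict(model, rows[i]) == labels[i])
        scores.append(correct / len(held_out))
    return scores


# Stand-in model: always predict the majority class of the training labels.
def fit_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


# Toy data for illustration only.
rows = [[i] for i in range(10)]
labels = ["a"] * 7 + ["b"] * 3

scores = cross_validate(rows, labels, 5, fit_majority, predict_majority)
print("mean accuracy:", sum(scores) / len(scores))
```

Swapping in different `fit` and `predict` functions while keeping the same `seed` gives every algorithm the same folds, which keeps the comparison between them fair.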
REPLY
yexiuqiang May 29, 2017 at 6:52 pm #
I think I get the point for how to learn machine learning. Thank you.
REPLY
Jason Brownlee June 2, 2017 at 12:23 pm #
REPLY
Priyank June 4, 2017 at 4:52 am #
Thanks for such a great article, You are working great,
Sir!
REPLY
Jason Brownlee June 4, 2017 at 7:55 am #
Thanks Priyank.
REPLY
Kotrappa Sirbi June 19, 2017 at 9:03 pm #
REPLY
Jason Brownlee June 20, 2017 at 6:36 am #
REPLY
Lautaro June 29, 2017 at 3:55 am #
Very good article, as always you can articulate the theoretical and practical issues in predictive modeling.
I also recommend kaggle data sets. They are also free, have big and small data sets.
REPLY
Jason Brownlee June 29, 2017 at 6:39 am #
Great suggestion.
REPLY
Francisco Cabrera July 21, 2017 at 5:13 am #
Thank you for this refreshing article, Jason! Although your explanations are simple, they are deep and very
well thought at the same time. I am learning a lot from your writings.
Once again, thank you for sharing your wisdom and knowledge with us.
REPLY
Jason Brownlee July 21, 2017 at 9:36 am #
REPLY
lab August 10, 2017 at 8:42 pm #
How are you? You can add this mine of good and open data sets: http://www.andbrain.com/
REPLY
Jason Brownlee August 11, 2017 at 6:41 am #
REPLY
Nada September 12, 2017 at 2:40 am #
You are awesome Jason. I have been looking for such a map for a long time! Thank you so much.
REPLY
Jason Brownlee September 13, 2017 at 12:26 pm #
REPLY
huda September 27, 2017 at 9:20 pm #
that is a valuable word you are really motivating me to work hard to know everything about Machine
Learning.
REPLY
Jason Brownlee September 28, 2017 at 5:25 am #
REPLY
Ali October 17, 2017 at 10:34 pm #
Thanks a lot Jason for providing invaluable information about Machine Learning.
REPLY
Jason Brownlee October 18, 2017 at 5:36 am #
REPLY
shivaprasad October 24, 2017 at 6:58 pm #
REPLY
Jason Brownlee October 25, 2017 at 6:44 am #
REPLY
Jason Brownlee November 7, 2017 at 9:49 am #
REPLY
Shane November 28, 2017 at 8:39 am #
I love how you break down the types of machine learning problems. This is a great resource!
REPLY
Jason Brownlee November 28, 2017 at 8:46 am #
Thanks Shane.
REPLY
Parvez Khan January 8, 2018 at 2:56 pm #
Hi Jason
REPLY
Jason Brownlee January 8, 2018 at 3:55 pm #
REPLY
S B Iqbal January 9, 2018 at 5:14 pm #
Hi Jason,
Could you also advise on how to scrape data from the UC Irvine database using R? It would be great to see a tutorial on
that.
REPLY
Jason Brownlee January 10, 2018 at 5:21 am #
No need to scrape the dataset, you can download them directly as CSV files.
REPLY
praveen kumar March 5, 2018 at 5:48 pm #
How can i prepare my own dataset? Can you suggest me the path?
Here Raw data may be either images or integer array or character array or strings.
REPLY
Jason Brownlee March 6, 2018 at 6:09 am #
REPLY
Mustafa March 13, 2018 at 5:49 pm #
REPLY
Jason Brownlee March 14, 2018 at 6:17 am #
You’re welcome.
REPLY
Asma May 11, 2018 at 8:47 pm #
Hi Jason,
I want to prepare a white paper submission on Responsible AI or Ethical AI. Can you suggest any use case or problem
statement for it?
REPLY
Jason Brownlee May 12, 2018 at 6:31 am #
REPLY
Arjun June 29, 2018 at 11:25 pm #
I wish I could be in regular touch with you because I want to be a REAL good Data Scientist and you
REALLY know the path which can lead one there.
This website is the best source for learning machine learning.
REPLY
Jason Brownlee June 30, 2018 at 6:08 am #
You are in touch with me, ask questions any time via comments or via the contact form.
My best advice is here:
https://machinelearningmastery.com/start-here/
REPLY
Aditya July 3, 2018 at 4:50 am #
REPLY
Jason Brownlee July 3, 2018 at 6:29 am #
REPLY
Amrit July 14, 2018 at 8:47 pm #
hello sir
can you please guide me the data set for urban water supply
REPLY
Jason Brownlee July 15, 2018 at 6:12 am #
It is the default value. You can learn more about how to configure the model here:
https://radimrehurek.com/gensim/models/keyedvectors.html
REPLY
Paul A. Gureghian July 21, 2018 at 5:13 am #
how to download a dataset from UCI? just the usual way in Python and R ? is there a download link on the
site ?
REPLY
Jason Brownlee July 21, 2018 at 6:40 am #
They have a download link and you can use a web browser.
REPLY
Winnie November 13, 2018 at 12:40 am #
REPLY
Jason Brownlee November 13, 2018 at 5:46 am #
Good post. I am in applied machine learning application. There is need to evaluate algorithms on good
datasets.
REPLY
Jason Brownlee November 20, 2018 at 6:32 am #
Thanks.
REPLY
Theresa November 21, 2018 at 10:39 pm #
thank you Jason. you have no idea of how helpful this is to me now. God bless
REPLY
Jason Brownlee November 22, 2018 at 6:24 am #
REPLY
Pooja December 21, 2018 at 6:01 pm #
Hello sir,
Thank you for such a nice information, it is very simple to understand. As a student of M Sc (Statistics), i m looking
for project in data mining, can you suggest something?
REPLY
Jason Brownlee December 22, 2018 at 6:03 am #
REPLY
Viva March 9, 2019 at 7:20 pm #
Hi Jason,
Thanks for your articles. I have recently started reading your page and articles. I am a practicing analyst who enjoys
to play around data, what I lack is systematic approach to implementation of algorithms, I know them theoretically but
don’t have the confidence on implementing them. I have also joined mailing subscription from your website and also
reading your number of articles to start working with a plan. I generally get lost and overwhelmed in my learning
process and hence leave it between. Would request you to help me on how can I keep my learning process
productive.
REPLY
Jason Brownlee March 10, 2019 at 8:16 am #
Try working through this tutorial:
https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
REPLY
Shaima March 19, 2019 at 1:22 am #
REPLY
Jason Brownlee March 19, 2019 at 8:58 am #
REPLY
Priya August 17, 2019 at 2:27 am #
where i can get plant disease dataset for machine learning, can anyone please suggest me..
REPLY
Jason Brownlee August 17, 2019 at 5:56 am #
REPLY
Manjunath September 11, 2019 at 4:20 pm #
REPLY
Jason Brownlee September 12, 2019 at 5:12 am #
REPLY
shadrack kodondi October 16, 2019 at 9:05 pm #
after hovering around so many sites, I came here, the best I have ever visited for ML introductions… thanks so
much Jason
REPLY
Jason Brownlee October 17, 2019 at 6:31 am #
Thanks!
REPLY
Adarsh January 30, 2020 at 7:08 pm #
Hi Jason Sir,
How do I get the csv file from the UCI repository…………i am getting a txt file that is getting opened by Notepad
PLz help fast
REPLY
Jason Brownlee January 31, 2020 at 7:43 am #
REPLY
Chao March 17, 2020 at 2:17 am #
Hi, Jason.
Regarding the datasets from the UCI repository, I’m wondering how I get CSV format, because I found that the files there
have the extension .data, not .csv.
REPLY
Jason Brownlee March 17, 2020 at 8:18 am #
You might need to convert some to CSV format. Some might have .data extension and already have a
CSV format.
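One way to handle both cases is sketched below, under a simple assumption: a line that contains commas is treated as already comma-separated, and any other line is split on runs of spaces or tabs. The function name and the sample values are made up for illustration, and files with quoted fields or other delimiters would need more care.

```python
import csv
import io


def to_csv_text(raw_text):
    """Convert comma- or whitespace-delimited text into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:  # skip blank lines
            continue
        # Assumption: commas mean the line is already CSV; otherwise the
        # fields are separated by runs of spaces or tabs.
        fields = line.split(",") if "," in line else line.split()
        writer.writerow([field.strip() for field in fields])
    return out.getvalue()


# Made-up sample contents standing in for two differently formatted files.
already_csv = "5.1,3.5,1.4,0.2,class-a\n"
space_delimited = "18.0   8   307.0  130.0\n15.0   8   350.0  165.0\n"

print(to_csv_text(already_csv), end="")
print(to_csv_text(space_delimited), end="")
```

The returned text can be written to a file with a .csv extension and then opened in a spreadsheet or loaded with any CSV reader.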
REPLY
Peggy March 17, 2020 at 3:01 pm #
Thanks Jason, it is a wonderful tutorial for me to start learning machine learning. It gives me confidence to
continue the study.
REPLY
Jason Brownlee March 18, 2020 at 6:07 am #
REPLY
Fashi Uddin April 15, 2020 at 4:58 pm #
Wish I have this in my early time when I was starting with Data Science. Practice is the key for sure reading
soo many books will give you knowledge about the process but in one or two directions. Its practice which gives you
the exposure for real life scenarios.
REPLY
Jason Brownlee April 16, 2020 at 5:58 am #
Thanks!
I agree.
REPLY
Sachin Koti May 1, 2020 at 11:00 pm #
Wonderfully explained…
Thanks Jason!!
Knowledge grows by sharing and you are already great in doing that.
REPLY
Jason Brownlee May 2, 2020 at 5:47 am #
Thanks!
REPLY
wysohn May 6, 2020 at 3:44 pm #
This is the only site I often come back, and I think it simply shows how valuable the information you share is!
Thank you so much for spending time and putting lots of effort in doing this.
REPLY
Jason Brownlee May 7, 2020 at 6:37 am #
REPLY
Simon J Samaha June 10, 2020 at 2:33 pm #
Awesome insights.
Now i have experiment with weka 😉
REPLY
Jason Brownlee June 11, 2020 at 5:49 am #
You’re welcome.
REPLY
salwa rizk August 4, 2020 at 5:55 pm #
You’re welcome!
REPLY
robin August 19, 2020 at 4:36 am #
@Jason ,
Thanks for your great work. It has improved my ML knowledge and increased my interest. I was wondering if there
are other ML repository you know of, specially, the ones that have raw datasets- just for the sake of working on my
data cleaning/pre-processing skills?
Thanks!
REPLY
Jason Brownlee August 19, 2020 at 6:06 am #
And this:
https://github.com/jbrownlee/Datasets
REPLY
sai sowmya grandhi August 25, 2020 at 3:21 pm #
You made me feel that coding is not big deal as everybody exaggerates it. I have started using R
programming only because of you. Thanks for the confidence.
REPLY
Jason Brownlee August 26, 2020 at 6:44 am #
Yes, it’s not a big deal – just another tool for us to use to get a job done, like writing.
REPLY
java37 November 1, 2020 at 8:08 am #
Much obliged to you for your posts which are so useful to me.
Concerning datasets from UCI vault, I’m considering how I get csv design. Since I found that the records there are
with expansion .data, not .csv.
REPLY
Jason Brownlee November 1, 2020 at 8:26 am #
REPLY
Jean-Christophe Chouinard February 26, 2021 at 2:38 pm #
Thanks for the list of models with their classifications, makes it easier to start.
REPLY
Jason Brownlee February 27, 2021 at 6:00 am #
You’re welcome.
REPLY
Nhung July 5, 2021 at 7:09 pm #
REPLY
Jason Brownlee July 6, 2021 at 5:47 am #
You’re welcome.
REPLY
Niaz Ahmad February 22, 2022 at 2:31 pm #
Sir, your work is greatly appreciated, kindly clear me at a point i want to detect a plant/fruit diseases system,
will it be better for me to use the existing datasets or to prepare my own. thanks and regards
REPLY
James Carmichael February 23, 2022 at 12:30 pm #
https://machinelearningmastery.com/how-to-improve-performance-with-transfer-learning-for-deep-learning-neural-networks/
REPLY
K.Soumaia October 1, 2022 at 6:14 am #
REPLY
James Carmichael October 1, 2022 at 6:52 am #
Welcome!
I'm Jason Brownlee PhD
and I help developers get results with machine learning.