Practice Machine Learning with Datasets from the UCI Machine Learning Repository
by Jason Brownlee on July 5, 2019 in Start Machine Learning


Where can you get good datasets to practice machine learning?

You want datasets that are real-world, so that they are interesting and relevant, yet small enough for you to review in
Excel and work through on your desktop.

In this post you will discover a database of high-quality, real-world, and well understood machine learning datasets
that you can use to practice applied machine learning.

This database is called the UCI machine learning repository and you can use it to structure a self-study program
and build a solid foundation in machine learning.

Practice Practice Practice

Photo by Phil Roeder, some rights reserved.
Why Do We Need Practice Datasets?
If you are interested in practicing applied machine learning, you need datasets on which to practice.

This problem can stop you dead.

Which dataset should you use?

Should you collect your own or use one off the shelf?
Which one and why?

I teach a top-down approach to machine learning where I encourage you to learn a process for working a problem
end-to-end, map that process onto a tool, and practice the process on data in a targeted way. For more information,
see my post “Machine Learning for Programmers: Leap from developer to machine learning practitioner”.

So How Do You Practice In A Targeted Way?

I teach that the best way to get started is to practice on datasets that have specific traits.

I recommend you select traits that you will encounter and need to address when you start working on problems of
your own such as:

Different types of supervised learning, such as classification and regression.
Different sized datasets, from tens and hundreds to thousands and millions of instances.
Different numbers of attributes, from fewer than ten to tens, hundreds and thousands of attributes.
Different attribute types, from real, integer, categorical and ordinal to mixtures.
Different domains that force you to quickly understand and characterize a new problem in which you have no
previous experience.
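
As a rough sketch of what these traits look like in practice, the snippet below summarizes a small table by its number of instances, its number of attributes, and a guessed type for each attribute. The sample rows and the `infer_type` and `summarize_traits` helpers are hypothetical and made up for illustration; they are not part of the repository or of any library.

```python
import csv
import io

# Made-up rows for illustration only (not taken from a real dataset).
# The last column plays the role of the class label.
SAMPLE = """5.1,3,red,yes
4.9,7,blue,no
6.2,5,red,yes
5.8,2,green,no
"""

def infer_type(values):
    """Guess an attribute type: 'integer', 'real', or 'categorical'."""
    try:
        [int(v) for v in values]
        return "integer"
    except ValueError:
        pass
    try:
        [float(v) for v in values]
        return "real"
    except ValueError:
        return "categorical"

def summarize_traits(text):
    """Return the instance count, column count, and a type guess per column."""
    rows = list(csv.reader(io.StringIO(text)))
    columns = list(zip(*rows))  # transpose rows into columns
    return {
        "instances": len(rows),
        "attributes": len(columns),  # counts the label column as well
        "types": [infer_type(column) for column in columns],
    }

traits = summarize_traits(SAMPLE)
print(traits)
```

A summary like this is only a first guess: an integer column may really be a category code, so it is worth checking against the dataset's own description.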

You can create a program of traits to study, and learn the algorithms you need to address them, by
designing a program of test problem datasets to work through.

Such a program has a number of practical requirements, for example:

Real-World: The datasets should be drawn from the real world (rather than being contrived). This will keep
them interesting and introduce the challenges that come with real data.
Small: The datasets need to be small so that you can inspect and understand them, and so that you can run many
models quickly to accelerate your learning cycle.
Well-Understood: There should be a clear idea of what the data contains, why it was collected, and what
problem needs to be solved, so that you can frame your investigation.
Baseline: It is also important to have an idea of what algorithms are known to perform well and the scores they
achieved so that you have a useful point of comparison. This is important when you are getting started and
learning because you need quick feedback as to how well you are performing (close to state-of-the-art or
something is broken).
Plentiful: You need many datasets to choose from, both to satisfy the traits you would like to investigate and (if
possible) your natural curiosity and interests.

As a beginner, you can get everything you need and more in terms of datasets to practice on from the UCI Machine
Learning Repository.

What is the UCI Machine Learning Repository?

The UCI Machine Learning Repository is a database of machine learning problems that you can access for free.
It is hosted and maintained by the Center for Machine Learning and Intelligent Systems at the University of
California, Irvine. It was originally created by David Aha as a graduate student at UC Irvine.

For more than 25 years it has been the go-to place for machine learning researchers and machine learning
practitioners that need a dataset.

UCI Machine Learning Repository

Each dataset gets its own webpage that lists all the details known about it, including any relevant publications that
investigate it. The datasets themselves can be downloaded as ASCII files, often in the useful CSV format.

For example, here is the webpage for the Abalone Data Set that requires the prediction of the age of abalone from
their physical measurements.
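
A minimal sketch of reading a file in this style follows. It assumes a comma-separated file with no header row, which is why the column names are supplied in code; the names in `COLUMNS` are assumptions based on the dataset's description, and the rows in `RAW` are made up to stand in for the downloaded file. The `load_rows` helper is hypothetical.

```python
import csv
import io

# Column names are assumed here; the file itself is taken to have no header row.
COLUMNS = ["sex", "length", "diameter", "height", "whole_weight",
           "shucked_weight", "viscera_weight", "shell_weight", "rings"]

# Made-up rows in the same layout, standing in for the downloaded file.
RAW = """M,0.45,0.36,0.10,0.51,0.22,0.10,0.15,15
F,0.53,0.42,0.14,0.68,0.26,0.14,0.21,9
I,0.33,0.26,0.08,0.21,0.09,0.04,0.06,7
"""

def load_rows(handle):
    """Parse comma-separated lines into dicts with typed values."""
    records = []
    for row in csv.reader(handle):
        record = dict(zip(COLUMNS, row))
        for name in COLUMNS[1:-1]:
            record[name] = float(record[name])  # physical measurements
        record["rings"] = int(record["rings"])  # prediction target
        records.append(record)
    return records

abalone = load_rows(io.StringIO(RAW))
features = [[r[name] for name in COLUMNS[1:-1]] for r in abalone]
targets = [r["rings"] for r in abalone]
print(len(abalone), "rows; first target:", targets[0])
```

To read a real download instead, pass an open file handle, for example `load_rows(open(path, newline=""))`, where `path` points at wherever you saved the file.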

Benefits of the Repository

Some beneficial features of the repository include:

Almost all datasets are drawn from the domain (as opposed to being synthetic), meaning that they have real-
world qualities.
Datasets cover a wide range of subject matter from biology to particle physics.
The details of datasets are summarized by aspects like attribute types, number of instances, number of
attributes and year published that can be sorted and searched.
Datasets are well studied which means that they are well known in terms of interesting properties and
expected “good” results. This can provide a useful baseline for comparison.
Most datasets are small (hundreds to thousands of instances), meaning that you can readily load them in a text
editor or MS Excel and review them. You can also model them quickly on your workstation.

Browse the 300+ datasets using this handy table that supports sorting and searching.

Criticisms of the Repository

Some criticisms of the repository include:

The datasets are cleaned, meaning that the researchers who prepared them have often already performed
some pre-processing in terms of the selection of attributes and instances.
The datasets are small, which is not helpful if you are interested in investigating larger scale problems and
techniques.
There are so many to choose from that you can be frozen by indecision and over-analysis. It can be hard to
just pick a dataset and get started when you are unsure if it is a “good dataset” for what you’re investigating.
Datasets are limited to tabular data, primarily for classification (although clustering and regression datasets are
listed). This is limiting for those interested in natural language, computer vision, recommender and other data.

Take a look at the repository homepage as it shows featured datasets, the newest datasets as well as which
datasets are currently the most popular.

A Self-Study Program
So, how can you make the best use of the UCI machine learning repository?

I would advise you to think about the traits in problem datasets that you would like to learn about.
These may be traits that you would like to model (like regression), or algorithms that model these traits that you
would like to get more skillful at using (like random forest for multi-class classification).

An example program might look like the following:

Binary Classification: Pima Indians Diabetes Data Set (available here)
Multi-Class Classification: Iris Data Set
Regression: Wine Quality Data Set
Categorical Attributes: Breast Cancer Data Set
Integer Attributes: Computer Hardware Data Set
Classification Cost Function: German Credit Data
Missing Data: Horse Colic Data Set

This is just an example list of traits; you can pick and choose your own traits to investigate.

I have listed one dataset for each trait, but you could pick 2-3 different datasets and complete a few small projects
to improve your understanding and put in more practice.
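
To make one of these traits concrete, here is a small sketch for the missing-data case. It assumes missing values are marked with a `?` placeholder, which you should confirm in the notes for whichever dataset you pick. The rows are made up, the helper names are hypothetical, and filling with the column mean is just one simple option rather than a recommended method.

```python
import csv
import io

MISSING = "?"  # assumed placeholder; confirm it in the dataset's own notes

# Made-up numeric rows for illustration only.
RAW = """38.5,66,?,3
39.2,?,20,1
?,?,24,2
37.3,40,16,3
"""

def read_rows(text):
    """Split comma-separated text into rows of stripped strings."""
    return [[value.strip() for value in row]
            for row in csv.reader(io.StringIO(text))]

def missing_counts(rows, marker=MISSING):
    """Count the missing entries in each column."""
    return [sum(value == marker for value in column) for column in zip(*rows)]

def impute_with_mean(rows, marker=MISSING):
    """Replace missing entries with the mean of the observed column values.

    Assumes every column has at least one observed value.
    """
    means = []
    for column in zip(*rows):
        observed = [float(value) for value in column if value != marker]
        means.append(sum(observed) / len(observed))
    return [[means[i] if value == marker else float(value)
             for i, value in enumerate(row)]
            for row in rows]

rows = read_rows(RAW)
counts = missing_counts(rows)
filled = impute_with_mean(rows)
print("missing per column:", counts)
```

Counting first tells you how much of each attribute is absent, which is useful to know before deciding whether to fill values in or drop the attribute.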

For each problem, I would advise that you work it systematically from end-to-end, for example, go through the
following steps in the applied machine learning process:

1. Define the problem
2. Prepare data
3. Evaluate algorithms
4. Improve results
5. Write-up results

Select a systematic and repeatable process that you can use to deliver results consistently.

For more on the process of working through a machine learning problem systematically, see my post titled “Process
for working through Machine Learning Problems”.
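
The five steps can be sketched as a tiny script. This is only an illustration under stated assumptions: the data is made up, the two candidate models (a majority-class baseline and a one-nearest-neighbor rule) are stand-ins chosen for brevity, and every function name is hypothetical.

```python
from collections import Counter

# Step 1. Define the problem: predict a class label from two numeric inputs.
# Made-up rows for illustration only: (input_1, input_2, label).
DATA = [
    (1.0, 1.2, "a"), (0.8, 1.0, "a"), (1.1, 0.9, "a"), (0.9, 1.1, "a"),
    (3.0, 3.2, "b"), (3.1, 2.9, "b"), (2.9, 3.0, "b"), (3.2, 3.1, "b"),
]

def prepare(data):
    """Step 2. Prepare data: alternate rows into a train set and a test set."""
    return data[::2], data[1::2]

def fit_zero_r(train):
    """Baseline that always predicts the most common training label."""
    majority = Counter(row[-1] for row in train).most_common(1)[0][0]
    return lambda inputs: majority

def fit_nearest_neighbor(train):
    """Predict the label of the closest training row (squared distance)."""
    def predict(inputs):
        def squared_distance(row):
            return sum((a - b) ** 2 for a, b in zip(row[:-1], inputs))
        return min(train, key=squared_distance)[-1]
    return predict

def accuracy(model, test):
    hits = sum(model(row[:-1]) == row[-1] for row in test)
    return hits / len(test)

def run_project(data):
    train, test = prepare(data)
    candidates = {"ZeroR": fit_zero_r, "1-NN": fit_nearest_neighbor}
    # Step 3. Evaluate algorithms on the held-out rows.
    results = {name: accuracy(fit(train), test)
               for name, fit in candidates.items()}
    # Step 4. Improve results: tuning would go here; this sketch keeps the best.
    best = max(results, key=results.get)
    # Step 5. Write up results as a short plain-text report.
    lines = [f"{name}: accuracy {score:.2f}" for name, score in results.items()]
    lines.append(f"Best so far: {best}")
    return results, best, "\n".join(lines)

results, best, report = run_project(DATA)
print(report)
```

Keeping each step as its own function makes it easier to reuse the same skeleton on the next dataset and change only the parts that differ.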

The write-up is a key part.

It allows you to build up a portfolio of projects that you can refer back to on future projects to get a
jump-start, as well as use as a public resume of your growing skills and capabilities in applied machine learning.
For more on building a portfolio of projects, see my post “Build a Machine Learning Portfolio: Complete Small
Focused Projects and Demonstrate Your Skills”.

But, What If…

I don’t know a machine learning tool.

Pick a tool or platform (like Weka, R or scikit-learn) and use this process to learn it. That way you cover both practicing
machine learning and getting good at your tool at the same time.

I don’t know how to program (or code very well).

Use Weka. It has a graphical user interface and no programming is required. I would recommend this to beginners
regardless of whether they can program or not because the process of working machine learning problems maps
so well onto the platform.

I don’t have the time.

With a strong systematic process and a good tool that covers the whole process, I think that you could work
through a problem in one or two hours. This means you could complete one project in an evening or over two
evenings.

You choose the level of detail to investigate and it is a good idea to keep it light and simple when just starting out.

I don’t have a background in the domain I’m modeling.

The dataset pages provide some background on the dataset. Often you can dive deeper by looking at publications
or the information files accompanying the main dataset.

I have little to no experience working through machine learning problems.

Now is your time to start. Pick a systematic process, pick a simple dataset and a tool like Weka and work through
your first problem. Place that first stone in your machine learning foundation.

I have no experience at data analysis.

No experience in data analysis is required. The datasets are simple, easy to understand and well explained. You
simply need to read up on them using each dataset’s home page and by looking at the data files themselves.

Action Step
Select a dataset and get started.

If you are serious about your self-study, consider designing a modest list of traits and corresponding datasets to
investigate.

You will learn a lot and build a valuable foundation for diving into more complex and interesting problems.

Did you find this post useful? Leave a comment and let me know.


About Jason Brownlee

Jason Brownlee, PhD is a machine learning specialist who teaches developers how to get results with modern
machine learning methods via hands-on tutorials.

119 Responses to Practice Machine Learning with Datasets from the UCI Machine
Learning Repository

REPLY 
hossein September 11, 2015 at 3:22 pm #

dear Jason,
You are the best teacher, because you make things simple.

REPLY 
Jason Brownlee September 11, 2015 at 5:22 pm #

Thanks hossein!
REPLY 
Satyam January 9, 2020 at 4:00 am #

Awesome post for any newbies in Data Science, really appreciate the work.

REPLY 
Jason Brownlee January 9, 2020 at 7:32 am #

Thanks!

REPLY 
Khin myo myat April 10, 2020 at 9:23 pm #

Thank you so much Sir Jason. I am surely looking forward to practising like you suggest

REPLY 
Jason Brownlee April 11, 2020 at 6:18 am #

You’re welcome.

REPLY 
Rash December 2, 2015 at 8:12 pm #

Hi Jason,
Thank you for this great post. I just began my study of data analysis and was totally confused about when to begin doing
projects. This post is truly enlightening.

REPLY 
Justin December 25, 2015 at 8:05 pm #

You mention something that is confusing… “For example, here is the webpage for the Abalone Data Set that
requires the prediction of the age of abalone from their physical measurements.”

Why do you use the word “requires”? The webpage requires… Or the dataset requires? No. Confuses.

Wouldn’t this make more sense…”The dataset provides content to the learning machine to predict the age of an
Abalone from physical measurements.”

REPLY 
abhijit kamune April 5, 2016 at 1:59 am #

I can say it is a one stop solution for Machine Learning Problem.

Thank you very much Jason, you make my life easy….. 🙂

REPLY 
Jason Brownlee April 8, 2016 at 1:37 pm #

You’re very welcome.

REPLY 
Wlodzimierz April 14, 2016 at 10:26 pm #

Hi, could you recommend me one or a few data sets on computer system resources usage just for the
purpose of machine learning ?
Thanks in advance.

REPLY 
Adam June 25, 2016 at 6:55 am #

Hi Jason,

I have a question about an example dataset, wine quality:

How should I look at the data?
I’ve opened the data and I can see that density and residual sugar are highly correlated. But what now? Should I try to
draw a plot for each feature (e.g. plot(x1,quality), plot(x2,quality) and so on)?
Could you give some advice on what steps should be taken?

Regards,
Adam

REPLY 
Jason Brownlee June 26, 2016 at 5:59 am #

Hi Adam, take a look at this process for working through an applied machine learning problem:
https://machinelearningmastery.com/process-for-working-through-machine-learning-problems/

REPLY 
Elton June 30, 2016 at 10:00 am #

Exactly what I was searching for, thank you so much!

REPLY 
Jason Brownlee June 30, 2016 at 10:31 am #

You’re welcome Elton.

REPLY 
Liem Le July 12, 2016 at 10:41 pm #

As a naive programmer who recently graduated from college, your posts are what I was looking for.
Thanks.

REPLY 
Jason Brownlee July 13, 2016 at 5:00 am #

Glad to hear it!

REPLY 
Sonal Somani August 2, 2016 at 1:31 pm #

Hey Jason, this is really nicely broken down into steps. I always felt that I get so involved in the problems
that I miss the big picture, but I think keeping a process and working through it is a good way to approach learning.
And I am definitely looking forward to practising like you suggest. Thanks!

REPLY 
Jason Brownlee August 3, 2016 at 5:53 am #

I’m glad it was useful to you Sonal.

REPLY 
kay August 14, 2016 at 6:49 am #

What a find! This is awesome beyond words, Jason; thank you!!!

REPLY 
Jason Brownlee August 15, 2016 at 12:35 pm #

You’re welcome kay.

REPLY 
Kartik August 18, 2016 at 4:08 pm #

Hey Jason,

Got a nice link flow is nice in simple words and detailed explanation.

Thanks

REPLY 
Jason Brownlee August 19, 2016 at 5:23 am #

I’m glad it’s useful Kartik.

REPLY 
Lumilog September 9, 2016 at 9:08 am #

Hi Jason!

How do you handle the datasets not seeming to have any benchmarks for what a poor, fair, or good accuracy is for
prediction? If I pick some binary classification dataset to practice on and get say, an ROC = 0.6, how am I to know if
that’s a fantastic result or there’s still a lot of improving I could do with respect to how others have done?

Thanks

REPLY 
Jason Brownlee September 10, 2016 at 7:10 am #
Great question!

The answer is to use ZeroR or similar to baseline the problem and determine the point from which all other
results can be compared. After you run through a suite of good standard algorithms you will get a feel for what
result is “easy” to achieve, providing a new baseline from which to improve.

From there, interpretation of results is problem specific.
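
A minimal sketch of a ZeroR-style baseline for classification follows: it ignores the inputs and always predicts the most common label seen in training. The labels are made up and the helper names are hypothetical; this is not Weka's implementation.

```python
from collections import Counter

def zero_r(train_labels):
    """Return a predictor that always outputs the majority training label."""
    majority, _count = Counter(train_labels).most_common(1)[0]
    return lambda _row=None: majority

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Made-up binary labels for illustration only.
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_labels = [0, 1, 0, 0, 1]

predict = zero_r(train_labels)
baseline = accuracy(test_labels, [predict() for _ in test_labels])
print(f"ZeroR baseline accuracy: {baseline:.2f}")
```

Any model that cannot beat this number on the same test data has not learned anything useful from the inputs.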

REPLY 
almas November 16, 2016 at 9:25 pm #

This post is really good for beginners sir, thank you

REPLY 
Jason Brownlee November 17, 2016 at 9:53 am #

I’m glad to hear it almas.

REPLY 
LinboLee November 21, 2016 at 12:42 am #

Hello Jason,
Thanks for your post, it is very helpful. But I have one question, which is how to validate your results or your
implemented algorithms? How to compare our results with a better one?

Thanks.

REPLY 
Jason Brownlee November 22, 2016 at 6:50 am #

Hi LinboLee, good questions.

You can evaluate the performance of your models by estimating their performance on unseen data. You can do
this with resampling methods like k-fold cross validation.

You can then compare the skill of multiple algorithms on the problem.

You can compare to previously published results by re-creating their test setup.
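
As a sketch of k-fold cross validation, the snippet below shuffles row indices with a fixed seed, splits them into k folds, and scores a model on each held-out fold. The labels are made up, the function names are hypothetical, and the model is a majority-class stand-in that you would swap for your own.

```python
import random
from collections import Counter

def k_fold_indices(n_rows, k, seed=1):
    """Shuffle row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]

def cross_validate(labels, k=5, seed=1):
    """Return one accuracy score per fold for a majority-class model."""
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        # Train on every row that is not in the held-out fold.
        train = [labels[i] for i in range(len(labels)) if i not in held]
        guess = majority_label(train)
        hits = sum(labels[i] == guess for i in held_out)
        scores.append(hits / len(held_out))
    return scores

# Made-up labels for illustration only: 14 of class 0 and 6 of class 1.
labels = [0] * 14 + [1] * 6
scores = cross_validate(labels, k=5)
mean_score = sum(scores) / len(scores)
print(f"mean accuracy over {len(scores)} folds: {mean_score:.2f}")
```

Running several algorithms through the same folds, with the same seed, gives scores that can be compared fairly.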

REPLY 
yexiuqiang May 29, 2017 at 6:52 pm #

I think I get the point for how to learn machine learning. Thank you.

REPLY 
Jason Brownlee June 2, 2017 at 12:23 pm #

I’m glad to hear that.

REPLY 
Priyank June 4, 2017 at 4:52 am #
Thanks for such a great article. You are doing great work, Sir!

REPLY 
Jason Brownlee June 4, 2017 at 7:55 am #

Thanks Priyank.

REPLY 
Kotrappa Sirbi June 19, 2017 at 9:03 pm #

Thank you so much Dr Jason Brownlee

REPLY 
Jason Brownlee June 20, 2017 at 6:36 am #

I’m glad it helped.

REPLY 
Lautaro June 29, 2017 at 3:55 am #

Very good article, as always you can articulate the theoretical and practical issues in predictive modeling.

I also recommend kaggle data sets. They are also free, have big and small data sets.

REPLY 
Jason Brownlee June 29, 2017 at 6:39 am #

Great suggestion.

REPLY 
Francisco Cabrera July 21, 2017 at 5:13 am #

Thank you for this refreshing article, Jason! Although your explanations are simple, they are deep and very
well thought at the same time. I am learning a lot from your writings.

Once again, thank you for sharing your wisdom and knowledge with us.

REPLY 
Jason Brownlee July 21, 2017 at 9:36 am #

Thanks Francisco, kind of you to say. Hang in there!

REPLY 
lab August 10, 2017 at 8:42 pm #

How are you? You can add this mine of good and open data sets: http://www.andbrain.com/
REPLY 
Jason Brownlee August 11, 2017 at 6:41 am #

Thanks for sharing.

REPLY 
Nada September 12, 2017 at 2:40 am #

You are awesome Jason. I have been looking for such a map for a long time! Thank you so much.

REPLY 
Jason Brownlee September 13, 2017 at 12:26 pm #

I’m glad it helped.

REPLY 
huda September 27, 2017 at 9:20 pm #

that is a valuable word you are really motivating me to work hard to know everything about Machine
Learning.

REPLY 
Jason Brownlee September 28, 2017 at 5:25 am #

I’m really glad to hear that!

REPLY 
Ali October 17, 2017 at 10:34 pm #

Thanks a lot Jason for providing invaluable information about Machine Learning.

REPLY 
Jason Brownlee October 18, 2017 at 5:36 am #

I’m glad it helped.

REPLY 
shivaprasad October 24, 2017 at 6:58 pm #

Awesome sir, really a good one. Thank you

REPLY 
Jason Brownlee October 25, 2017 at 6:44 am #

I’m glad it helped.

REPLY 
Vova November 6, 2017 at 11:02 pm #

Just want to say many thanks to you, Jason

Your articles are really very helpful!

REPLY 
Jason Brownlee November 7, 2017 at 9:49 am #

Thanks Vova, I really appreciate your support.

REPLY 
Shane November 28, 2017 at 8:39 am #

I love how you break down the types of machine learning problems. This is a great resource!

REPLY 
Jason Brownlee November 28, 2017 at 8:46 am #

Thanks Shane.

REPLY 
Parvez Khan January 8, 2018 at 2:56 pm #

Hi Jason

Thanks for excellent stuff on ML.

I really get my ideas clear just from your posts.
My current assignment is to compare at least four price lists and to suggest the final price list for our
company. Please suggest a suitable algorithm for this.

REPLY 
Jason Brownlee January 8, 2018 at 3:55 pm #

See this post:

https://machinelearningmastery.com/a-data-driven-approach-to-machine-learning/

REPLY 
S B Iqbal January 9, 2018 at 5:14 pm #

Hi Jason,

Thanks for a great post.

Could you also advise on how to scrape data from the UC Irvine database using R? It would be great to see a tutorial on
that.

REPLY 
Jason Brownlee January 10, 2018 at 5:21 am #
No need to scrape the datasets; you can download them directly as CSV files.

REPLY 
praveen kumar March 5, 2018 at 5:48 pm #

How can i prepare my own dataset? Can you suggest me the path?
Here Raw data may be either images or integer array or character array or strings.

REPLY 
Jason Brownlee March 6, 2018 at 6:09 am #

I recommend this process:

https://machinelearningmastery.com/start-here/#process

REPLY 
Mustafa March 13, 2018 at 5:49 pm #

Thank you for your support.

REPLY 
Jason Brownlee March 14, 2018 at 6:17 am #

You’re welcome.

REPLY 
Asma May 11, 2018 at 8:47 pm #

Hi Jason,

I want to prepare a white paper submission on Responsible AI or Ethical AI. Can you suggest any use case or problem
statement for it?

REPLY 
Jason Brownlee May 12, 2018 at 6:31 am #

No, sorry it is not my area of expertise.

REPLY 
Arjun June 29, 2018 at 11:25 pm #

I wish I could be in regular touch with you because I want to be a REAL good Data Scientist and you
REALLY know the path which can lead one there.
This website is the best source for learning machine learning.

REPLY 
Jason Brownlee June 30, 2018 at 6:08 am #

You are in touch with me, ask questions any time via comments or via the contact form.
My best advice is here:
https://machinelearningmastery.com/start-here/

REPLY 
Aditya July 3, 2018 at 4:50 am #

The Pima dataset is not available now.

REPLY 
Jason Brownlee July 3, 2018 at 6:29 am #

You can get it here:

https://github.com/jbrownlee/Datasets

REPLY 
Amrit July 14, 2018 at 8:47 pm #

hello sir
can you please guide me the data set for urban water supply

REPLY 
Jason Brownlee July 15, 2018 at 6:12 am #

It is the default value. You can learn more about how to configure the model here:
https://radimrehurek.com/gensim/models/keyedvectors.html

REPLY 
Paul A. Gureghian July 21, 2018 at 5:13 am #

how to download a dataset from UCI? just the usual way in Python and R ? is there a download link on the
site ?

REPLY 
Jason Brownlee July 21, 2018 at 6:40 am #

They have a download link and you can use a web browser.

REPLY 
Winnie November 13, 2018 at 12:40 am #

Thank you so much for the article

REPLY 
Jason Brownlee November 13, 2018 at 5:46 am #

I’m happy it helped.

REPLY 
Liston November 19, 2018 at 5:08 pm #

Good post. I am in applied machine learning application. There is need to evaluate algorithms on good
datasets.

REPLY 
Jason Brownlee November 20, 2018 at 6:32 am #

Thanks.

REPLY 
Theresa November 21, 2018 at 10:39 pm #

thank you Jason. you have no idea of how helpful this is to me now. God bless

REPLY 
Jason Brownlee November 22, 2018 at 6:24 am #

I’m glad it helps!

REPLY 
Pooja December 21, 2018 at 6:01 pm #

Hello sir,
Thank you for such a nice information, it is very simple to understand. As a student of M Sc (Statistics), i m looking
for project in data mining, can you suggest something?

REPLY 
Jason Brownlee December 22, 2018 at 6:03 am #

Thanks, perhaps experiment with some of these datasets.

REPLY 
Viva March 9, 2019 at 7:20 pm #

Hi Jason,

Thanks for your articles. I have recently started reading your page and articles. I am a practicing analyst who enjoys
playing around with data. What I lack is a systematic approach to implementing algorithms: I know them theoretically but
don’t have the confidence to implement them. I have also joined the mailing list from your website and am also
reading a number of your articles to start working with a plan. I generally get lost and overwhelmed in my learning
process and hence leave it midway. I would request you to help me with how I can keep my learning process
productive.

REPLY 
Jason Brownlee March 10, 2019 at 8:16 am #
Try working through this tutorial:
https://machinelearningmastery.com/machine-learning-in-python-step-by-step/

REPLY 
Shaima March 19, 2019 at 1:22 am #

You are the best as usual professor jason

REPLY 
Jason Brownlee March 19, 2019 at 8:58 am #

Thanks. I’m glad it helped.

REPLY 
Priya August 17, 2019 at 2:27 am #

Where can I get a plant disease dataset for machine learning? Can anyone please suggest one?

REPLY 
Jason Brownlee August 17, 2019 at 5:56 am #

This might help:

https://machinelearningmastery.com/faq/single-faq/where-can-i-get-a-dataset-on-___

REPLY 
Manjunath September 11, 2019 at 4:20 pm #

How do I read the UCI data sets in Excel? Could anyone help?

REPLY 
Jason Brownlee September 12, 2019 at 5:12 am #

See this tutorial:

https://machinelearningmastery.com/load-machine-learning-data-python/

REPLY 
shadrack kodondi October 16, 2019 at 9:05 pm #

After hovering around so many sites, I came here, the best I have ever visited for ML introductions…thanks so
much Jason

REPLY 
Jason Brownlee October 17, 2019 at 6:31 am #

Thanks!
REPLY 
Adarsh January 30, 2020 at 7:08 pm #

Hi Jason Sir,
How do I get the CSV file from the UCI repository? I am getting a txt file that opens in Notepad.
Please help fast.

REPLY 
Jason Brownlee January 31, 2020 at 7:43 am #

A CSV file is a text file.

Also, you can get the files here:

https://github.com/jbrownlee/Datasets

REPLY 
Chao March 17, 2020 at 2:17 am #

Hi, Jason.

Thank you for your posts which are so helpful to me.

Regarding the datasets from the UCI repository, I’m wondering how I get CSV format, because I found that the files there
have the extension .data, not .csv.

REPLY 
Jason Brownlee March 17, 2020 at 8:18 am #

You might need to convert some to CSV format. Some might have .data extension and already have a
CSV format.

REPLY 
Peggy March 17, 2020 at 3:01 pm #

Thanks Jason, it is a wonderful tutorial for me to start learning machine learning. It gives me confidence to
continue the study.

REPLY 
Jason Brownlee March 18, 2020 at 6:07 am #

Thanks, I’m happy to hear that.

REPLY 
Fashi Uddin April 15, 2020 at 4:58 pm #

I wish I had this early on when I was starting with Data Science. Practice is the key for sure; reading
so many books will give you knowledge about the process but in one or two directions. It’s practice which gives you
the exposure to real-life scenarios.
REPLY 
Jason Brownlee April 16, 2020 at 5:58 am #

Thanks!

I agree.

REPLY 
Sachin Koti May 1, 2020 at 11:00 pm #

Wonderfully explained…
Thanks Jason!!
Knowledge grows by sharing and you are already great in doing that.

REPLY 
Jason Brownlee May 2, 2020 at 5:47 am #

Thanks!

REPLY 
wysohn May 6, 2020 at 3:44 pm #

This is the only site I often come back, and I think it simply shows how valuable the information you share is!

Thank you so much for spending time and putting lots of effort in doing this.

REPLY 
Jason Brownlee May 7, 2020 at 6:37 am #

Thanks, you are very kind!

REPLY 
Simon J Samaha June 10, 2020 at 2:33 pm #

Awesome insights.
Now i have experiment with weka 😉

REPLY 
Jason Brownlee June 11, 2020 at 5:49 am #

You’re welcome.

REPLY 
salwa rizk August 4, 2020 at 5:55 pm #

Thank you for your help,

as it may be a reason to give hope to non-specialists like me to start again after many failed attempts. I am grateful
for all helpful people like you.
REPLY 
Jason Brownlee August 5, 2020 at 6:09 am #

You’re welcome!

REPLY 
robin August 19, 2020 at 4:36 am #

@Jason ,

Thanks for your great work. It has improved my ML knowledge and increased my interest. I was wondering if there
are other ML repositories you know of, especially ones that have raw datasets, just for the sake of working on my
data cleaning/pre-processing skills?

Thanks!

REPLY 
Jason Brownlee August 19, 2020 at 6:06 am #

Yes, see this:

https://machinelearningmastery.com/faq/single-faq/where-can-i-get-a-dataset-on-___

And this:
https://github.com/jbrownlee/Datasets

REPLY 
sai sowmya grandhi August 25, 2020 at 3:21 pm #

You made me feel that coding is not as big a deal as everybody makes it out to be. I have started using R
programming only because of you. Thanks for the confidence.

REPLY 
Jason Brownlee August 26, 2020 at 6:44 am #

Well done on your progress!

Yes, it’s not a big deal – just another tool for us to use to get a job done, like writing.

REPLY 
java37 November 1, 2020 at 8:08 am #

Much obliged to you for your posts which are so useful to me.

Concerning datasets from the UCI repository, I’m wondering how I get CSV format, since I found that the files there
have the extension .data, not .csv.

REPLY 
Jason Brownlee November 1, 2020 at 8:26 am #

You can change the .data to .csv.

Also, Python does not care about the extension, only the content.
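
A short sketch of that point: the standard `csv` module reads comma-separated text whatever the file is called. The file name `example.data` and its contents are made up for illustration, and `read_delimited` is a hypothetical helper.

```python
import csv
import os
import tempfile

# Made-up comma-separated rows for illustration only.
CONTENT = "5.1,3.5,a\n4.9,3.0,b\n"

def read_delimited(path, delimiter=","):
    """Read a delimited text file into a list of rows, whatever its extension."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter=delimiter))

with tempfile.TemporaryDirectory() as folder:
    # Save the text with a .data extension, then read it back as CSV.
    data_path = os.path.join(folder, "example.data")
    with open(data_path, "w", newline="") as handle:
        handle.write(CONTENT)
    rows = read_delimited(data_path)

print(rows)
```

If a file turns out to be separated by spaces or tabs rather than commas, passing a different `delimiter` is usually enough; renaming the file is not required.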

REPLY 
Jean-Christophe Chouinard February 26, 2021 at 2:38 pm #

Thanks for the list of models with their classifications, makes it easier to start.

REPLY 
Jason Brownlee February 27, 2021 at 6:00 am #

You’re welcome.

REPLY 
Nhung July 5, 2021 at 7:09 pm #

Thank you so much for the great sharing!

REPLY 
Jason Brownlee July 6, 2021 at 5:47 am #

You’re welcome.

REPLY 
Niaz Ahmad February 22, 2022 at 2:31 pm #

Sir, your work is greatly appreciated. Kindly clarify one point for me: I want to build a plant/fruit disease detection system.
Will it be better for me to use the existing datasets or to prepare my own? Thanks and regards

REPLY 
James Carmichael February 23, 2022 at 12:30 pm #

Hi Niaz…Existing datasets may be a tremendous help along with transfer learning.

https://machinelearningmastery.com/how-to-improve-performance-with-transfer-learning-for-deep-learning-neural-networks/

REPLY 
K.Soumaia October 1, 2022 at 6:14 am #

Thanks and regards

REPLY 
James Carmichael October 1, 2022 at 6:52 am #

You are very welcome K.Soumaia!

Email (will not be published) (required)

SUBMIT COMMENT

Welcome!
I'm Jason Brownlee PhD
and I help developers get results with machine learning.
Read more

Never miss a tutorial:

Picked for you:

Find Your Machine Learning Tribe

What Is Holding You Back From Your Machine Learning Goals?

Difference Between Classification and Regression in Machine Learning

Machine Learning for Developers

Why Machine Learning Does Not Have to Be So Hard

Loving the Tutorials?

The EBook Catalog is where

you'll find the Really Good stuff.

>> SEE WHAT'S INSIDE

LinkedIn | Twitter | Facebook | Newsletter | RSS

Privacy | Disclaimer | Terms | Contact | Sitemap | Search

