How To Create Datasets - Strategies and Examples

The document outlines six effective strategies for creating datasets for machine learning projects, including leveraging internal data, utilizing research dataset platforms, and scraping the web. It emphasizes the importance of compliance, security, and timeliness in data collection. Additionally, it provides examples of resources and tools to assist in dataset creation.

12/19/24, 12:48 PM How to Create Datasets: strategies and examples

How to Create Datasets: strategies and examples

Are you tired of scouring the internet for the perfect dataset to train your ML models? Worry no more! This article will show you six tried-and-true methods for creating datasets that will make your models sing. These techniques may not be magic, but they'll fit the bill for most ML projects.

https://kili-technology.com/data-labeling/machine-learning/create-dataset-for-machine-learning

Table of Contents

1. Introduction
2. Strategy #1 to Create your Dataset: ask your IT
   User in the loop
   Side business
3. Strategy #2 to Create your Dataset: Look for Research Dataset platforms
4. Strategy #3 to Create your Dataset: Look for GitHub Awesome pages
5. Strategy #4 to Create your Dataset: Crawl and Scrape the Web
6. Strategy #5 to Create your Dataset: Use products API
7. Strategy #6 to Create your Dataset: Look for datasets used in research papers
8. How to Create a Dataset of Amazon Reviews with Python and BeautifulSoup: A Step-by-Step Guide
   Step 1: Install Required Libraries
   Step 2: Import Libraries and Set Up the Base URL
   Step 3: Define a Function to Scrape Reviews
   Step 4: Scrape Multiple Pages and Save the Dataset
   Conclusion
9. Key Takeaways

Introduction
Who loves datasets?! At Kili Technology, we love datasets, which won't come as a shocker. But guess what none of us like? Spending too much time creating datasets (or searching for them). Although this step is essential to the machine learning process, we must admit it: this task gets daunting quickly. Do not worry, though: we've got you covered!

This article will go through the 6 common strategies to think of when building a dataset.

Although these strategies may not be suitable for every use case, they're common approaches to consider and should give you a hand in building your ML dataset. Without further ado, let's create your dataset!

Strategy #1 to Create your Dataset: ask your IT

When it comes to building and fine-tuning your machine-learning models, one strategy should be at the top of your list: using your own data. Not only is this data naturally tailored to your specific needs, but it's also the best way to ensure that your model is optimized for the types of data it will encounter in the real world. So if you want to achieve maximum performance and accuracy, prioritize your internal data first.

Here are additional techniques to gather more data from your users:

User in the loop

Are you looking to get more data from your users? One effective way to do so is by designing your product to make it easy for users to share their data with you. Take inspiration from companies like Meta (formerly Facebook) and its fault-tolerant UX. Users might not see it, but its UX leads them to correct machine errors or improve ML algorithms.

Side business
Let's focus on data gathered through the "freemium" model, which is particularly popular in the Computer Vision field. By offering a free-to-use app with valuable features, you can attract a large user base and gather valuable data in the process. A great example of this technique can be seen in popular photo-editing apps, which offer users powerful editing tools while collecting data (such as face images) for the company's core business. It's a win-win for everyone involved!

Caveats

To make the most of your internal data, you should ensure it meets these three crucial criteria:

1. Compliance: Ensure your data is fully compliant with all relevant legislation and regulations, such as the GDPR and CCPA.
2. Security: Have the necessary credentials and safeguards to protect your data and ensure that only authorized personnel can access it.
3. Timeliness: Keep your data fresh and up-to-date to ensure it's as valuable and relevant as possible.

Strategy #2 to Create your Dataset: Look for Research Dataset platforms

You can find several web pages or websites that gather ready-to-use datasets for machine learning. Among the most famous:

Kaggle datasets: https://www.kaggle.com/datasets
Hugging Face datasets: https://huggingface.co/docs/datasets/index
Amazon Datasets: https://registry.opendata.aws/
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
Google's Datasets Search Engine: https://datasetsearch.research.google.com/
Papers with Code datasets: https://paperswithcode.com/datasets
Subreddit datasets: r/datasets
US government's datasets: Data.gov, or the European data platform: data.europa.eu

Strategy #3 to Create your Dataset: Look for GitHub Awesome pages

GitHub Awesome pages are lists that gather resources for a specific domain. Isn't it cool?! There are fantastic pages for many things, and lucky us: datasets as well.

Awesome pages can be on more or less specific topics:
- You can find datasets on awesome pages that gather resources with a broad scope, ranging from agriculture to economy and more: https://github.com/awesomedata/awesome-public-datasets or https://github.com/NajiElKotob/Awesome-Datasets
- But you can also find awesome pages on narrower, more specific topics. For example, datasets focusing on tiny object detection (https://github.com/kuanhungchen/awesome-tiny-object-detection) or few-shot learning (https://github.com/Bryce1010/Awesome-Few-shot).

Strategy #4 to Create your Dataset: Crawl and Scrape the Web

Crawling is browsing a vast number of web pages that might interest you. Scraping is about gathering data from given web pages.

Both tasks can be more or less complex. Crawling will be easier if you narrow the pages to a specific domain (for example, all Wikipedia pages).
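
To make the "narrow the pages to a specific domain" idea concrete, here is a minimal sketch that uses only the Python standard library. It collects the links found on a page and keeps those that stay on one domain, which is the basic filtering step of a crawler. The `LinkCollector` class, the `same_domain_links` helper, and the sample HTML are all hypothetical names and data invented for this illustration; a real crawler would also download each page and keep track of the ones it has already visited.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(html, page_url, allowed_domain):
    """Return absolute links from `html` whose host is `allowed_domain`."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        on_domain = urlparse(absolute).netloc == allowed_domain
        if on_domain and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical page content, used here instead of a real HTTP response.
sample_html = """
<a href="/wiki/Machine_learning">ML</a>
<a href="https://en.wikipedia.org/wiki/Data_set">Data set</a>
<a href="https://example.com/elsewhere">External</a>
"""
links = same_domain_links(
    sample_html, "https://en.wikipedia.org/wiki/Main_Page", "en.wikipedia.org"
)
print(links)
```

The external link is dropped, so the crawler's queue only ever grows with pages from the domain you chose.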

Both these techniques enable the collection of different types of datasets:

Raw text, which can be used to train large language models.
Task-specific text, used to train specialized models: for example, product reviews with their star ratings.
Text with metadata, which enables training classification models.
Multilingual text, which is used to train translation models.
Images with captions, which enable training image classification or image-to-text models…

Pro tip: you can build your crawler and scraper with the following Python packages:

https://github.com/scrapy/scrapy
https://pypi.org/project/beautifulsoup4/
https://selenium-python.readthedocs.io/

You can also find more specific but ready-to-use repositories on GitHub, including:

Google Image scraper: https://github.com/jqueguiner/googleImagesWebScraping
News scraper: https://github.com/fhamborg/news-please

Strategy #5 to Create your Dataset: Use products API

Some big service providers or media outlets offer an API in their product that you can use to retrieve data when access is open. You can, for example, think of:

Twitter API to retrieve tweets: https://developer.twitter.com/en/docs/twitter-api and the lovely Python library: https://github.com/tweepy/tweepy
Sentinel Hub API to fetch satellite data from Sentinel or Landsat satellites: https://www.sentinel-hub.com/develop/api/
Bloomberg API for business news: https://www.bloomberg.com/professional/support/api-library/
Spotify API to get metadata about songs: https://developer.spotify.com/documentation/web-api/
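
Whatever the provider, building a dataset from an API usually comes down to the same loop: request one page of results, stop when nothing comes back, and flatten the items into rows. The sketch below shows that loop with the standard library only. It does not call any of the APIs listed above: `fake_fetch_page` is a hypothetical stand-in that returns made-up records, and the field names (`id`, `title`, `extra`) are assumptions. In practice you would replace it with a function that calls the provider's client library with your credentials.

```python
import csv
import io


def collect_pages(fetch_page, max_pages):
    """Call `fetch_page(page)` until it returns no items or max_pages is hit."""
    rows = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break  # an empty page means there is nothing left to fetch
        rows.extend(items)
    return rows


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text, ignoring fields not listed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def fake_fetch_page(page):
    """Stand-in for a real API call (hypothetical data, two pages only)."""
    data = {
        1: [{"id": 1, "title": "Song A", "extra": "x"}],
        2: [{"id": 2, "title": "Song B", "extra": "y"}],
    }
    return data.get(page, [])


rows = collect_pages(fake_fetch_page, max_pages=5)
csv_text = rows_to_csv(rows, ["id", "title"])
print(csv_text)
```

Passing the fetching function as an argument keeps the pagination logic independent of any one provider, so the same loop can be reused when you switch APIs.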


Strategy #6 to Create your Dataset: Look for datasets used in research papers

You may be scratching your head and wondering how on earth you'll find a suitable dataset to visualize and solve your problem. No need to pull your hair out over it!

Chances are that some researchers were already interested in your use case and faced the same problem as you. If this is the case, you can find the datasets they used and sometimes built themselves. If they published this dataset on an open-source platform, you can retrieve it. If not, you can contact them to see if they are willing to share their dataset. A polite request wouldn't hurt, would it?

How to Create a Dataset of Amazon Reviews with Python and BeautifulSoup: A Step-by-Step Guide

Now that we’ve shared all our strategies to find or to build your own datasets, let’s practice our dataset-building skills with a real-life example.

Here’s your step-by-step tutorial on extracting valuable insights from Amazon reviews using Python and BeautifulSoup.

By the end of it, you'll have a fully functional Python script that effectively scrapes Amazon reviews and compiles them into a clean, structured dataset ready for analysis.

Let's jump right in!

Step 1: Install Required Libraries

Before diving into the code, make sure you have Python and the following libraries installed on your system:

requests
BeautifulSoup4
pandas

You can install them using pip:

pip install requests beautifulsoup4 pandas

Step 2: Import Libraries and Set Up the Base URL

Begin by importing the necessary libraries and establishing the base URL for Amazon's product review pages.
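
The original code for this step is not included here, so what follows is a minimal sketch of what such a setup could look like. The `BASE_URL` pattern, the `HEADERS` dictionary, the `build_review_url` helper, and the example product ID are assumptions made for illustration; check the URL pattern against the pages you actually want to scrape.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Assumed URL pattern for a product's review pages (not confirmed by the
# article); verify it against the live site before relying on it.
BASE_URL = (
    "https://www.amazon.com/product-reviews/{product_id}"
    "?pageNumber={page_number}"
)

# Many sites reject requests that lack a browser-like User-Agent header.
# The value below is only an illustrative placeholder.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; dataset-tutorial)"}


def build_review_url(product_id, page_number):
    """Fill the URL template for one product and one page of reviews."""
    return BASE_URL.format(product_id=product_id, page_number=page_number)


# "B000000000" is a made-up product ID used only to show the output format.
print(build_review_url("B000000000", 1))
```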

This sets the groundwork for our script by importing the libraries and defining the base URL to access the Amazon product review pages.

Step 3: Define a Function to Scrape Reviews

Now, create a function to scrape reviews from a single page.
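
As a hedged sketch of such a function (the article's own code is not shown here), the example below splits the work in two: `parse_reviews` extracts the fields from an HTML string, and `scrape_reviews` fetches one page and hands it to the parser. The `data-hook` attribute values, the URL pattern, and the sample HTML are assumptions about the page markup, which may differ from the live site or change over time.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL pattern and header, as in the setup step.
BASE_URL = (
    "https://www.amazon.com/product-reviews/{product_id}"
    "?pageNumber={page_number}"
)
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; dataset-tutorial)"}


def parse_reviews(html):
    """Extract title, content, and rating from each review block in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    # The data-hook values below are assumptions about the page markup.
    for block in soup.find_all(attrs={"data-hook": "review"}):
        title = block.find(attrs={"data-hook": "review-title"})
        body = block.find(attrs={"data-hook": "review-body"})
        rating = block.find(attrs={"data-hook": "review-star-rating"})
        reviews.append({
            "title": title.get_text(strip=True) if title else None,
            "content": body.get_text(strip=True) if body else None,
            "rating": rating.get_text(strip=True) if rating else None,
        })
    return reviews


def scrape_reviews(product_id, page_number):
    """Fetch one review page and return its reviews as a list of dicts."""
    url = BASE_URL.format(product_id=product_id, page_number=page_number)
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return parse_reviews(response.text)


# Hypothetical markup used to exercise the parser without a network call.
sample_html = """
<div data-hook="review">
  <a data-hook="review-title">Great product</a>
  <i data-hook="review-star-rating">5.0 out of 5 stars</i>
  <span data-hook="review-body">Works as described.</span>
</div>
"""
sample_reviews = parse_reviews(sample_html)
print(sample_reviews)
```

Keeping the parsing separate from the HTTP request makes it possible to test the extraction logic on saved pages before sending any real traffic.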

This function takes a product ID and a page number as input, constructs the URL, and sends an HTTP request to fetch the review page. It then parses the HTML content using BeautifulSoup and extracts the review title, content, and rating for each review on the page.

Step 4: Scrape Multiple Pages and Save the Dataset

Finally, create a function to scrape reviews from multiple pages and save them to a CSV file.
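
Here is one possible sketch of that last function. To keep the example self-contained and runnable offline, it departs slightly from a two-argument signature: the page-level scraper is passed in as `scrape_page` and the destination file as `output_path`. Both parameters, as well as `fake_scrape_page` and its made-up reviews, are assumptions introduced for this illustration. In a real run you would pass your own `scrape_reviews` function instead.

```python
import os
import tempfile

import pandas as pd


def scrape_all_reviews(product_id, num_pages, scrape_page, output_path):
    """Scrape `num_pages` pages, save the reviews as CSV, return the DataFrame."""
    all_reviews = []
    for page_number in range(1, num_pages + 1):
        # scrape_page returns a list of dicts, one per review on the page.
        all_reviews.extend(scrape_page(product_id, page_number))
    df = pd.DataFrame(all_reviews)
    df.to_csv(output_path, index=False)
    return df


def fake_scrape_page(product_id, page_number):
    """Stand-in for scrape_reviews so the sketch runs without network access."""
    return [{
        "title": f"Title {page_number}",
        "content": "Text",
        "rating": "5.0 out of 5 stars",
    }]


# Write to a temporary directory so the example leaves no stray files behind.
output_path = os.path.join(tempfile.mkdtemp(), "reviews.csv")
df = scrape_all_reviews("B000000000", 3, fake_scrape_page, output_path)
print(df.shape)
```

When scraping a live site, it is usually worth adding a short pause between pages (for example with `time.sleep`) so that you do not overload the server.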

This function, scrape_all_reviews, takes the product ID and the number of pages you want to scrape. It calls the scrape_reviews function for each page and collects the reviews in a list. After all the pages have been scraped, it converts the list of reviews into a pandas DataFrame and saves it as a CSV file.

Conclusion
Congratulations! You've successfully created a dataset of Amazon reviews using Python and BeautifulSoup. You can now utilize this dataset for your machine learning or data science projects. This tutorial has provided you with a foundation for web scraping techniques and the ability to collect valuable data. Feel free to modify the script to suit your specific needs or to target other websites. We hope you found this tutorial beneficial in your journey towards data science mastery. Enjoy!

Key Takeaways

So there you have it! With these six strategies and this comprehensive tutorial, you should be well on your way to building the dataset of your dreams.

But wait a minute: since your dataset is likely to be ready by now, wouldn't it be time for you to annotate it? To keep the momentum going, feel free to try the Kili Technology platform by signing up for a free trial.

KILI TECHNOLOGY © 2023
