The Essential Guide To Training Data
In spam detection, the input is an email or text message, while the label indicates whether the message is Spam or Not Spam. For example, the message "Tokyo grandmas are making 2 million yen a month from this CRAZY scheme!!!" would be labeled Spam.

In text categorization, sentences provide the input while the target suggests the topic of the sentence, such as finance or law. For example, "Despite an early red card, the champions were two goals up at half-time." would be labeled Sports.

The same applies to longer passages: the paragraph "World War II was a global war that involved most of the world's countries, including the Allied and Axis powers. With 40 to 50 million deaths recorded, it is still regarded as the largest and deadliest war in human history." would be labeled World War II.
It’s easy to see from these examples that datasets are often highly specialized.
If two different AI programs use the exact same training data, it can result in at
least one suboptimal model. This is true even if both programs deal with the same
broad category of input information, such as sentences.
Validation data is primarily used to determine whether your model can correctly identify new data or if it's overfitting to your original dataset. (Typical splits reserve the bulk of the data for training and smaller portions for validation and testing, e.g. 70/15/15 or 60/20/20.)
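As a rough illustration of carving a dataset into training, validation and test portions – the 70/15/15 proportions and plain-Python approach below are illustrative, not prescriptive:

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the data, then carve out test and validation portions."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Libraries such as scikit-learn provide equivalent helpers (e.g. `train_test_split`), including stratified splits that preserve class distribution.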
Think of data as your not-so-secret weapon. The better your data is, the better your
model will be.
The parameters that contribute to this network of meaning are different for every
dataset. Usually, they’re so numerous that it’s impossible to work them all out ahead
of time. However, there are some factors that often have a high degree of influence
on the size of your dataset. They are as follows:
For every parameter that your model needs to account for, it will need more training data. For example, a model that identifies the make of a car has a fairly small, set number of parameters that mostly relate to the vehicle's shape. However, a model that has to determine the cost of that car has to understand a far bigger picture, including the car's age, condition and any other economic or social factors that might impact the price. The higher number of parameters involved here means that the second model requires a larger dataset.

Different levels of complexity also require different training methods. Many traditional machine learning algorithms use structured learning, which has a fairly low ceiling for additional data. With this method, you'll quickly find that more data has very little impact. In contrast, models that use unsupervised learning can improve without structure and figure out their own parameters. This method requires a lot more data and also extends the learning curve where further data can have a positive impact.
Despite having the same raw data, one task yields five
times more labels than the other. If one single data point
can contain a large number of labels, then you might be OK
with a smaller overall dataset.
Of course, the task you want your model to complete has a big impact on the amount of data you need to collect. This is because some models need to have a higher level of performance on edge cases and a lower rate of error. Think of the difference between a model that predicts the weather and one that identifies patients who are at imminent risk of heart attacks. One of these has a much lower threshold for error than the other. The lower your acceptable level of risk, the more data you'll need to ensure that risk is mitigated.

In the real world, a model can encounter a wide variety of input data. A chatbot, for example, has to be able to understand a variety of languages written in formal, informal or even grammatically incorrect styles. It has to be able to understand all of them in order to provide a high level of customer service. In cases where your model's input won't be highly controlled, you'll need more data to help the model function in that unpredictable environment.

The deciding factor for how much data you'll need is your project's unique requirements and goals. Each project requires a unique balance of all of these influencing factors, which you'll have to figure out for yourself when coming up with that target dataset size. Keeping this in mind, let's now dive into some of the ways that you can begin to figure out your data needs.
1. Rule of 10
A widely cited rule of thumb suggests collecting roughly ten times as many training examples as your model has parameters (or degrees of freedom).
If you want to make a more evidence-based decision and you already have access
to some data, you could create a study to evaluate your model’s ability based on the
size of your dataset. By plotting results on a graph after each iteration, you can start
to see a relationship between dataset size and your model’s ability. This will help
you to identify the point where more data starts to provide diminishing returns. This
approach is slightly more technical, involving the creation of a few logistic regression
problems, but it should give you a more reliable result.
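One way to sketch such a study is to train on progressively larger slices of your data and record accuracy on a held-out test set. The tiny nearest-centroid classifier and synthetic data below are stand-ins, purely to show the loop; a real study would use your actual model and dataset:

```python
import random

def nearest_centroid_accuracy(train, test):
    """Tiny 1-D nearest-centroid classifier: predicts the class whose
    training mean is closest to the point."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    centroids = {y: sum(xs) / len(xs) for y, xs in groups.items()}
    correct = sum(
        1 for x, y in test
        if min(centroids, key=lambda c: abs(centroids[c] - x)) == y
    )
    return correct / len(test)

rng = random.Random(0)
# Synthetic data: class 0 clusters near 0.0, class 1 near 1.0.
data = [(rng.gauss(y, 0.5), y) for y in (0, 1) for _ in range(500)]
rng.shuffle(data)
test, pool = data[:200], data[200:]

# Evaluate at increasing dataset sizes; plot these to see where
# additional data starts to give diminishing returns.
for size in (10, 50, 200, 800):
    acc = nearest_centroid_accuracy(pool[:size], test)
    print(f"{size:4d} training examples -> accuracy {acc:.2f}")
```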
In the end, an iterative approach to building a dataset is probably the best. Just
start working on your model with the data you have, and add more as it becomes
necessary. Once you start seeing results, your needs will clarify themselves.
Even if you manage to source the perfect number of data points for your training
program, it’s highly unlikely that your dataset will be ready to train your model. In
fact, data requires a lot of processing before it can have an impact. We’ll explore
exactly what that looks like in the next section.
Read any article about training data and you'll quickly come across the well-worn phrase "garbage in, garbage out" (GIGO). While it's true that your model will only be as good as the data you feed it, what exactly constitutes 'garbage' is often glossed over. As a result, many people who are working with machine learning for the first time find it difficult to approach quality assurance for their training data.

In short, a high-quality dataset is one that has undergone a rigorous cleaning process and that contains all of the necessary parameters for your model to learn to do its task. These parameters will be consistent across the entire dataset, with no overlap between them. The goal of the quality assessment process is to develop a model that performs at a high level of accuracy and precision against real-world data. What constitutes high accuracy is determined in the early stages of the project, along with any additional metrics that need to be used to measure quality.

During this process, it's important that your dataset accurately reflects the real-world environment in which the model will be used, from the characteristics of the data to its distribution across classes. As a result, quality assessment for training data focuses on processes that develop clean, annotated and well-distributed datasets – a mirror image of the data your model will encounter after it's implemented.

Cleaning and assessing your data in this way isn't always the most exciting part of machine learning, but it's absolutely crucial if you want to build a useful model. Let's unpack quality for machine learning training data and explore what that looks like in your dataset.
While these general themes are present in all great datasets, it’s important to remember
that every project is different – and that quality means something slightly different for
each project as a result. Always make sure that you have a thorough understanding of
your project specifications, as this will help you to build a customized QA process that will
improve your dataset.
Mitigate bias
Bias occurs when the data you have doesn’t accurately reflect
the conditions your model will operate in. It comes in many
forms, from gender or racial bias to observer and selection
bias. It’s extremely important that you avoid all forms of bias for
your model to work correctly.
Since every model has a different task or specialism, the process of data
cleansing will look different from project to project. However, there are certain
data cleansing tasks which are required for every dataset. They target specific
structural problems that are present in many datasets and are extremely
important to ensure your project’s success. If you spot any of these problems in
your data, it’s crucial that you take action to solve them:
Duplication
Duplicate data points skew your dataset's distribution and cause the model to overweight repeated examples. Deduplicate your data wherever the same record appears more than once.

Outliers
Outliers are data points that differ dramatically from the rest of your dataset. Some are genuine and informative, while others are the result of collection or entry errors, so investigate them before deciding whether to keep, correct or remove them.
Mislabeled classes
In some cases, you may have classes that have been mislabeled within your data, leading to messy class distribution. For example: 'USA', 'usa', and 'the US' could all be separate classes within a dataset. Inconsistent capitalization and similar errors can lead to multiple classes within the dataset that denote the same thing. Fix any typos, capitalization or mislabeled classes in your dataset for a cleaner, more accurate distribution.
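A minimal sketch of this kind of label normalization – the variant spellings and canonical mapping below are illustrative, not a fixed list:

```python
# Map variant spellings seen in the data to one canonical class label.
CANONICAL = {"usa": "USA", "us": "USA", "the us": "USA", "u.s.a.": "USA"}

def normalize_label(label):
    """Collapse case/whitespace variants into a single canonical class."""
    cleaned = label.strip().lower()
    return CANONICAL.get(cleaned, label.strip())

labels = ["USA", "usa", "the US", "Canada"]
print([normalize_label(l) for l in labels])  # ['USA', 'USA', 'USA', 'Canada']
```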
Missing values
Some of your data may have fields which are missing information. It’s important
not to ignore these, since most algorithms can’t accept missing values. How
you handle missing values will depend on the model you’re trying to build, the
variable which is missing data and how much data is missing. Based on these
considerations, there are several possible strategies, including deleting rows with
missing data, deleting the variable or replacing missing values. Make sure to do
some thorough analysis before making your choice.
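As a minimal illustration of two of those strategies – mean imputation and row deletion – in plain Python, with hypothetical car-listing records (libraries like pandas offer richer equivalents such as `fillna` and `dropna`):

```python
def impute_mean(rows, field):
    """Replace missing values (None) in one field with the field's mean."""
    present = [r[field] for r in rows if r[field] is not None]
    mean = sum(present) / len(present)
    return [{**r, field: mean if r[field] is None else r[field]} for r in rows]

def drop_missing(rows, field):
    """Alternative strategy: drop any row where the field is missing."""
    return [r for r in rows if r[field] is not None]

cars = [{"make": "VW", "age": 4},
        {"make": "Ford", "age": None},
        {"make": "Kia", "age": 8}]
print(impute_mean(cars, "age"))        # Ford's age becomes the mean, 6.0
print(len(drop_missing(cars, "age")))  # 2
```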
Product: Harry Potter and the Chamber of Secrets
Category: Fiction - Young Adult
1. F1 score
The F1 score is the harmonic mean of precision (the share of items labeled as relevant that actually are relevant) and recall (the share of all relevant items that were successfully labeled). Combining the two into a single number makes it a useful summary measure of labeling quality.
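As a sketch of the arithmetic, assuming label decisions have been scored as true positives, false positives and false negatives:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 80 true positives, 20 false positives, 40 false negatives:
# precision = 0.8, recall ≈ 0.667, F1 ≈ 0.727
print(round(f1_score(80, 20, 40), 3))
```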
2. Inter-annotator agreement
Also called inter-rater reliability, this measures the degree of agreement amongst
your team of annotators. This is particularly useful for subjective tasks that could
be labeled differently by a range of annotators, such as sentiment analysis.
There are several statistical methods for doing this, but one of the more popular
amongst data scientists is Krippendorff’s alpha.
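Krippendorff's alpha handles multiple annotators and missing labels; for a flavor of how chance-corrected agreement works, here is the simpler two-annotator case, Cohen's kappa, with hypothetical sentiment labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two annotators,
    corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(a, b), 2))  # 0.33
```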
A gold standard is a set of data that reflects the ideal labeled data for your task. It’s
usually put together by one of your data scientists, who understands exactly what
the dataset needs to achieve. Gold standards enable you to measure your team’s
annotations for accuracy, while also providing a useful reference point as you
continue to measure the quality of your output during the project. They can also be
used as test datasets to screen annotation candidates.
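A gold-standard check can be as simple as comparing a candidate's labels against the reference set – a sketch with hypothetical message IDs and labels:

```python
def gold_standard_accuracy(annotations, gold):
    """Fraction of gold-standard items where the candidate's label matches."""
    matches = sum(annotations[item] == label for item, label in gold.items())
    return matches / len(gold)

gold = {"msg1": "spam", "msg2": "not_spam", "msg3": "spam"}
candidate = {"msg1": "spam", "msg2": "not_spam", "msg3": "not_spam"}
print(round(gold_standard_accuracy(candidate, gold), 2))  # 0.67
```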
Statistical representations can enable you to easily pinpoint outliers in your newly-
annotated data for further review. While these often indicate human error from
annotation, you should confirm this before taking action. After all, genuine outliers
can be an important source of information for your model.
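One common statistical representation is a z-score: flag any value more than a few standard deviations from the mean, then review those items by hand. A sketch, using hypothetical per-item annotation times:

```python
import statistics

def flag_outliers(values, z_threshold=3.0):
    """Return values more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

# Seconds an annotator spent per item; 900 is suspiciously long
# and worth reviewing before deciding whether it's an error.
times = [30, 35, 28, 32, 31, 29, 33, 900]
print(flag_outliers(times, z_threshold=2.0))  # [900]
```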
This involves asking multiple annotators to label the same data point. Multipass
allows you to directly measure consistency within your pool of annotators and is
particularly useful for improving quality for subjective tasks. If done to a large enough
extent, it can allow you to make assumptions about the overall level of agreement in
your dataset.
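A sketch of how multipass annotations might be resolved and measured – majority voting here is one simple policy among several:

```python
from collections import Counter

def majority_label(labels):
    """Resolve overlapping annotations for one item by majority vote."""
    return Counter(labels).most_common(1)[0][0]

def agreement_rate(labels):
    """Fraction of annotators who chose the majority label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

passes = ["positive", "positive", "negative", "positive"]
print(majority_label(passes), agreement_rate(passes))  # positive 0.75
```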
Most data labeling projects spend a significant amount of time reviewing the output
of each annotator in the team. One way to do this is to implement self-agreement
checks, where you give the same annotator the same piece of data twice to see if
they label it the same way the second time around. By doing this multiple times with
different pieces of data, you can observe annotator consistency, begin to target
sections of the dataset which are likely to have errors and give your annotators
advice that improves the quality of their work.
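A minimal sketch of such a self-agreement check, with hypothetical message IDs and labels for the two passes:

```python
def self_agreement(first_pass, second_pass):
    """Fraction of repeated items an annotator labeled the same way both times."""
    shared = first_pass.keys() & second_pass.keys()
    same = sum(first_pass[i] == second_pass[i] for i in shared)
    return same / len(shared)

first = {"msg1": "spam", "msg2": "not_spam", "msg3": "spam", "msg4": "spam"}
second = {"msg1": "spam", "msg2": "spam", "msg3": "spam", "msg4": "spam"}
print(self_agreement(first, second))  # 0.75
```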
Free options
Most of the time, open datasets consist of information that is publicly available
through government sites or social media. While there are an increasing number
of useful open datasets available online, there will be times where free options
can’t get you the training data you need. Luckily, there are other inexpensive ways
to create custom datasets for your specific use cases.
Before opting to outsource training data services, you should first check to see what in-house options you have available and if they'll help you to create the datasets that you need. For example, if you're building a chatbot to handle online inquiries, you should get in touch with your customer service department to see if they have stored chat logs or email threads you can use to train your model. Of course, data availability depends highly on the problem you are trying to solve with your machine learning project.

Before you look for datasets elsewhere, you should also try to repurpose the data you already have to build a larger dataset. One common way to do this is through data augmentation. For image datasets especially, there are numerous simple ways to increase your training data through simple image rotations, color contrasts and other image manipulations.
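As an illustration of the augmentation idea, here is a dependency-free sketch that treats an image as a plain pixel grid; real projects would typically use a library such as Pillow or torchvision for these transforms:

```python
def rotate90(img):
    """Rotate a row-major pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror the image left-to-right."""
    return [row[::-1] for row in img]

def augment(dataset):
    """Each (image, label) pair yields three extra labeled examples."""
    out = []
    for img, label in dataset:
        out += [(img, label),
                (rotate90(img), label),
                (rotate90(rotate90(img)), label),
                (flip_horizontal(img), label)]
    return out

img = [[1, 2],
       [3, 4]]
print(len(augment([(img, "cat")])))  # 4
print(rotate90(img))  # [[3, 1], [4, 2]]
```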
Paid options
Sometimes free and internal options aren’t able to provide you with machine
learning datasets at the scale and quality you require. In these cases, it’s often more
efficient to simply outsource training data from a data annotation company rather
than building a data collection and annotation infrastructure on your own. Luckily,
there are a variety of training data outsourcing options available to you.