The Essential Guide To Training Data
In spam detection, the input is an email or text message, while the label indicates whether the message is Spam or Not Spam. For example, the message "Tokyo grandmas are making 2 million yen a month from this CRAZY scheme!!!" would be labeled Spam.

In text categorization, sentences provide the input while the target suggests the topic of the sentence, such as finance or law. For example, "Despite an early red card, the champions were two goals up at half-time." would be labeled Sports.

The same applies to longer passages: the paragraph "World War II was a global war that involved most of the world's countries, including the Allied and Axis powers. With 40 to 50 million deaths recorded, it is still regarded as the largest and deadliest war in human history." would be labeled World War II.
It’s easy to see from these examples that datasets are often highly specialized.
If two different AI programs use the exact same training data, it can result in at
least one suboptimal model. This is true even if both programs deal with the same
broad category of input information, such as sentences.
Validation data is primarily used to determine whether your model can correctly identify new data or if it's overfitting to your original dataset. (Typical splits reserve the bulk of the data for training and smaller portions for validation and testing, e.g. 70/15/15 or 60/20/20.)
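As a rough illustration of carving a dataset into training, validation and test portions – the 70/15/15 proportions and plain-Python approach below are illustrative, not prescriptive:

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the data, then carve out test and validation portions."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Libraries such as scikit-learn provide equivalent helpers (e.g. `train_test_split`), including stratified splits that preserve class distribution.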
Think of data as your not-so-secret weapon. The better your data is, the better your
model will be.
The parameters that contribute to this network of meaning are different for every
dataset. Usually, they’re so numerous that it’s impossible to work them all out ahead
of time. However, there are some factors that often have a high degree of influence
on the size of your dataset. They are as follows:
For every parameter that your model needs to account for, it will need more training data. For example, a model that identifies the make of a car has a fairly small, set number of parameters that mostly relate to the vehicle's shape. However, a model that has to determine the cost of that car has to understand a far bigger picture, including the car's age, condition and any other economic or social factors that might impact the price. The higher number of parameters involved here means that the second model requires a larger dataset.

Different levels of complexity also require different training methods. Many traditional machine learning algorithms use structured learning, which has a fairly low ceiling for additional data. With this method, you'll quickly find that more data has very little impact. In contrast, models that use unsupervised learning can improve without structure and figure out their own parameters. This method requires a lot more data and also extends the learning curve where further data can have a positive impact.
Despite having the same raw data, one task yields five
times more labels than the other. If one single data point
can contain a large number of labels, then you might be OK
with a smaller overall dataset.
Of course, the task you want your model to complete has a big impact on the amount of data you need to collect. This is because some models need to have a higher level of performance on edge cases and a lower rate of error. Think of the difference between a model that predicts the weather and one that identifies patients who are at imminent risk of heart attacks. One of these has a much lower threshold for error than the other. The lower your acceptable level of risk, the more data you'll need to ensure that risk is mitigated.

In the real world, a model can encounter a wide variety of input data. A chatbot, for example, has to be able to understand a variety of languages written in formal, informal or even grammatically incorrect styles. It has to be able to understand all of them in order to provide a high level of customer service. In cases where your model's input won't be highly controlled, you'll need more data to help the model function in that unpredictable environment.

The deciding factor for how much data you'll need is your project's unique requirements and goals. Each project requires a unique balance of all of these influencing factors, which you'll have to figure out for yourself when coming up with that target dataset size. Keeping this in mind, let's now dive into some of the ways that you can begin to figure out your data needs.
1. Rule of 10
A widely cited rule of thumb suggests collecting roughly ten times as many training examples as your model has parameters (or degrees of freedom).
If you want to make a more evidence-based decision and you already have access
to some data, you could create a study to evaluate your model’s ability based on the
size of your dataset. By plotting results on a graph after each iteration, you can start
to see a relationship between dataset size and your model’s ability. This will help
you to identify the point where more data starts to provide diminishing returns. This
approach is slightly more technical, involving the creation of a few logistic regression
problems, but it should give you a more reliable result.
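One way to sketch such a study is to train on progressively larger slices of your data and record accuracy on a held-out test set. The tiny nearest-centroid classifier and synthetic data below are stand-ins, purely to show the loop; a real study would use your actual model and dataset:

```python
import random

def nearest_centroid_accuracy(train, test):
    """Tiny 1-D nearest-centroid classifier: predicts the class whose
    training mean is closest to the point."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    centroids = {y: sum(xs) / len(xs) for y, xs in groups.items()}
    correct = sum(
        1 for x, y in test
        if min(centroids, key=lambda c: abs(centroids[c] - x)) == y
    )
    return correct / len(test)

rng = random.Random(0)
# Synthetic data: class 0 clusters near 0.0, class 1 near 1.0.
data = [(rng.gauss(y, 0.5), y) for y in (0, 1) for _ in range(500)]
rng.shuffle(data)
test, pool = data[:200], data[200:]

# Evaluate at increasing dataset sizes; plot these to see where
# additional data starts to give diminishing returns.
for size in (10, 50, 200, 800):
    acc = nearest_centroid_accuracy(pool[:size], test)
    print(f"{size:4d} training examples -> accuracy {acc:.2f}")
```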
In the end, an iterative approach to building a dataset is probably the best. Just
start working on your model with the data you have, and add more as it becomes
necessary. Once you start seeing results, your needs will clarify themselves.
Even if you manage to source the perfect number of data points for your training
program, it’s highly unlikely that your dataset will be ready to train your model. In
fact, data requires a lot of processing before it can have an impact. We’ll explore
exactly what that looks like in the next section.
Read any article about training data and you'll quickly come across the well-worn phrase "garbage in, garbage out" (GIGO). While it's true that your model will only be as good as the data you feed it, what exactly constitutes 'garbage' is often glossed over. As a result, many people who are working with machine learning for the first time find it difficult to approach quality assurance for their training data.

In short, a high-quality dataset is one that has undergone a rigorous cleaning process and that contains all of the necessary parameters for your model to learn to do its task. These parameters will be consistent across the entire dataset, with no overlap between them. The goal of the quality assessment process is to develop a model that performs at a high level of accuracy and precision against real-world data. What constitutes high accuracy is determined in the early stages of the project, along with any additional metrics that need to be used to measure quality.

During this process, it's important that your dataset accurately reflects the real-world environment in which the model will be used, from the characteristics of the data to its distribution across classes. As a result, quality assessment for training data focuses on processes that develop clean, annotated and well-distributed datasets – a mirror image of the data your model will encounter after it's implemented.

Cleaning and assessing your data in this way isn't always the most exciting part of machine learning, but it's absolutely crucial if you want to build a useful model. Let's unpack quality for machine learning training data and explore what that looks like in your dataset.
While these general themes are present in all great datasets, it’s important to remember
that every project is different – and that quality means something slightly different for
each project as a result. Always make sure that you have a thorough understanding of
your project specifications, as this will help you to build a customized QA process that will
improve your dataset.
Mitigate bias
Bias occurs when the data you have doesn’t accurately reflect
the conditions your model will operate in. It comes in many
forms, from gender or racial bias to observer and selection
bias. It’s extremely important that you avoid all forms of bias for
your model to work correctly.
Since every model has a different task or specialism, the process of data
cleansing will look different from project to project. However, there are certain
data cleansing tasks which are required for every dataset. They target specific
structural problems that are present in many datasets and are extremely
important to ensure your project’s success. If you spot any of these problems in
your data, it’s crucial that you take action to solve them:
Duplication
Duplicate data points skew your dataset's distribution and cause the model to overweight repeated examples. Deduplicate your data wherever the same record appears more than once.

Outliers
Outliers are data points that differ dramatically from the rest of your dataset. Some are genuine and informative, while others are the result of collection or entry errors, so investigate them before deciding whether to keep, correct or remove them.
Mislabeled classes
In some cases, you may have classes that have been mislabeled within your data, leading to messy class distribution. For example: 'USA', 'usa', and 'the US' could all be separate classes within a dataset. Inconsistent capitalization and similar errors can lead to multiple classes within the dataset that denote the same thing. Fix any typos, capitalization or mislabeled classes in your dataset for a cleaner, more accurate distribution.
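A minimal sketch of this kind of label normalization – the variant spellings and canonical mapping below are illustrative, not a fixed list:

```python
# Map variant spellings seen in the data to one canonical class label.
CANONICAL = {"usa": "USA", "us": "USA", "the us": "USA", "u.s.a.": "USA"}

def normalize_label(label):
    """Collapse case/whitespace variants into a single canonical class."""
    cleaned = label.strip().lower()
    return CANONICAL.get(cleaned, label.strip())

labels = ["USA", "usa", "the US", "Canada"]
print([normalize_label(l) for l in labels])  # ['USA', 'USA', 'USA', 'Canada']
```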
Missing values
Some of your data may have fields which are missing information. It’s important
not to ignore these, since most algorithms can’t accept missing values. How
you handle missing values will depend on the model you’re trying to build, the
variable which is missing data and how much data is missing. Based on these
considerations, there are several possible strategies, including deleting rows with
missing data, deleting the variable or replacing missing values. Make sure to do
some thorough analysis before making your choice.
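As a minimal illustration of two of those strategies – mean imputation and row deletion – in plain Python, with hypothetical car-listing records (libraries like pandas offer richer equivalents such as `fillna` and `dropna`):

```python
def impute_mean(rows, field):
    """Replace missing values (None) in one field with the field's mean."""
    present = [r[field] for r in rows if r[field] is not None]
    mean = sum(present) / len(present)
    return [{**r, field: mean if r[field] is None else r[field]} for r in rows]

def drop_missing(rows, field):
    """Alternative strategy: drop any row where the field is missing."""
    return [r for r in rows if r[field] is not None]

cars = [{"make": "VW", "age": 4},
        {"make": "Ford", "age": None},
        {"make": "Kia", "age": 8}]
print(impute_mean(cars, "age"))        # Ford's age becomes the mean, 6.0
print(len(drop_missing(cars, "age")))  # 2
```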
Product: Harry Potter and the Chamber of Secrets
Category: Fiction - Young Adult
1. F1 score
The F1 score is the harmonic mean of precision (the share of items labeled as relevant that actually are relevant) and recall (the share of all relevant items that were successfully labeled). Combining the two into a single number makes it a useful summary measure of labeling quality.
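As a sketch of the arithmetic, assuming label decisions have been scored as true positives, false positives and false negatives:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 80 true positives, 20 false positives, 40 false negatives:
# precision = 0.8, recall ≈ 0.667, F1 ≈ 0.727
print(round(f1_score(80, 20, 40), 3))
```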
2. Inter-annotator agreement
Also called inter-rater reliability, this measures the degree of agreement amongst
your team of annotators. This is particularly useful for subjective tasks that could
be labeled differently by a range of annotators, such as sentiment analysis.
There are several statistical methods for doing this, but one of the more popular
amongst data scientists is Krippendorff’s alpha.
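Krippendorff's alpha handles multiple annotators and missing labels; for a flavor of how chance-corrected agreement works, here is the simpler two-annotator case, Cohen's kappa, with hypothetical sentiment labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two annotators,
    corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(a, b), 2))  # 0.33
```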
A gold standard is a set of data that reflects the ideal labeled data for your task. It’s
usually put together by one of your data scientists, who understands exactly what
the dataset needs to achieve. Gold standards enable you to measure your team’s
annotations for accuracy, while also providing a useful reference point as you
continue to measure the quality of your output during the project. They can also be
used as test datasets to screen annotation candidates.
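A gold-standard check can be as simple as comparing a candidate's labels against the reference set – a sketch with hypothetical message IDs and labels:

```python
def gold_standard_accuracy(annotations, gold):
    """Fraction of gold-standard items where the candidate's label matches."""
    matches = sum(annotations[item] == label for item, label in gold.items())
    return matches / len(gold)

gold = {"msg1": "spam", "msg2": "not_spam", "msg3": "spam"}
candidate = {"msg1": "spam", "msg2": "not_spam", "msg3": "not_spam"}
print(round(gold_standard_accuracy(candidate, gold), 2))  # 0.67
```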
Statistical representations can enable you to easily pinpoint outliers in your newly-
annotated data for further review. While these often indicate human error from
annotation, you should confirm this before taking action. After all, genuine outliers
can be an important source of information for your model.
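One common statistical representation is a z-score: flag any value more than a few standard deviations from the mean, then review those items by hand. A sketch, using hypothetical per-item annotation times:

```python
import statistics

def flag_outliers(values, z_threshold=3.0):
    """Return values more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

# Seconds an annotator spent per item; 900 is suspiciously long
# and worth reviewing before deciding whether it's an error.
times = [30, 35, 28, 32, 31, 29, 33, 900]
print(flag_outliers(times, z_threshold=2.0))  # [900]
```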
This involves asking multiple annotators to label the same data point. Multipass
allows you to directly measure consistency within your pool of annotators and is
particularly useful for improving quality for subjective tasks. If done to a large enough
extent, it can allow you to make assumptions about the overall level of agreement in
your dataset.
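A sketch of how multipass annotations might be resolved and measured – majority voting here is one simple policy among several:

```python
from collections import Counter

def majority_label(labels):
    """Resolve overlapping annotations for one item by majority vote."""
    return Counter(labels).most_common(1)[0][0]

def agreement_rate(labels):
    """Fraction of annotators who chose the majority label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

passes = ["positive", "positive", "negative", "positive"]
print(majority_label(passes), agreement_rate(passes))  # positive 0.75
```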
Most data labeling projects spend a significant amount of time reviewing the output
of each annotator in the team. One way to do this is to implement self-agreement
checks, where you give the same annotator the same piece of data twice to see if
they label it the same way the second time around. By doing this multiple times with
different pieces of data, you can observe annotator consistency, begin to target
sections of the dataset which are likely to have errors and give your annotators
advice that improves the quality of their work.
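A minimal sketch of such a self-agreement check, with hypothetical message IDs and labels for the two passes:

```python
def self_agreement(first_pass, second_pass):
    """Fraction of repeated items an annotator labeled the same way both times."""
    shared = first_pass.keys() & second_pass.keys()
    same = sum(first_pass[i] == second_pass[i] for i in shared)
    return same / len(shared)

first = {"msg1": "spam", "msg2": "not_spam", "msg3": "spam", "msg4": "spam"}
second = {"msg1": "spam", "msg2": "spam", "msg3": "spam", "msg4": "spam"}
print(self_agreement(first, second))  # 0.75
```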
Free options
Most of the time, open datasets consist of information that is publicly available
through government sites or social media. While there are an increasing number
of useful open datasets available online, there will be times where free options
can’t get you the training data you need. Luckily, there are other inexpensive ways
to create custom datasets for your specific use cases.
Before opting to outsource training data services, you should first check to see what in-house options you have available and if they'll help you to create the datasets that you need. For example, if you're building a chatbot to handle online inquiries, you should get in touch with your customer service department to see if they have stored chat logs or email threads you can use to train your model. Of course, data availability depends highly on the problem you are trying to solve with your machine learning project.

Before you look for datasets elsewhere, you should also try to repurpose the data you already have to build a larger dataset. One common way to do this is through data augmentation. For image datasets especially, there are numerous simple ways to increase your training data through simple image rotations, color contrasts and other image manipulations.
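As an illustration of the augmentation idea, here is a dependency-free sketch that treats an image as a plain pixel grid; real projects would typically use a library such as Pillow or torchvision for these transforms:

```python
def rotate90(img):
    """Rotate a row-major pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror the image left-to-right."""
    return [row[::-1] for row in img]

def augment(dataset):
    """Each (image, label) pair yields three extra labeled examples."""
    out = []
    for img, label in dataset:
        out += [(img, label),
                (rotate90(img), label),
                (rotate90(rotate90(img)), label),
                (flip_horizontal(img), label)]
    return out

img = [[1, 2],
       [3, 4]]
print(len(augment([(img, "cat")])))  # 4
print(rotate90(img))  # [[3, 1], [4, 2]]
```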
Paid options
Sometimes free and internal options aren’t able to provide you with machine
learning datasets at the scale and quality you require. In these cases, it’s often more
efficient to simply outsource training data from a data annotation company rather
than building a data collection and annotation infrastructure on your own. Luckily,
there are a variety of training data outsourcing options available to you.