AI-ML Using Py

The document outlines the importance of labeling data for machine learning, specifically focusing on label encoding in Python using the sklearn library. It details the steps for encoding labels, including importing packages, defining sample labels, creating and training a label encoder, and checking performance through encoding and decoding processes. Additionally, it contrasts labeled and unlabeled data, emphasizing the need for human expertise in labeling and the concept of semi-supervised learning to enhance model performance.

Uploaded by

Chloe Tee
Copyright
© All Rights Reserved
AI/ML Using Python

Preprocessing
Part 2
Labeling the Data
• Machine learning algorithms need their input data in a specific
format. Another important requirement is that the data is labelled
properly before it is passed to the algorithm.
• In classification, for example, every sample carries a label. Those
labels may be words, numbers, etc.
• Many machine learning functions in sklearn expect numeric labels.
Hence, if the labels are in any other form, they must be converted to
numbers.
• This process of transforming word labels into numerical form is
called label encoding.
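• To make the idea concrete, here is a minimal sketch of label encoding done by hand in plain Python, without sklearn. The helper name encode_labels is hypothetical and exists only for this illustration. It numbers the distinct labels in sorted order, which is the same convention sklearn's LabelEncoder uses.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order (illustrative helper)."""
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    # Replace every label with its integer code
    return [mapping[label] for label in labels], mapping


codes, mapping = encode_labels(['red', 'black', 'red', 'green'])
print("Mapping =", mapping)
print("Codes =", codes)
```

• In practice the sklearn encoder is preferable to a hand-written helper, because it can also convert the numbers back into the original labels.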
• Label encoding steps
• Follow these steps to encode the data labels in Python:
• Step 1: Importing the required packages
• First, import the packages needed to convert the labels into
numerical form. This can be done as follows:

import numpy as np

from sklearn import preprocessing


• Step 2: Defining sample labels
• After importing the packages, we need to define some sample labels
so that we can create and train the label encoder. We will now define
the following sample labels:

# Sample input labels
input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']
• Step 3: Creating and training the label encoder object
• In this step, we create the label encoder and fit it to the sample
labels, so that it learns which number belongs to each distinct label.
The following Python code does this:
# Creating the label encoder
encoder = preprocessing.LabelEncoder()

# Training (fitting) the encoder on the sample labels
encoder.fit(input_labels)
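• After fitting, it can help to look at what the encoder has learned. The sketch below repeats the earlier steps so that it runs on its own, then reads the encoder's classes_ attribute, which holds the distinct labels in sorted order. The variable name label_to_code is an assumption made for this illustration.

```python
from sklearn import preprocessing

input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']
encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

# classes_ lists the distinct labels in sorted order;
# the position of a label in classes_ is its numeric code
label_to_code = {str(label): code for code, label in enumerate(encoder.classes_)}
print("Mapping =", label_to_code)
```

• The mapping shows that the codes follow the alphabetical order of the labels, not the order in which the labels first appear in the input.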
• Step 4: Checking the encoder by encoding a randomly ordered list
• This step checks that the encoder works by encoding a list of labels
given in arbitrary order. The following Python code does this:
# encoding a set of labels
test_labels = ['green', 'red', 'black']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)

• The labels are printed as follows:

Labels = ['green', 'red', 'black']


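• Now we can print the list of encoded values, i.e. the word labels
converted to numbers, as follows:

# tolist() converts the NumPy integers to plain Python ints, so the
# output looks the same on every NumPy version
print("Encoded values =", encoded_values.tolist())

• The encoded values are printed as follows: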

Encoded values = [1, 2, 0]


• Step 5: Checking the encoder by decoding a set of numbers
• This step checks the encoder in the opposite direction, by decoding a
set of numbers back into word labels. The following Python code does this:

# decoding a set of values
encoded_values = [3, 0, 4, 1]
decoded_list = encoder.inverse_transform(encoded_values)
print("\nEncoded values =", encoded_values)

• The encoded values are printed as follows:

Encoded values = [3, 0, 4, 1]


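• We can then print the decoded labels using the following code:

# tolist() converts the NumPy strings to plain Python strings, so the
# output looks the same on every NumPy version
print("\nDecoded labels =", decoded_list.tolist())

• The decoded labels are printed as follows: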

Decoded labels = ['white', 'black', 'yellow', 'green']
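• The two checks can be combined into a single round trip, and it is also worth seeing what happens with a label the encoder was never fitted on. The sketch below is self-contained. The label 'purple' and the variable names original, restored and unseen_rejected are assumptions made for this illustration.

```python
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
encoder.fit(['red', 'black', 'red', 'green', 'black', 'yellow', 'white'])

# Round trip: encoding and then decoding should return the original labels
original = ['green', 'red', 'black']
codes = encoder.transform(original)
restored = encoder.inverse_transform(codes).tolist()
print("Round trip OK =", restored == original)

# A label that was not present during fit() cannot be encoded
try:
    encoder.transform(['purple'])
    unseen_rejected = False
except ValueError as error:
    unseen_rejected = True
    print("Unseen label rejected:", error)
```

• Because unseen labels raise an error, the encoder should be fitted on a list that contains every label expected later.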


Labeled vs. Unlabeled Data
• Unlabeled data consists mainly of samples of natural or human-created
objects that can easily be obtained from the world. Examples include audio,
video, photos, and news articles.
• Labeled data, on the other hand, takes a set of unlabeled data and
augments each piece of it with a meaningful tag, label, or class. For
example, a photo can be labeled according to its content, i.e., whether it
is a photo of a boy, a girl, an animal, or anything else. Labeling requires
human expertise or judgment about each piece of unlabeled data.
• In many scenarios unlabeled data is plentiful and easily obtained, while
labeled data requires a human expert to annotate it. Semi-supervised
learning attempts to combine labeled and unlabeled data to build better
models.
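• As a rough illustration of semi-supervised learning, the sketch below uses LabelPropagation from sklearn.semi_supervised. This is one possible technique, chosen here as an assumption; the slides do not name a specific algorithm. The toy data is made up: two well-separated groups of points, with only one labeled sample in each group. Following the sklearn convention, unlabeled samples are marked with -1.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Toy 1-D data: two well-separated groups of points (made up for illustration)
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]])

# Only one sample per group is labeled; -1 marks an unlabeled sample
y = np.array([0, -1, -1, -1, 1, -1, -1, -1])

# Spread the two known labels to nearby unlabeled samples
model = LabelPropagation(kernel='rbf')
model.fit(X, y)

# transduction_ holds the label inferred for every training sample
inferred = model.transduction_.tolist()
print("Inferred labels =", inferred)
```

• The model uses the two labeled samples together with the position of the unlabeled ones, so each unlabeled point should take the label of the group it sits in.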
