Beginner's Guide To Implementing A Simple Machine Learning Project - DeV Community
A few days ago, I completed the "Machine Learning with Python" course by
IBM on Coursera. Because this field can seem quite challenging to
many, I decided to write a set of simple, concise, and fun articles to share
my knowledge and guide new learners in the machine learning field!
In this easy, fun guide, I will unravel the mystery of machine learning by
helping you create your first project: predicting customer categories for a
telecommunications provider.
Before we begin, you can use Google Colab to run all the code provided
here. This will allow you to execute the code in the cloud and analyze the
output directly without using your machine's resources.
Project Overview
The telecommunications provider has segmented its customers into four groups:
1. Basic Service
2. E-Service
3. Plus Service
4. Total Service
Our objective in this project is to build a classifier that can classify new
customers into one of these four groups based on previously categorized
data. For this, we will use a specific type of classification algorithm called
K-Nearest Neighbors (kNN).
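To give a feel for the idea before we use a library, here is a rough from-scratch sketch of how a kNN vote can work. It is not how scikit-learn implements it; the function name knn_predict and the toy arrays are made up for illustration, and plain Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up): two features, two well-separated groups labeled 1 and 2
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_demo = np.array([1, 1, 1, 2, 2, 2])

print(knn_predict(X_demo, y_demo, np.array([1.1, 1.0])))  # point near the first group
print(knn_predict(X_demo, y_demo, np.array([5.1, 5.0])))  # point near the second group
```

A new point simply takes the label that most of its closest neighbors already have.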
Data
Of course, machines learn from the data you give them; that is the
essence of machine learning algorithms. They analyze data, learn patterns
and relationships, and make predictions based on what they've learned.
Pandas is a Python library used for working with data sets. It has functions
for analyzing, cleaning, exploring, and manipulating data.
So, go ahead and open Google Colab, create a new notebook, and in the
first code cell, type:
import pandas as pd
import matplotlib.pyplot as plt  # For creating visualizations and plots

# Load the telecom customer dataset (the URL must stay on a single line)
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/teleCust1000t.csv')
df.head()
As you can see in the first five rows, each customer has attributes such as
region, age, and marital status. We will use these attributes to predict the
category of a new customer. In machine learning terminology, these attributes
are called features. The field we want to predict is known as the label (or
target); here it is custcat (short for customer category), which has four
possible values corresponding to the four customer groups we discussed earlier.
Let's perform a quick analysis with Pandas to see how many customers of each
class are in our dataset. Type the following in a new code cell to get the result:
df['custcat'].value_counts()
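If you want to see what value_counts() does without loading the full dataset, here is a small sketch on a made-up Series. The values in toy_labels are invented for illustration and are not the real class counts.

```python
import pandas as pd

# Toy labels standing in for the custcat column (values are made up)
toy_labels = pd.Series([1, 3, 3, 4, 2, 3, 1, 4], name="custcat")

# Number of rows per class, most frequent class first
counts = toy_labels.value_counts()

# Same breakdown expressed as a fraction of all rows
shares = toy_labels.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Checking the shares is a quick way to spot whether one class dominates the dataset.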
We will use the X variable to store our feature set, and y to store our labels:
import numpy as np
X = df[['region', 'tenure', 'age', 'marital',
'address', 'income', 'ed', 'employ', 'retire',
'gender', 'reside']].values
The line of code above selects specific columns from the DataFrame df to
be used as features for our machine learning model.
To see a sample of the data, let's output the first five rows of the array X
using X[0:5]. We then store the labels in y and look at its first five values:
y = df['custcat'].values
y[0:5]
Data Standardization
Standardizing our data puts all features on a comparable scale. Why does that
matter? Imagine our raw data has age in years, income in dollars, and height
in centimeters; a distance-based algorithm like kNN may give more importance
to features with larger scales. This can skew the results and lead to a biased model.
The line of code below scales and normalizes the data. It standardizes the
features so they have a mean of 0 and a standard deviation of 1. This is a
good practice in general.
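The scaling line itself is not shown in this article. A minimal sketch of what such a step could look like, assuming scikit-learn's StandardScaler (the toy array X_toy and its values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: columns are age (years) and income (made-up values)
X_toy = np.array([[25, 30000],
                  [40, 60000],
                  [55, 90000]], dtype=float)

# Learn each column's mean and standard deviation, then rescale the columns
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

print(X_scaled.mean(axis=0))  # column means after scaling
print(X_scaled.std(axis=0))   # column standard deviations after scaling

# Assumption: on the real data the equivalent step would be something like
# X = StandardScaler().fit_transform(X.astype(float))
```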
Again, don't worry if the line above seems complex; the important thing
right now is to understand why it matters.
Train-Test Split
Since we only have one dataset, a train/test split divides it into
training data (the data we give our model so it can learn patterns)
and testing data (held-out data we use to check the model's predictions).
In future articles, I’ll cover various and more in-depth evaluation techniques
for models, including Train/Test Split, K-Fold Cross-Validation, and more.
Now let's use our imported function and explain what is happening in the
line below:
train_test_split() divides the features (X) and labels (y) into two
parts: 80% for training the model (X_train and y_train) and 20% for testing
it (X_test and y_test). The random_state=4 ensures that each time you run
the code, the split result is the same.
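The split described above can be sketched as follows, assuming train_test_split comes from sklearn.model_selection. The arrays X_toy and y_toy are made-up stand-ins; in your notebook you would pass the real X and y instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y: 10 rows, 2 features, labels 1 to 4 (made up)
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2])

# 80% of the rows go to training, 20% to testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4
)

print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
```

With 10 toy rows, 8 end up in the training set and 2 in the test set.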
To see the size (number of rows and columns) of our train and test dataset,
we will use the .shape attribute:
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
Classification
Training
The block of code below creates a K-Nearest Neighbors model. The fit()
method trains the model with the training data (X_train) and its
corresponding labels (y_train), allowing the model to learn from this
data:
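from sklearn.neighbors import KNeighborsClassifier

k = 4  # number of neighbors that vote on each prediction
neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)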
Predicting
Accuracy evaluation
y_predict[0:5] outputs the first five predictions; you can compare them
yourself with y_test[0:5].
The accuracy scores in our output say that the model correctly predicted
the labels for about 54.75% of the training data and for about 32% of the
test data.
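The prediction and scoring code is not shown in this article. Here is a rough sketch of those two steps, assuming scikit-learn's predict() method and accuracy_score function. It runs on synthetic make_blobs data rather than the telecom dataset, so the numbers it prints will differ from the ones quoted above, and the _demo variable names are made up to avoid overwriting your own variables.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 rows, 4 classes (not the real telecom dataset)
X_demo, y_demo = make_blobs(n_samples=200, centers=4, random_state=4)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

# Train a kNN model with k = 4, as in the article
neigh_demo = KNeighborsClassifier(n_neighbors=4).fit(X_train_demo, y_train_demo)

# Predict labels for the held-out test rows
y_predict_demo = neigh_demo.predict(X_test_demo)
print(y_predict_demo[0:5])  # first five predictions
print(y_test_demo[0:5])     # first five true labels, for comparison

# Accuracy = fraction of rows whose predicted label matches the true label
train_acc = accuracy_score(y_train_demo, neigh_demo.predict(X_train_demo))
test_acc = accuracy_score(y_test_demo, y_predict_demo)
print("Train set accuracy:", train_acc)
print("Test set accuracy:", test_acc)
```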
The large gap between training and test accuracy might suggest
overfitting, where the model learns the training data too well but struggles
with new, unseen data.
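One common way to explore that gap is to retrain with several values of k and compare both accuracies. Below is a sketch on synthetic data, assuming scikit-learn's make_classification as a stand-in for the telecom dataset; the variable names and the range of k values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data with 11 features (a stand-in, not the telecom dataset)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=11, n_informative=4,
    n_classes=4, random_state=4
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=4
)

# Train one model per k and record (k, train accuracy, test accuracy)
results = []
for k in range(1, 10):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    results.append((k, train_acc, test_acc))
    print(f"k={k}: train={train_acc:.3f}, test={test_acc:.3f}")
```

With k=1, each training point is its own nearest neighbor, so training accuracy is essentially perfect while test accuracy is typically lower. That is the overfitting pattern in its most extreme form.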
Conclusion
I really hope this guide was easy to follow and I hope it helped you learn
something new.
Feel free to drop your thoughts or questions in the comments. I’m here to
help and would love to hear about your experiences and progress.
Happy coding!