Beginner’s Guide to Implementing a Simple Machine Learning Project

A few days ago, I completed the "Machine Learning with Python" course by
IBM on Coursera. Because this field can seem quite challenging for many, I
decided to write a set of simple, concise, and fun articles to share my
knowledge and guide new learners in the machine learning field!

In this easy, fun, and simple-to-understand guide, I will unravel the mystery
of machine learning by helping you create your first project: predicting
customer categories for a telecommunications provider.

Setting Up Your Environment

Before we begin, you can use Google Colab to run all the code provided
here. This will allow you to execute the code in the cloud and analyze the
output directly without using your machine's resources.

Project Overview

Imagine a telecommunications provider has segmented its customer base
by service usage patterns, categorizing the customers into four groups:

1. Basic Service

2. E-Service

3. Plus Service

4. Total Service

Why Categorize (or Classify) Customers?


Categorizing customers allows a company to tailor offers based on specific
needs. For example, new customers might receive welcome discounts,
while loyal customers could get exclusive early access to sales.

Our objective in this project is to build a classifier that can assign new
customers to one of these four groups based on previously categorized
data. For this, we will use a specific type of classification algorithm called
K-Nearest Neighbors (KNN).

To learn more about different ML algorithms, check out this informative
article: Types of Machine Learning.

Anyway, let the fun begin!

Data

Of course, machines learn from the data you give them; that is the
essence of machine learning algorithms. They analyze data, learn patterns
and relationships, and make predictions based on what they've learned.

Downloading and Understanding Our Dataset

To make it easier to understand our dataset, let's first download it and
output the first five rows using the Pandas library.

Pandas is a Python library used for working with data sets. It has functions
for analyzing, cleaning, exploring, and manipulating data.

So, go ahead and open Google Colab, create a new notebook, and in the
first code cell, type:

import pandas as pd
import matplotlib.pyplot as plt  # For creating visualizations and plots

and then run it to import the libraries.


Next, in the second code cell, type the following code to read the dataset
from the provided URL and display the first five rows:

df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/teleCust1000t.csv')
df.head()

As you can see in the first five rows, we have customers with attributes
such as region, age, and marital status. We will use these attributes to
predict a new case. In machine learning terminology, these attributes are
called features, while the target field (the attribute we want to predict) is
known as the label. Here, the label is custcat (short for customer
category), which has four possible values corresponding to the four
customer groups we discussed earlier.
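
If you'd like a quick, optional look at every column name before we go
further (assuming the dataset loaded as expected), you can run:

print(df.columns.tolist())  # 'custcat' is the label; the other columns are candidate features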

Let's perform a quick analysis with Pandas to see how many customers of
each class are in our dataset. Type the following in a new code cell to get the result:

df['custcat'].value_counts()
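
Since we already imported matplotlib earlier, we can also visualize this
class distribution with a quick, optional bar chart:

# Plot the number of customers in each category
df['custcat'].value_counts().plot(kind='bar')
plt.xlabel('Customer category (custcat)')
plt.ylabel('Number of customers')
plt.show()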

Extracting Features and Labels

We will use the variable X to store our feature set, and y to store our labels:

import numpy as np
X = df[['region', 'tenure', 'age', 'marital',
'address', 'income', 'ed', 'employ', 'retire',
'gender', 'reside']].values

The line of code above selects specific columns from the DataFrame df to
be used as features for our machine learning model.
To see a sample of the data, let's output the first five rows of the array X
using X[0:5]:

array([[  2.,  13.,  44.,   1.,   9.,  64.,   4.,   5.,   0.,   0.,   2.],
       [  3.,  11.,  33.,   1.,   7., 136.,   5.,   5.,   0.,   0.,   6.],
       [  3.,  68.,  52.,   1.,  24., 116.,   1.,  29.,   0.,   1.,   2.],
       [  2.,  33.,  33.,   0.,  12.,  33.,   2.,   0.,   0.,   1.,   1.],
       [  2.,  23.,  30.,   1.,   9.,  30.,   1.,   2.,   0.,   0.,   4.]])

Now, let's store the values of our label in y:

y = df['custcat'].values
y[0:5]

Data Standardization

Normalizing our data helps convert it into a uniform format. Why? Because
if our raw data has age in years, income in dollars, and height in
centimeters, the algorithm may give more importance to features with
larger scales. This can skew the results and lead to a biased model.

The line of code below scales and normalizes the data. It standardizes the
features so they have a mean of 0 and a standard deviation of 1. This is a
good practice in general.

First, make sure to install the scikit-learn library:

pip install scikit-learn

Scikit-learn is a Python library that provides many unsupervised and
supervised learning algorithms.

Then, go ahead and run the code below and see in the output how all the
values are unified on a similar scale:

from sklearn import preprocessing

X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]

Again, don't worry if the line above seems complex; the important thing
right now is to understand why we do it.
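
If you're curious what StandardScaler actually computes, here is a rough
equivalent written by hand with NumPy (just an illustration on the raw
feature values; in practice, stick with scikit-learn):

# Rebuild the raw feature array, then standardize each column by hand:
# subtract the column mean and divide by the column standard deviation
X_raw = df[['region', 'tenure', 'age', 'marital', 'address', 'income',
            'ed', 'employ', 'retire', 'gender', 'reside']].values.astype(float)
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X_manual[0:5]  # should closely match the scaled X above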

Train-Test Split

Since we only have one dataset, the train/test split involves dividing it into
training data (the data our model will learn patterns from) and testing data
(the data we will use to evaluate its predictions).

In future articles, I’ll cover various and more in-depth evaluation techniques
for models, including Train/Test Split, K-Fold Cross-Validation, and more.

So, to perform the train/test split, we import the function from the scikit-learn library:

from sklearn.model_selection import train_test_split

And now let's use our imported function and explain what is happening in the
line below:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

train_test_split() divides the features (X) and labels (y) into two
parts: 80% for training the model (X_train and y_train) and 20% for testing
it (X_test and y_test). The random_state=4 ensures that each time you run
the code, the split result is the same.

To see the size (number of rows and columns) of our train and test dataset,
we will use the .shape attribute:
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Classification

As I mentioned before, we will use the K-Nearest Neighbors (KNN)
algorithm to train our model and predict the classes of new customers.
(If you want to know more about how this algorithm works, just leave a
comment; I'll be glad to write an article about it! In the meantime, the
sketch below gives a rough idea.)
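
Here is a minimal, from-scratch sketch of the intuition behind KNN (not
how scikit-learn implements it): to classify a new point, find the k closest
training points and take a majority vote of their labels.

import numpy as np

def knn_predict(X_train, y_train, x_new, k=4):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Labels of the k nearest training points
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote: the most frequent label among the neighbors wins
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]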

Good news! We don't need to implement the algorithm ourselves, as we
can easily import the model from the scikit-learn library:

from sklearn.neighbors import KNeighborsClassifier

Training

The block of code below creates a K-Nearest Neighbors model. The fit()
method trains the model with the training data (X_train) and its
corresponding labels (y_train), allowing the model to learn from this
data:

k = 4
neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

Predicting

Now, let's make some predictions!

Let's create a variable named y_predict to store our model’s predictions
on the test data we created earlier. We will then compare its results with
y_test, which holds the real, correct values. In other words, we will
compare our predictions with the actual values to test our model’s
accuracy:
y_predict = neigh.predict(X_test)
y_predict[0:5]

Accuracy Evaluation

y_predict[0:5] will output the first five prediction results; you can
compare them yourself with y_test.

But comparing results manually can be error-prone and inconsistent.
That’s why scikit-learn offers various accuracy evaluation methods that
standardize this process. For example, metrics.accuracy_score
calculates the proportion of correct predictions, providing a clear and
objective measure of how well your model performs.
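
Under the hood, accuracy is simply the fraction of predictions that match
the true labels; as an optional sanity check, you could compute it by hand:

import numpy as np
# Fraction of test predictions that equal the true labels
print("Manual test accuracy:", np.mean(y_predict == y_test))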

Let's use metrics.accuracy_score to see how well our trained model
is working:

from sklearn import metrics

print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, y_predict))

The accuracy scores in the output tell us that the model correctly predicted
the labels for about 54.75% of the training data, but only about 32% of the
test data.

The large gap between training and test accuracy might suggest
overfitting, where the model learns the training data too well but struggles
with new, unseen data.

Of course, if time permits, I will discuss strategies to improve a model’s
accuracy on new, unseen data in future articles. One simple strategy you
can already experiment with is tuning k, as sketched below.
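
As a teaser, here is a minimal sketch of that idea (the best k for this
dataset may differ; this just tries a handful of values and prints their test
accuracy):

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Try several values of k and see which gives the best test accuracy
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
    print(f"k = {k}: test accuracy = {accuracy:.3f}")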

Conclusion
I really hope this guide was easy to follow and that it helped you learn
something new.

Always keep experimenting and exploring new techniques. There’s always
something new to learn in machine learning, and I’m excited to share more
with you in future articles.

Feel free to drop your thoughts or questions in the comments. I’m here to
help and would love to hear about your experiences and progress.

Happy coding!

Bye for now!
