ANALYSIS AND PREDICTION OF CUSTOMER
SEGMENTATION USING BEHAVIORAL DATA
Submitted by
DEEPA .N (22UGIT013)
Under the Guidance of
MR. T. MARIA MAHAJAN M.C.A., M.Phil.
(ASSISTANT PROFESSOR)
NEHRU ARTS AND SCIENCE COLLEGE
(Autonomous)
(Reaccredited by NAAC with "A+" Grade, ISO 9001:2015 & ISO 14001:2004 Certified)
RECOGNIZED BY UGC & AFFILIATED TO BHARATHIAR UNIVERSITY
"NEHRU GARDENS", T. M. PALAYAM, COIMBATORE – 641105.
MARCH 2025
DECLARATION
I, DEEPA N (22UGIT013), hereby declare that the Internship work entitled "Python with Machine Learning", submitted to Bharathiar University in partial fulfilment for the award of the Bachelor's Degree in Information Technology, is an independent project report done by me during the project period of my study in Nehru Arts and Science College, Coimbatore (Recognized by UGC & Affiliated to Bharathiar University), under the guidance of MR. T. MARIA MAHAJAN, during the academic year 2024-25.
PLACE : COIMBATORE
DATE:
BONAFIDE CERTIFICATE
This is to certify that the project report entitled "Customer Segment: Analysis and Prediction of Customer Segmentation Using Behavioral Data" is a bonafide work done by DEEPA N (22UGIT013) in partial fulfilment of the requirements for the award of the degree of Bachelor of Science in Information Technology, Bharathiar University, Coimbatore, during the academic year 2024-25.
Certified that we examined the candidate in the Internship Work / Viva-Voce Examination held at NEHRU ARTS AND SCIENCE COLLEGE on __________________________.

Internal Examiner                                        External Examiner
COMPANY CERTIFICATE
ACKNOWLEDGEMENT
I solemnly take this opportunity to thank all the helping hands that made me accomplish this project. First and foremost, I thank the Almighty, who is the source of knowledge and the one who guided me in completing the project work successfully.
I sincerely thank our respected CEO & Secretary, Adv. Dr. P. KRISHNAKUMAR M.A., M.B.A., Ph.D., for his invaluable support and for providing the best academic environment to undertake this project as part of the curriculum.
I sincerely thank our respected Principal Dr. B. ANIRUDHAN M.A., B.Ed., M.Phil., Ph.D., Nehru Arts and Science College, for permitting me to undertake this project work as a part of the curriculum and for giving me the best facilities and infrastructure for the completion of the course and project.
My sincere gratitude to our Dean Dr. K. SELVAVINAYAKI M.C.A, M. Phil., Ph.D. for her
support and guidance to complete the project work successfully.
My sincere thanks to Dr. J. MARIA SHYLA M.C.A, M.Phil., Ph.D., Head of the Department, for her support and motivation to complete the project work successfully.
My immense gratitude to my guide, Mr. T. MARIA MAHAJAN M.C.A., M.Phil., for his continuous support, encouragement and guidance to complete the project work successfully.
I express my sincere words of gratitude to our department staff members for their motivation to
complete the project work successfully.
I extend my sincere thanks to my parents and all my friends for their moral support, which helped me complete the project work successfully.
DEEPA N
ABSTRACT
This study focuses on customer segmentation analysis and the prediction of future segmentation
using behavioral data. The analysis leverages various data mining and machine learning
techniques to identify patterns and trends within customer interactions, purchase behaviors, and
engagement metrics.
The results provide valuable insights into customer preferences, enabling companies to
implement targeted marketing strategies, improve customer retention, and enhance overall
customer satisfaction.
This approach not only improves customer targeting but also allows businesses to anticipate
shifts in customer behavior and adapt proactively to changing market dynamics.
In conclusion, this study demonstrates the significant potential of utilizing behavioral data for
effective customer segmentation and predictive analysis. By applying advanced machine
learning algorithms, we successfully identified distinct customer groups based on their
behavioral patterns, providing valuable insights into customer preferences and tendencies.
Key words: Customer segmentation, behavioral data, machine learning, clustering, churn prediction.
TABLE OF CONTENTS
S.No    Contents
        BONAFIDE CERTIFICATE
        ACKNOWLEDGEMENT
        DECLARATION
        ABSTRACT
01      INTRODUCTION
02      SYSTEM REQUIREMENTS
03      SYSTEM STUDY
04      SYSTEM DESIGN
        4.1 MODULES
        4.3 DATASET DESIGN
05      PROGRAM SPECIFICATION
        5.1 PYTHON
        5.2.3 INSTALLATION
06      FEASIBILITY STUDY
        6.1 FEASIBILITY ANALYSIS
07      SYSTEM TESTING
09      BIBLIOGRAPHY
10      SAMPLE CODE
        10.1 OUTPUT
01. INTRODUCTION
The detection and classification of underwater objects, such as submarine rocks and naval
mines, is a critical challenge in maritime security and defense. Accurate identification of
these objects is essential to prevent accidents, protect naval assets, and ensure safe
navigation in underwater environments. Traditional detection methods, relying on sonar
and acoustic signals, often face limitations in distinguishing between natural geological
formations and man-made threats, leading to false alarms and inefficiencies.
More generally, data-driven decision making is a way for businesses to make sure their next move will benefit both them and their customers. Almost every company, especially in the tech ecosystem, has now put in place a tracking process to gather data related to their customers' behavior. The data to track varies along with the specific business model of each company and the problem or service they aim to address. By analyzing how, when and why customers behave a certain way, it is possible to predict their next steps and have time to work on fixing issues beforehand.
Churn prediction is the activity of trying to predict the phenomenon of customer loss. This prediction and quantification of the risk of losing customers can be done globally or individually and is mainly used in areas where the product or service is marketed on a subscription basis. The prediction of churn is generally done by studying consumer behaviour or by observing individual behaviour that indicates a risk of attrition. It involves the use of modelling and machine learning techniques that can sometimes use a considerable amount of data.
1.1 Context and Terminology :
For now, product usage is not yet precisely tracked on the product side, but a trend can be highlighted simply by looking at the number of calls made by the customer (outbound calls), the number of calls received by the customer (inbound calls) and the number of integrations they configured in their Aircall account, as well as the evolution of these metrics over time for a given customer.
Finally, the customers can assess the quality of Aircall's product and service by two different means. First, a form is sent to each of their users every 3 months in which they can express how much they would recommend the product to someone (ranking from 0 to 10, 0 being the most negative answer). Depending on their grade, they are then qualified as being either a promoter (graded 9 or 10), a detractor (graded 0 to 6) or neutral. Aircall then computes the Net Promoter Score (NPS), which is calculated by subtracting the percentage of customers who are detractors from the percentage of customers who are promoters. An NPS can be as low as -100 (every respondent is a detractor) or as high as +100 (every respondent is a promoter).

The business teams at Aircall are divided into five groups: Sales, Onboarding, Customer Success, Support and Marketing. They all have their importance at different moments during the customer lifetime. First, the Sales team is in charge of finding potential clients, making sure they are qualified (meaning Aircall's product could actually be useful for them), and signing the deal. Once the targeted company has become a customer, the Onboarding team has to help them configure the product on their own system so that they can have the best experience with it. From the time the company becomes a customer, it becomes the responsibility of the Customer Success team. They are the point of contact between Aircall and their customers, and are the key stakeholder when talking about churn. Indeed, their job is split into trying to upgrade customers to a better plan, or having them add more users, and preventing them from churning. Finally, the Support team is the point of contact for any kind of technical issue. They can be reached through tickets, chat or phone.

Aircall's customers are divided into two distinct categories depending on how much they bring to the company. The ones that represent more than $1K of monthly recurring revenue are defined as VIP accounts, and the other ones as Mid-Market accounts. VIP accounts represent less than 10% of Aircall's total number of customers but 50% of total monthly recurring revenue, and they are assigned to 70% of the Customer Success team. All these accounts are carefully cared for, so the Customer Success team successfully manages to prevent their churn. On the contrary, there are too many Mid-Market accounts, and too few Customer Success Managers, to handle their churn. Human resources are too limited to conduct this work, which is why the company decided to invest in some Data Analysis and Machine Learning to get insights about the Mid-Market accounts, and give this information to Customer Success Managers so that they can contact the sensitive customers before they leave.
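To make the NPS calculation described above concrete, here is a minimal sketch in Python; the survey grades are made up for illustration and the helper function is hypothetical, not part of Aircall's tooling.

```python
def net_promoter_score(grades):
    """Compute NPS from 0-10 grades: % promoters minus % detractors."""
    promoters = sum(1 for g in grades if g >= 9)    # grades 9 or 10
    detractors = sum(1 for g in grades if g <= 6)   # grades 0 to 6
    return 100 * (promoters - detractors) / len(grades)

# Made-up responses: 3 promoters, 2 detractors, 1 neutral -> NPS of about 16.7
print(net_promoter_score([10, 9, 7, 3, 6, 9]))
```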
02. SYSTEM REQUIREMENTS
● RAM Capacity : 4 GB
● Hard Disk : 90 GB
● Speed : 3.3 GHz
● Operating System : Windows 10
03. SYSTEM STUDY
System study covers the existing and proposed system details. The existing system is useful for developing the proposed system. To elicit the requirements of the system and to identify the elements, inputs, outputs, subsystems and procedures, the existing system had to be examined and analysed in detail.
This increases the total productivity. The use of paper files is avoided and all the data are efficiently manipulated by the system. It also reduces the space needed to store the larger paper files and records.
3.1 EXISTING SYSTEM:
● The prototype developed was a smart device that collected inputs at the end of the day's business directly from sales data records and automatically modified segmentation statistics.
● An ANOVA analysis was also performed to test the clusters' stability. The actual day-to-day sales figures are contrasted with the model's expected statistics.
3.2 PROPOSED SYSTEM:
● Such behaviours link to their knowledge of, attitude toward, use of, spending score, or response to a product.
04. SYSTEM DESIGN
The degree of interest in each concept has varied over the years, yet each has stood the test of time. Each provides the software designer with a foundation from which more sophisticated design methods can be applied. Fundamental design concepts provide the necessary framework for "getting it right".
During the design process the software requirements model is transformed into design models that describe the details of the data structures, system architecture, interface, and components. Each design product is reviewed for quality before moving to the next phase of software development.
4.1 MODULES:
DESCRIPTION OF MODULES:
● Data Preprocessing
● Data Exploration
● Data Cleaning
● Data Modelling
● Feature Engineering
DATASET PREPROCESSING:
Data preprocessing covers flattening the raw records, handling temporal data and encoding categorical data. Extreme values are not simply discarded; instead, an extreme value is replaced by the mean value, or the most represented value, of the related feature.
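A minimal sketch of such a replacement is shown below; the column name is hypothetical and the IQR rule used to flag extremes is an assumed choice, since the report does not fix a detection rule.

```python
import pandas as pd

# Hypothetical behavioral feature with one extreme value
df = pd.DataFrame({'monthly_calls': [12, 15, 9, 14, 400, 11]})
col = df['monthly_calls']

# Flag extreme values with the IQR rule (assumed; any outlier rule could be used)
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
extreme = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

# Replace the extreme value by the mean of the remaining observations
df.loc[extreme, 'monthly_calls'] = col[~extreme].mean()
print(df)
```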
DATA EXPLORATION:
Data exploration is an informative search used by data consumers to draw a true analysis from the information gathered. It is used to analyse the data and extract information from it. After having a look at the dataset, certain information about the data was explored. Here the records are not unique as collected, so in this module the uniqueness of the dataset is established.
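A short sketch of this kind of exploration with pandas; the file name and columns are placeholders for the actual behavioral dataset.

```python
import pandas as pd

data = pd.read_csv('customer_data.csv')  # placeholder file name

print(data.shape)                # number of rows and columns
print(data.describe())           # summary statistics of the numeric features
print(data.nunique())            # distinct values per column, to check uniqueness
print(data.duplicated().sum())   # how many duplicate rows the raw data contains
```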
DATA CLEANING:
The data cleaning module is used to detect and correct inaccurate records. It removes duplicated attributes. Data cleaning corrects dirty data, which contains incomplete or outdated values and improperly parsed record fields from disparate systems. It plays a significant part in building a model.
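A minimal cleaning sketch along these lines; the file name and the country column are hypothetical placeholders.

```python
import pandas as pd

data = pd.read_csv('customer_data.csv')  # placeholder file name

# Drop exact duplicate records and rows with missing values
data = data.drop_duplicates().dropna()

# Harmonize an inconsistently formatted text field (hypothetical column)
if 'country' in data.columns:
    data['country'] = data['country'].str.strip().str.upper()
```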
DATA MODELLING:
In the data modelling module, machine learning algorithms are used to predict the customer segments. Linear regression and the K-means algorithm were used for prediction. The user provides the ML algorithm with a dataset that includes desired inputs and outputs, and the algorithm finds a method to determine how to arrive at those results.
The linear regression algorithm is a supervised learning algorithm. It implements a statistical model that shows optimal results when the relationships between the independent variables and the dependent variable are almost linear. This algorithm is used to predict continuous target values with an increased accuracy rate.
The K-means algorithm is an unsupervised learning algorithm. It deals with correlations and relationships by analysing the available data. This algorithm clusters the data and predicts the cluster of a data point. The training dataset is taken and clustered using the algorithm, and the visualization of the clusters is plotted in a graph.
FEATURE ENGINEERING:
The feature engineering module is the process of preparing the input data for the machine learning algorithms so that they can make accurate predictions. A feature is an attribute or property shared by all the independent records on which the prediction is to be done. Any attribute can be a feature, provided it is useful to the model.
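As an illustration of deriving such features, the sketch below builds simple recency/frequency/monetary-style attributes from a hypothetical transaction table; the column names are assumptions, not the report's actual schema.

```python
import pandas as pd

# Hypothetical transaction log
tx = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2, 3],
    'amount': [20, 35, 5, 15, 10, 80],
    'date': pd.to_datetime(['2024-01-05', '2024-02-10', '2024-01-20',
                            '2024-02-01', '2024-02-15', '2024-01-30'])})

snapshot = tx['date'].max()
features = tx.groupby('customer_id').agg(
    recency_days=('date', lambda d: (snapshot - d.max()).days),  # days since last purchase
    frequency=('date', 'count'),                                 # number of purchases
    monetary=('amount', 'sum'))                                  # total spend
print(features)
```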
4.2 DATA FLOW DIAGRAM:
Data flow diagrams are used to graphically represent the flow of data in a business
information system. DFD describes the processes that are involved in a system to transfer
data from the input to the file storage and reports generation. Data flow diagrams can be
divided into logical and physical. The logical data flow diagram describes flow of data
through a system to perform certain functionality of a business. The physical data flow
diagram describes the implementation of the logical data flow.
A DFD graphically represents the functions, or processes, which capture, manipulate, store, and distribute data between a system and its environment and between the components of a system. The visual representation makes it a good communication tool between the user and the system designer. The objective of a DFD is to show the scope and boundaries of a system. The DFD is also called a data flow graph or bubble chart. It can be manual, automated, or a combination of both. It shows how data enters and leaves the system, what changes the information, and where data is stored.
+-------------------+
| Customer Database |
+-------------------+
|
v
+-----------------------------+
| Data Collection & Integration|
+-----------------------------+
|
v
+-----------------------------+
| Data Preprocessing |
| (Cleaning & Transformation) |
+-----------------------------+
|
v
+-----------------------------+
| Customer Segmentation |
| (Clustering: K-means, DBSCAN)|
+-----------------------------+
|
v
+-----------------------------+
| Prediction Modeling |
| (Churn, Purchase Likelihood)|
+-----------------------------+
|
v
+-----------------------------+
| Analysis & Insights |
| (Model Evaluation) |
+-----------------------------+
|
v
+-----------------------------------+        +-------------------------+
| Reporting & Visualization         | <----> | Business/Marketing Team |
| (Final Results, Recommendations)  |        | (Strategic Decisions)   |
+-----------------------------------+        +-------------------------+
                 ^
                 |
        Insights & Reports
4.3 DATASET DESIGN:
This phase contains the attributes of the dataset which are maintained in the database table. The dataset collection can be of two types, namely the train dataset and the test dataset.
● The design of input focuses on controlling the amount of input required, avoiding delay and keeping the process simple. The input is designed in such a way as to provide security.
● A quality output is one which meets the requirements of the user and presents the information clearly. In output design, it is determined how the information is to be displayed for immediate need.
05. PROGRAM SPECIFICATION
5.1 PYTHON:
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built in data structures, combined with dynamic typing and
dynamic binding, make it very attractive for Rapid Application Development, as well as for
use as a scripting or glue language to connect existing components together. Python's
simple, easy to learn syntax emphasizes readability and therefore reduces the cost of
program maintenance. Python supports modules and packages, which encourages program
modularity and code reuse. The Python interpreter and the extensive standard library are
available in source or binary form without charge for all major platforms, and can be freely
distributed.
5.1.1 PYTHON FEATURES :
Python has few keywords, a simple structure, and a clearly defined syntax. Python code is more clearly defined and visible to the eyes. Python's source code is fairly easy to maintain. The bulk of Python's library is very portable and cross-platform compatible on UNIX, Windows, and Macintosh. Python has support for an interactive mode which allows interactive testing and debugging of snippets of code.
● Portable : Python can run on a wide variety of hardware platforms and has the same interface on all platforms.
● Extendable : It allows adding low-level modules to the Python interpreter. These modules enable programmers to add to or customize their tools to be more efficient.
● GUI Programming : Python supports GUI applications that can be created and ported to many system calls, libraries and windowing systems, such as Windows MFC, Macintosh, and the X Window system of Unix.
● Scalable : Python provides a better structure and support for large programs than shell scripting.
● Highly Dynamic : Python is one of the most dynamic languages available in the industry today. There is no need to specify the type of a variable during coding, thus saving time and increasing efficiency.
● Extensive Array of Libraries : Python comes with many inbuilt libraries that can be imported at any instance and used in a specific program.
For customer segmentation analysis and prediction, several libraries and packages can help with clustering, classification, and analysis. Below are some of the commonly used Python packages for this task:
● KMeans (Scikit-learn): partitions customers into a chosen number of clusters based on feature similarity; the most common starting point for segmentation.
● DBSCAN (Scikit-learn): density-based clustering that finds clusters of arbitrary shape and flags outliers, without requiring the number of clusters in advance.
● AgglomerativeClustering (Scikit-learn): hierarchical clustering that successively merges the closest groups of customers.
● PCA (Principal Component Analysis - Scikit-learn): reduces the dimensionality of the behavioral features, which helps with visualization and noise reduction before clustering.
● XGBoost: gradient-boosted decision trees for classification and regression, often used to predict customer segments or churn.
● LightGBM: a fast gradient-boosting framework suited to large behavioral datasets.
● CatBoost: gradient boosting with built-in handling of categorical features.
● RandomForestClassifier (Scikit-learn): an ensemble of decision trees used to predict the segment or behavior of a customer.
● MLxtend: utilities for frequent pattern mining and association rules (e.g., Apriori), useful for analysing customer behavior.
pip install mlxtend
● apyori: a lightweight implementation of the Apriori algorithm for association rule mining.
● TensorFlow / Keras: deep learning frameworks for building neural-network models on large behavioral datasets.
● PyTorch: an alternative deep learning framework with dynamic computation graphs.
● Geopandas: extends pandas to geospatial data, useful when customer location matters.
● Folium: builds interactive maps for visualizing geographic customer distributions.
● Statsmodels: classical statistical models and tests, including time-series analysis.
● Prophet: time-series forecasting of metrics such as purchases or engagement over time.
pip install prophet
2. Feature Engineering:
Apply dimensionality reduction using PCA or t-SNE.
3. Segmentation:
Use KMeans or DBSCAN to cluster customers.
4. Modeling:
Apply predictive models like XGBoost or RandomForestClassifier to predict segments based on behavioral data.
5. Visualization:
Plot the resulting clusters, for example with matplotlib as below.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv('customer_data.csv')

# Scale the behavioral features and cluster the customers
features = ['feature1', 'feature2']
scaled = StandardScaler().fit_transform(data[features])
data['Cluster'] = KMeans(n_clusters=3, random_state=42).fit_predict(scaled)

# Plot the resulting segments
plt.scatter(data['feature1'], data['feature2'], c=data['Cluster'])
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Customer Segments')
plt.show()
This chapter covers all the basic I/O functions available in Python.
Here is a Python script that demonstrates how to perform customer segmentation analysis
and prediction using behavioral data. It covers reading input data from a CSV file,
preprocessing the data, performing clustering (KMeans in this case), and saving the results
to an output CSV file.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def load_data(file_path):
    """Load the behavioral customer data from a CSV file."""
    try:
        data = pd.read_csv(file_path)
        print(f"Data loaded successfully from {file_path}")
        return data
    except FileNotFoundError:
        print(f"File not found: {file_path}")
        return None

def preprocess_data(data, features):
    """Standardize the selected behavioral features."""
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data[features])
    return scaled_data

def cluster_customers(data, scaled_data, n_clusters=3):
    """Apply KMeans and attach the cluster labels to the data."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    data['Cluster'] = kmeans.fit_predict(scaled_data)
    return data

def visualize_clusters(data, scaled_data):
    """Reduce the features to 2D with PCA and plot the segments."""
    pca_components = PCA(n_components=2).fit_transform(scaled_data)
    plt.figure(figsize=(8, 6))
    plt.scatter(pca_components[:, 0], pca_components[:, 1], c=data['Cluster'], cmap='viridis')
    plt.xlabel('PCA Component 1')
    plt.ylabel('PCA Component 2')
    plt.title('Customer Segments')
    plt.colorbar(label='Cluster')
    plt.show()

def save_to_csv(data, output_file):
    """Save the segmented data (with cluster labels) to a CSV file."""
    try:
        data.to_csv(output_file, index=False)
        print(f"Customer segmentation results saved to {output_file}")
    except Exception as e:
        print(f"Error saving file: {e}")

def main():
    # Define the input and output file paths
    input_file = 'customer_data.csv'  # Replace with the path to your input CSV file
    output_file = 'customer_segmentation_results.csv'  # Output file to save the results
    features = ['feature1', 'feature2', 'feature3']  # Replace with the actual column names

    data = load_data(input_file)
    if data is None:
        return
    scaled_data = preprocess_data(data, features)
    data = cluster_customers(data, scaled_data)
    visualize_clusters(data, scaled_data)
    # Save the segmented data with cluster labels to a new CSV file
    save_to_csv(data, output_file)

if __name__ == "__main__":
    main()
1. Loading Data:
The `load_data` function loads the behavioral customer data from a CSV file using
`pandas.read_csv()`.
The file path is passed as an argument, and the function returns the loaded data.
2. Preprocessing Data:
The `preprocess_data` function scales the customer data using `StandardScaler` from
Scikit-learn to standardize the feature values.
Scaling ensures that the clustering algorithm (KMeans) doesn't give too much weight to
any one feature due to differences in magnitude.
3. Clustering:
The KMeans algorithm groups the customers into a chosen number of clusters (three by default here), and the resulting cluster labels are added to the data as a `Cluster` column.
4. Visualization:
The `visualize_clusters` function uses PCA (Principal Component Analysis) to reduce the
dimensionality of the feature set to 2D, allowing the visualization of the clustering results
on a 2D plot.
The clusters are color-coded to easily distinguish between different segments.
5. Saving Results:
The `save_to_csv` function saves the segmented data (including the cluster labels) to a
new CSV file.
6. Main Function:
The `main` function is the driver that calls the other functions: loading the data,
preprocessing, clustering, visualizing, and saving the results to a file.
This script assumes the data has specific behavioral features (like `feature1`, `feature2`,
`feature3`). You'll need to modify these to match the actual columns in your dataset.
Ensure you have a CSV file (e.g., `customer_data.csv`) with behavioral data. The file should
include columns that represent customer features (e.g., purchase behavior, usage
frequency, etc.).
Replace `['feature1', 'feature2', 'feature3']` in the `features` variable with the actual column
names from your dataset.
4. Results:
The customer data will be clustered into segments and saved as a new CSV file
(`customer_segmentation_results.csv`).
A 2D plot of the clusters will also be displayed.
Dependencies:
You will need to install the following Python libraries:
bash
pip install pandas scikit-learn matplotlib
Step-by-step approach:
1. Class for Data Preprocessing: To handle the cleaning and preparation of the behavioral
data.
2. Class for Segmentation: To apply clustering techniques like K-Means or DBSCAN for
customer segmentation.
3. Class for Prediction: To train a machine learning model to predict the customer segments
based on the data.
4. Object to hold data: Represent data such as customer IDs, behaviors, and clusters.
```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Data Preprocessing Class
class DataPreprocessing:
    def __init__(self, data):
        self.data = data

    def clean_data(self):
        """Function to clean the data (e.g., handling missing values)."""
        self.data.dropna(inplace=True)  # Dropping missing values
        return self.data

    def feature_scaling(self):
        """Scale features for machine learning models."""
        scaler = StandardScaler()
        scaled_data = scaler.fit_transform(self.data)
        return scaled_data

# Segmentation Class
class CustomerSegmentation:
    def __init__(self, data, scaled_data):
        self.data = data                  # cleaned DataFrame
        self.scaled_data = scaled_data    # scaled feature matrix used for clustering
        self.clusters = None

    def apply_kmeans(self, n_clusters=3):
        """Cluster the customers with KMeans on the scaled features."""
        kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        self.clusters = kmeans.fit_predict(self.scaled_data)
        return self.clusters

    def assign_clusters(self):
        """Add cluster labels to the original dataset."""
        self.data['Cluster'] = self.clusters
        return self.data

# Prediction Class
class CustomerPrediction:
    def __init__(self, data, target):
        self.data = data
        self.target = target
        self.model = None

    def train_model(self):
        """Train a model to predict customer segments based on features."""
        X = self.data.drop(columns=[self.target])
        y = self.data[self.target]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
        self.model = RandomForestClassifier(random_state=42)
        self.model.fit(X_train, y_train)
        return self.model.score(X_test, y_test)  # accuracy on the held-out test set

    def predict_segment(self, new_data):
        """Predict the segment of a new customer (same feature columns as training)."""
        return self.model.predict(new_data)

# Load the behavioral data (numeric columns such as Age, Income, Spending Score)
data = pd.read_csv('customer_data.csv')

# 1. Data Preprocessing
preprocessor = DataPreprocessing(data)
clean_data = preprocessor.clean_data()
scaled_data = preprocessor.feature_scaling()

# 2. Customer Segmentation
segmentation = CustomerSegmentation(clean_data, scaled_data)
clusters = segmentation.apply_kmeans(n_clusters=3)
segmented_data = segmentation.assign_clusters()

# 3. Customer Prediction
predictor = CustomerPrediction(segmented_data, target='Cluster')
accuracy = predictor.train_model()
print(f"Model Accuracy: {accuracy * 100:.2f}%")
```
Explanation:
1. DataPreprocessing Class:
The `clean_data()` function removes records with missing values, and the `feature_scaling()` function standardizes the features with `StandardScaler` so that they are on a comparable scale.
2. CustomerSegmentation Class:
The `apply_kmeans()` function applies the KMeans algorithm to segment customers based
on behavioral features (like Age, Income, Spending Score).
The `assign_clusters()` function adds the cluster labels to the dataset for further analysis.
3. CustomerPrediction Class:
The `train_model()` function splits the data into training and testing sets, then trains a
RandomForest model to predict customer segments based on the behavioral features.
The `predict_segment()` function predicts the customer segment for a new customer.
Workflow:
You can modify the dataset and the clustering technique to fit your actual use case and
dataset.
Syntax :
Derived classes are declared much like their parent class; however, a list of base classes to inherit from is given after the class name −
class SubClassName (ParentClass1[, ParentClass2, ...]):
'Optional class documentation string'
class_suite
Overriding Methods
You can always override your parent class methods. One reason for overriding a parent's methods is that you may want special or different functionality in your subclass.
Example :
class Vector:
    def __init__(self, a, b):
        self.a, self.b = a, b
    def __str__(self):
        return 'Vector (%d, %d)' % (self.a, self.b)
    def __add__(self, other):
        return Vector(self.a + other.a, self.b + other.b)

v1 = Vector(2, 10)
v2 = Vector(5, -2)
print(v1 + v2)
When the above code is executed, it produces the following result −
Vector (7, 8)
Data Hiding
An object's attributes may or may not be visible outside the class definition. You need to name attributes with a double underscore prefix, and those attributes are then not directly visible to outsiders.
class JustCounter:
    __secretCount = 0
    def count(self):
        self.__secretCount += 1
        print(self.__secretCount)

counter = JustCounter()
counter.count()
counter.count()
print(counter.__secretCount)
1
2
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    print(counter.__secretCount)
AttributeError: 'JustCounter' object has no attribute '__secretCount'
Python protects those members by internally changing the name to include the class name. You can access such attributes as object._className__attrName. If you replace the last line as follows, then it works for you −
print(counter._JustCounter__secretCount)
1. NumPy :
"NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays". The predecessor of NumPy is Numeric, which was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. [12] It is an open source library and free of cost.
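A brief sketch of the array support described above:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2 x 3 array
print(a.mean(axis=0))                 # column means -> [2.5 3.5 4.5]
print(a * 2)                          # element-wise arithmetic
print(a.T.shape)                      # transpose -> (3, 2)
```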
2. Pandas :
Pandas is a data analysis library for Python, written in the Python programming language. It is mostly used for data analysis and data manipulation, and it also provides data structures and time-series functionality. We can see the application of pandas in many fields such as economics, recommendation systems (Spotify, Netflix and Amazon), stock prediction, neuroscience, statistics, advertising, analytics and natural language processing. Data can be analysed in pandas in two ways:
Data frames - Here the data is two-dimensional and consists of multiple series. Data is always represented as a rectangular table.
Series - Here the data is one-dimensional and consists of a single list with an index.
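For example, a Series and a DataFrame can be created as follows (the values are made up):

```python
import pandas as pd

# One-dimensional Series with an index
spend = pd.Series([250, 420, 150], index=['C001', 'C002', 'C003'])

# Two-dimensional DataFrame made of multiple series
customers = pd.DataFrame({'age': [25, 34, 41],
                          'annual_spend': [250, 420, 150]},
                         index=['C001', 'C002', 'C003'])

print(spend['C002'])           # 420
print(customers.loc['C003'])   # the row for customer C003
```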
3. Matplotlib :
"Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy" [11]. Matplotlib provides an object-oriented API for embedding plots into applications using general-purpose graphical user interface toolkits. Another such library is pylab, which is almost the same as MATLAB. It is a library for 2D graphics, and it finds its application in web application servers, graphical user interface toolkits and the shell. Below is an example of a basic plot in Python.
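A minimal sketch of such a basic plot:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.plot(x, y, marker='o')   # simple 2D line plot
plt.xlabel('x')
plt.ylabel('y')
plt.title('Basic Matplotlib plot')
plt.show()
```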
4. SKLEARN :
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.
5.2.3 INSTALLATION :
If you have already installed NumPy and SciPy, the following are the two easiest ways to install scikit-learn −
Using pip
Following command can be used to install scikit-learn via pip −
pip install -U scikit-learn
Using conda
Following command can be used to install scikit-learn via conda −
conda install scikit-learn
On the other hand, if NumPy and SciPy are not yet installed on your Python workstation, then you can install them by using either pip or conda.
Another option to use scikit-learn is to use Python distributions like Canopy and Anaconda
because they both ship the latest version of scikit-learn.
Features
Supervised Learning algorithms − Almost all the popular supervised learning algorithms, like Linear Regression, Support Vector Machine (SVM), Decision Tree etc., are part of scikit-learn.
Unsupervised Learning algorithms − On the other hand, it also has all the popular unsupervised learning algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.
Clustering − This model is used for grouping unlabeled data.
Cross Validation − It is used to check the accuracy of supervised models on unseen data.
Dimensionality Reduction − It is used for reducing the number of attributes in data, which can be further used for summarisation, visualisation and feature selection.
Ensemble methods − As the name suggests, it is used for combining the predictions of multiple supervised models.
Feature extraction − It is used to extract the features from data to define the attributes in image and text data.
Dataset Loading
A collection of data is called a dataset. It has the following two components −
Features − The variables of the data are called its features. They are also known as predictors, inputs or attributes.
Feature matrix − It is the collection of features, in case there are more than one.
Feature names − It is the list of all the names of the features.
Response − It is the output variable that basically depends upon the feature variables. It is also known as target, label or output.
Response vector − It is used to represent the response column. Generally, we have just one response column.
Target names − They represent the possible values taken by a response vector.
Scikit-learn has a few example datasets, like iris and digits for classification and the Boston house prices for regression.
06. FEASIBILITY STUDY
6.1 FEASIBILITY ANALYSIS :
1. TECHNICAL FEASIBILITY
2. OPERATIONAL FEASIBILITY
3. ECONOMIC FEASIBILITY
TECHNICAL FEASIBILITY :
This phase focuses on the technical resources available to the organization. It helps organizations determine whether the technical resources meet capacity and whether the ideas can be converted into a working system model. Technical feasibility also involves the evaluation of the hardware, software, and other technical requirements of the proposed system.
OPERATIONAL FEASIBILITY :
This phase involves undertaking a study to analyse and determine how well the
organization’s needs can be met by completing the project. Operational feasibility study also
examines how a project plan satisfies the requirements that are needed for the phase of system
development.
ECONOMIC FEASIBILITY :
This phase typically involves a cost-benefit analysis of the project and helps the organization determine the viability and cost-benefits associated with a project before financial resources are allocated. It also serves as an independent project assessment and enhances project credibility. It helps the decision-makers determine the positive economic benefits that the proposed project will provide to the organization.
07. SYSTEM TESTING
System testing is the stage of implementation that is aimed at ensuring that the system
works accurately and efficiently before live operation commences. Testing is vital to the success
of the system. System testing makes the logical assumption that if all the parts of the system are correct, then the goal will be successfully achieved. System testing involves user training, system testing and the successful running of the developed proposed system. The user tests the developed system and changes are made per their needs. The testing phase involves testing the developed system using various kinds of data. While testing, errors are noted and corrections are made. The corrections are also noted for future use.
testing and that all logical conditions have been exercised.
When applying a model, the response variable Y can be either quantitative or qualitative. The process for predicting qualitative responses involves assigning the observation to a category, or class, and is thus known as classification. The methods used for classification often first predict the probability of each category of a qualitative variable. There exist many classification techniques, named classifiers, that can be used to predict a qualitative response. In this thesis, two of the most widely-used classifiers have been discussed: logistic regression and random forests. [9]
Logistic Regression
When it comes to classification, the probability of an observation being part of a certain class or not is determined. In order to generate values between 0 and 1, we express the probability using the logistic equation:
p(X) = exp(β0 + β1·X) / (1 + exp(β0 + β1·X))
After a bit of manipulation, we find that:
p(X) / (1 − p(X)) = exp(β0 + β1·X)
The left side of the equation is called the odds, and can take on any value between 0 and ∞. Values of the odds close to 0 and ∞ indicate respectively very low and very high probabilities. By taking the logarithm of both sides we arrive at:
log(p(X) / (1 − p(X))) = β0 + β1·X
Here, the left side is called the log-odds or logit. This function is linear in X. Hence, if the coefficients are positive, then an increase in X will result in a higher probability.
The coefficients β0 and β1 in the logistic equation are unknown and must be estimated based on the available training data. To fit the model, we use a method called maximum likelihood. The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for β0 and β1 such that the predicted probability p̂(x_i) of the target for each sample corresponds as closely as possible to the sample's observed status. In other words, the estimates β̂0 and β̂1 are chosen to maximize the likelihood function:
l(β0, β1) = Π_{i: y_i = 1} p(x_i) × Π_{i': y_i' = 0} (1 − p(x_i'))
After applying logistic regression, the accuracy of the coefficient estimates can be measured by computing their standard errors. Another performance metric is the z-statistic. For example, the z-statistic associated with β1 is equal to β̂1 / SE(β̂1), and so a large absolute value of the z-statistic indicates evidence against the null hypothesis H0 : β1 = 0.
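A hedged sketch of fitting such a logistic regression with scikit-learn; the feature matrix X and churn labels y below are synthetic stand-ins for the behavioral data described earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: X = behavioral features, y = churn labels (1 = churned)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

print(model.coef_)                      # estimated coefficients (the betas)
print(model.predict_proba(X_test)[:5])  # p(X) for the first five test samples
```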
Decision Trees
Decision trees and random forest are tree-based methods that involve segmenting the
predictor space into several simple regions. The mean or mode of the training observations
in their own region are used to compute the prediction of a given observation. This process
is composed of a succession of splitting rules to segment the space which mimic the
branches of a tree and is referred to as a decision tree.
Even if tree-based methods do not compete with more advanced supervised learning
approaches, they are often preferred thanks to their simple interpretation. However,
it is possible to improve prediction accuracy by combining a large number of trees,
at the expense of some loss in interpretation. To grow a classification tree, we use what is
called the classification error rate as a criterion for making recursive binary splits. The goal
is to assign an observation in a given region to the most commonly occurring class of
training observations in that region. Therefore, the classification error rate is simply the
fraction of the training observations in that region that do not belong to the most common
class:
E = 1 − max_k(p̂_mk),
where p̂_mk is the proportion of observations in the m-th region that are from the k-th class. In practice, this criterion is not sensitive enough for growing the trees, which leads us to two other measures that are usually preferred: the Gini index and entropy. The Gini index is a measure of total variance across the K classes and is defined by:
G = Σ_{k=1}^{K} p̂_mk (1 − p̂_mk).
If all the p̂_mk are close to 0 or 1, the Gini index will be small, meaning that a small value of G indicates that a node mainly contains observations from only one class, which can be characterized as node purity. As was mentioned before, an alternative to the Gini index is entropy.
Bagging
As we mentioned in the previous section, decision trees suffer from high variance, meaning
that if we fit a decision tree to two different subsets of the training data, we might get quite
different results. Bootstrap aggregation, or bagging, is a method aiming at reducing the
variance, and therefore is commonly used when decision trees are implemented. Given a set of n independent observations Z_1, ..., Z_n, each with variance σ², the variance of the mean Z̄ of the observations is given by σ²/n. This means that averaging a set of observations reduces variance. Hence, by taking many training sets from the population, building a separate prediction model using each training set and averaging the resulting predictions, we can reduce the variance and consequently increase the prediction accuracy of the method. In particular, we calculate f̂_1(x), f̂_2(x), ..., f̂_B(x) using B separate training sets, and average them in order to obtain a single low-variance statistical model, given by:
f̂_avg(x) = (1/B) Σ_{b=1}^{B} f̂_b(x).
However, in most use cases, it might not be possible to access multiple training sets. That is where the bootstrap method becomes useful. Bootstrap consists in taking repeated samples from the original training data set. It generates B different bootstrapped training data sets. Then, the model is fit on the b-th bootstrapped training set, resulting in the prediction f̂*_b(x). All the predictions are averaged to obtain:
f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*_b(x).
Bagging can easily be applied to a classification problem, to predict a qualitative outcome Y. For a given test observation, each of the B trees predicts a class and we choose the overall prediction as the most commonly occurring class among the B predictions.
Random Forests
The main drawback when bagging several decision trees is that the trees are correlated.
Random forest provides a way to fix this issue by using individual trees as
building blocks. Random forest builds multiple decision trees using
bootstrapped samples from the training data, each tree having high
variance, and average these trees which reduces variance. To prevent the
correlation between the trees, it randomly selects a subset of variables to use for each tree, whose size is usually set to m = √N, N being the total number of features. Correlation is avoided because each tree does not consider all variables, but only subsets of them. The problem of overfitting is also addressed by this technique. The main disadvantage of using multiple trees is that it lowers the interpretability of the model [12]. The Gini index, presented in Section
5.1.2 can also be used there in order to measure feature importance. The depth of a feature
used for a split can also be used to indicate importance of a given
feature. That is, as the intuition confirms, features used at the top splits of a tree will
influence the final predicted observations more than features used for splits at the bottom
of the tree. [9]
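To make the feature-importance idea concrete, here is a brief, illustrative sketch with scikit-learn's RandomForestClassifier; the data and feature names are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for behavioral features and churn labels
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.3 * X[:, 2] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Gini-based importance of each (hypothetical) behavioral feature
for name, importance in zip(['calls', 'integrations', 'tickets'], forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```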
Model Evaluation
Confusion Matrix
A confusion matrix summarizes a classifier's predictions in four entries: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). Given the values of these four entries, several other metrics can be derived, namely accuracy, recall, precision and F-score.
Accuracy
Accuracy computes the number of correctly classified items out of all classified items:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall
Recall tells how much the model predicted correctly, out of all the positive classes. It should be as high as possible, as a high recall indicates the class is correctly recognized (small number of FN). It is usually used when the goal is to limit the number of false negatives:
Recall = TP / (TP + FN)
Precision
Precision tells, out of all the examples that were predicted as positive, how many are actually positive. High precision indicates that an example labeled as positive is actually positive (small number of FP). It is usually used when the goal is to limit the number of false positives:
Precision = TP / (TP + FP)
A model with high recall but low precision means that most of the positive examples are correctly recognized (low FN) but that there are a lot of false positives.
F-score
F-score = 2 × (Recall × Precision) / (Recall + Precision)
F-score is a way to represent precision and recall at the same time and is therefore widely used for measuring model performances. Indeed, if you try to only optimize recall, the algorithm will predict most examples to belong to the positive class, but that will result in many false positives and, hence, low precision. On the other hand, optimizing precision will lead the model to predict very few examples as positive results (the ones with highest probability), but recall will be very low. [14]
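These metrics can be computed directly with scikit-learn, as in the sketch below; the labels and predictions are made up purely to show the calls.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # made-up ground truth (1 = churner)
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]   # made-up predictions

print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```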
ROC Curve and AUC
The AUC - ROC curve is a more visual way to measure the performance of a classifier at various threshold settings. The ROC (Receiver Operating Characteristics) curve is created by plotting the recall against the false positive rate (FPR), defined by:
FPR = FP / (FP + TN)
Figure: Example of ROC Curve
AUC (Area Under the Curve) represents the degree of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate. The dotted diagonal represents the 'no information' classifier; this is what we would expect from a 'random guessing' classifier.
The precision-recall curve illustrates the trade-off between precision and recall that was mentioned in the previous sections. As with the ROC curve, each point in the plot corresponds to a different threshold. A threshold equal to 0 implies that the recall is 1, whereas a threshold equal to 1 implies that the recall is 0. With this curve, the closer it is to the top right corner, the better the algorithm. Hence, a larger area under the curve indicates that the algorithm has higher recall and higher precision. In this context, the area is known as the average precision.
Figure: Example of Precision-Recall Curve
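A brief sketch of how these curves and their areas can be obtained with scikit-learn, using made-up scores for illustration:

```python
from sklearn.metrics import roc_curve, roc_auc_score, average_precision_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                      # made-up labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)     # points of the ROC curve
print(roc_auc_score(y_true, y_scores))                 # area under the ROC curve
print(average_precision_score(y_true, y_scores))       # area under the precision-recall curve
```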
Imbalanced Data
Imbalanced data refers to a situation where one of the classes forms a high majority and dominates the other classes. This type of distribution might cause an accuracy bias in machine learning algorithms and prevent the performance of the model from being evaluated correctly. Indeed, suppose there are two classes, A and B. Class A makes up 90% of the data set and class B the other 10%, but it is most interesting to identify instances of class B. Then, a model that always predicts class A will be successful 90% of the time in terms of basic accuracy. However, this is impractical for the intended use case because the costs of false positive (or Type I Error) and false negative (or Type II Error) predictions are not equal. Instead, a properly calibrated method may achieve a lower accuracy, but would have a substantially higher true positive rate (or recall), which is really the metric that should be optimized. There are a couple of solutions to the imbalanced data problem, but only the ones that were tested during this project will be mentioned.
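One simple remedy, re-weighting the minority class, can be expressed in scikit-learn as sketched below; whether this exact option was used in the project is not stated, so treat it as an illustrative possibility on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative imbalanced data: roughly 90% of the labels are 0 and 10% are 1
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 1.3).astype(int)

# class_weight='balanced' re-weights samples inversely to class frequency,
# so the rare positive (churner) class is not simply ignored
model = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)
print(model.score(X, y))
```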
THEORY
Cross validation
Figure: Hold-out method
k-fold Cross-Validation
Cross-validation is a refinement of the train-test split approach that addresses the issue
highlighted in the previous subsection. This approach consists in randomly dividing the set
of observations into k groups, or folds, of approximately equal size.
The first fold acts as a test set, and the method is fit on the remaining k − 1 folds. The mean squared error is then computed on the observations of the test set. The overall procedure is repeated k times, with a different subset of observations taking the role of the test set each time. As a result, the test error is estimated.
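In scikit-learn, k-fold cross-validation of a classifier can be sketched as follows (synthetic data, k = 5 assumed):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the behavioral features and churn labels
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(scores)         # one accuracy estimate per fold
print(scores.mean())  # overall cross-validated estimate
```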
Basic Model
Logistic Regression
The following model is a basic logistic regression model fit on a training set resulting from a simple stratified train-test split. No resampling technique is implemented.
Confusion Matrix
Figure: Precision, Recall, F1 - Logistic Regression - No resampling
The model has an accuracy of 99.55% but the confusion matrix shows that none of the true churners have been detected.
As intuitively presumed, the model is strongly overfitting, and therefore always predicts
the negative class to reach the highest accuracy.
Random Forest
Random Forest is an ensemble bagging technique where several decision trees combine to
give the result. The process is a combination of bootstrapping and aggregation. The main
idea behind this is that lots of high variance and low bias trees combine to generate a low
bias and low variance random forest. Since it is distributed over different trees, each tree
seeing different subsets of data, it is less prone to overfitting than logistic regression. In this
section, the confusion matrix and precision, recall and f1 score are presented, for a random
forest model fit on data that was not resampled.
Because of the class imbalance, using a random forest model, even if it tends to overfit less
than a linear model, is not sufficient to prevent overfitting.
SYSTEM MAINTENANCE:
The maintenance phase of the software cycle is the time in which a software product
performs useful work. After a system is successfully implemented, it should be maintained
in a proper manner. System maintenance is an important aspect in the software
development life cycle. The need for system maintenance is to keep the system adaptable to changes in its environment. There may be social, technical and other environmental changes which affect a system that is implemented. Software product
enhancements may involve providing new functional capabilities, improving user displays
and mode of interaction, upgrading the performance of the characteristics of the system.
Maintenance phase identifies if there are any changes required in the current system. If the
changes are identified, then an analysis is made to identify if the changes are really
required. Cost benefit analysis is a way to find out if the change is essential. System
maintenance conforms the system to its original requirements and the purpose is to
preserve the value of software over the time. The value can be enhanced by expanding the
customer base, meeting additional requirements, becoming easier to use, more efficient
and employing newer technology.
In conclusion, the application of machine learning in Submarine Rock vs. Mine Prediction
represents a pivotal advancement in the field of maritime security and underwater defense. The
ability to accurately distinguish between natural submarine rock formations and potentially
hazardous naval mines is of paramount importance, and machine learning offers an innovative
demonstrate the capability to analyze and interpret diverse underwater data sources,
systems have the potential to drastically reduce false alarms, minimizing unnecessary
naval vessels equipped with sonar systems enables real-time threat assessment,
not only enhances national security but also minimizes the risk of unintended ecological
In the pursuit of Submarine Rock vs. Mine Prediction using Machine Learning, the collaboration
of defense organizations, research institutions, and industry partners is crucial. This collaborative
effort helps to pool resources, expertise, and data to advance the field and address the evolving
While this technology holds immense promise, it is important to address concerns related to data
security, privacy, and ethical considerations, particularly in military and defense applications.
In conclusion, the integration of machine learning in Submarine Rock vs. Mine Prediction is a
transformative step that contributes to safer and more secure maritime environments, both in
systems are poised to become increasingly accurate, efficient, and indispensable in safeguarding
09. BIBLIOGRAPHY
This review discusses how machine learning is applied in geology, including rock classification
and predictive models for mineral exploration.
This paper explores various machine learning techniques for interpreting geophysical data in
mineral and oil exploration, which might be useful for distinguishing submarine rocks and
mining sites.
∙ Wang, F., & Zhang, J. (2018). Application of machine learning in the prediction of mineral
deposits. Ore Geology Reviews, 101, 335-350.
Focuses on how machine learning algorithms can be used to predict the location of mineral
deposits, an important aspect of submarine mining and rock prediction.
∙ Yuan, H., & Zhao, M. (2020). An overview of machine learning methods for geophysical
and geological data interpretation. Computers & Geosciences, 137, 104362.
Explores different machine learning techniques for interpreting geophysical data, which could
be applied to distinguish between submarine rocks and mine locations.
∙ Zhou, Y., & Li, L. (2022). A study on underwater mine identification based on deep
learning. Journal of Marine Science and Engineering, 10(3), 392.
This paper explores how deep learning can be used to predict the locations of underwater
mines and identify submarine rock formations in marine environments.
∙ Sánchez-Rodríguez, C., & Peña, F. (2021). Machine learning for the exploration and
extraction of deep-sea mineral resources. Minerals, 11(11), 1245.
Discusses the use of machine learning to identify and predict mineral deposits in the deep-sea
environment, which can be useful for distinguishing submarine rock formations from mines.
∙ Liu, Y., & Li, S. (2017). Mining subsurface data with machine learning: A review.
Computational Geosciences, 21(1), 29-46.
This review addresses how machine learning is applied to mining, including subsurface data
analysis and prediction tasks related to mines and geological features.
∙ Lee, J., & Kim, H. (2020). Automated rock type classification in the marine environment
using machine learning algorithms. Journal of Hydrology, 585, 124739.
Explores how machine learning can be used to classify rock types in underwater
environments, which is highly relevant for distinguishing submarine rocks.
∙ Barton, C., & Bell, D. (2015). Deep learning for underwater robotics: A case study on rock
recognition. In 2015 IEEE International Conference on Robotics and Automation (ICRA), 4591–4597.
Focuses on underwater robotics and how deep learning algorithms can be applied to
recognize underwater geological features like rocks.
∙ Niu, Z., & Wei, X. (2019). Application of machine learning in predictive maintenance of
mining equipment: A case study. Journal of Mining Science, 55(5), 945-952.
Although focused on predictive maintenance, this study shows how machine learning can be used in the mining sector, which may be applicable in submarine mining as well.
10. SAMPLE CODE :
# Step 2: Apply KMeans Clustering for Customer Segmentation
n_clusters = 3  # Choose the number of clusters
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
data['Cluster'] = kmeans.fit_predict(scaled_data)  # Add cluster labels to the data

# Scale the new data based on the same scaling used before
new_customer_scaled = scaler.transform(new_customer_data)
new_customer_cluster = kmeans.predict(new_customer_scaled)  # Predict the new customer's segment
10.1 OUTPUT :