Breast Cancer AIML Project
An
AIML Course Project Report
in partial fulfilment of the degree
Bachelor of Technology
in
Computer Science & Engineering
M.Anil 2203A52239
B.Vishal 2203A52222
.Hrudai 2203A51191
Submitted to
This is to certify that the AIML Course Project Report entitled “Breast Cancer
Prediction” is a record of bonafide work carried out by the students M. Anil, B. Vishal and .Hrudai
bearing Roll Nos. 2203A522391, 2203A52222 and 2203A5 during the academic year 2023-2024 in
partial fulfillment of the award of the degree of Bachelor of Technology in Computer Science &
We express our thanks to our course coordinator, Dr. Eranki Kiran, Prof., for guiding
us from the beginning through the end of the course project. We express our gratitude to the head
of the department CS&AI, Dr. M. Sheshikala, Associate Professor, for encouragement,
support and insightful suggestions. We truly value their consistent feedback on our progress,
which was always constructive and encouraging and ultimately steered us in the right
direction.
We wish to take this opportunity to express our sincere gratitude and deep sense of
respect to our beloved Dean, School of Computer Science and Artificial Intelligence, Dr C.
V. Guru Rao, for his continuous support and guidance to complete this project in the institute.
Finally, we express our thanks to all teaching and non-teaching staff of the department
for their suggestions and timely support.
ABSTRACT
Breast cancer is one of the most prevalent forms of cancer worldwide, affecting millions of
women each year. Early detection plays a crucial role in improving survival rates and
treatment outcomes. In recent years, machine learning techniques have emerged as powerful
tools for breast cancer prediction, offering the potential to enhance screening accuracy and
efficiency. This review paper provides an overview of the various machine learning
algorithms employed for breast cancer prediction, including logistic regression, support
vector machines, decision trees, random forests, artificial neural networks, and deep learning
models. We discuss the features and datasets commonly used in these algorithms and
highlight the strengths and limitations of each approach. Additionally, we examine the
challenges associated with breast cancer prediction, such as data imbalance, feature
selection, and interpretability, and propose potential solutions. Furthermore, we explore
recent advancements in the field, such as the integration of genomic and imaging data, and
the application of transfer learning and ensemble methods. By synthesizing existing research
findings, this review aims to provide insights into the current state-of-the-art in breast cancer
prediction using machine learning and identify future directions for research and
development in this critical domain.
Breast cancer continues to pose a significant public health challenge globally, being the most
common cancer among women and a leading cause of cancer-related mortality. Despite
advances in screening programs and treatment modalities, early detection remains paramount
for improving patient outcomes. Traditional methods of breast cancer screening, such as
mammography, while effective, have limitations in terms of accuracy and accessibility. In
recent years, the emergence of machine learning techniques has opened up new possibilities
for enhancing breast cancer prediction and early detection.
In the context of breast cancer prediction, machine learning offers several advantages over
conventional methods. By leveraging diverse datasets encompassing clinical, demographic,
genetic, and imaging information, machine learning algorithms can potentially improve the
accuracy and efficiency of breast cancer screening. Moreover, these algorithms can adapt and
evolve over time, learning from new data to continually refine their predictive capabilities.
Support vector machines (SVMs) have also been employed for breast cancer prediction,
leveraging their ability to construct optimal decision boundaries in high-dimensional
feature spaces. SVM-based classifiers have been trained on diverse datasets comprising
genetic, proteomic, and imaging data to distinguish between malignant and benign
breast lesions with high accuracy.
Decision tree algorithms, such as CART (Classification and Regression Trees) and
Random Forests, have shown promise in breast cancer risk stratification. These
algorithms recursively partition feature space based on attribute values, enabling the
identification of predictive biomarkers and risk factors associated with breast cancer
development. Random Forests, in particular, have gained popularity due to their
robustness to overfitting and ability to handle missing data.
Artificial neural networks (ANNs) and deep learning models have emerged as powerful
tools for breast cancer prediction, leveraging their capacity to learn intricate patterns
from complex datasets. Convolutional neural networks (CNNs) have been applied to
analyze medical images, such as mammograms and magnetic resonance imaging (MRI),
to detect and classify breast lesions with high sensitivity and specificity.
Recent advancements in the field include the integration of genomic and imaging data
to improve prediction accuracy, as well as the application of transfer learning and
ensemble methods to leverage pre-trained models and enhance generalization
performance. Furthermore, the emergence of federated learning approaches holds
promise for collaborative model development across multiple healthcare institutions
while preserving data privacy and security.
III) Problem Statement
Current breast cancer screening methods, such as mammography, have limitations in
accuracy, leading to false positives or negatives. Machine learning (ML) techniques hold
promise for improving prediction accuracy, but face key challenges:
Data Imbalance: Breast cancer datasets often have more benign cases than malignant, leading
to biased models.
Feature Selection: Identifying relevant biomarkers from high-dimensional datasets is crucial
for accurate prediction.
Interpretability: Complex ML models lack transparency, hindering trust and understanding
from clinicians.
Generalization: ML models need to generalize well across diverse patient populations and
healthcare settings.
A) DataSet:
https://www.kaggle.com/datasets/nancyalaswad90/breast-cancer-dataset
The dataset used for breast cancer prediction typically comprises a collection of clinical,
demographic, genetic, and imaging data from individuals diagnosed with or at risk of
breast cancer. These datasets often include features such as patient age, family history of
cancer, genetic mutations (e.g., BRCA1/BRCA2), tumor characteristics (e.g., size, grade),
histological findings, and imaging results (e.g., mammography, MRI).
B) Data Pre-Processing
C) Handling Missing Values: Imputing missing values using methods like mean, median,
or KNN imputation.
D) Feature Scaling: Rescaling features to a common scale using techniques like min-max
scaling or standardization.
E) Feature Encoding: Converting categorical features into numerical values through
methods like one-hot encoding or label encoding.
F) Feature Selection: Identifying relevant features and reducing dimensionality using
techniques such as RFE or feature importance ranking.
G) Data Balancing: Addressing imbalanced class distributions using techniques like
random oversampling or SMOTE.
H) Train-Test Split: Dividing the dataset into training and testing subsets for model
evaluation, often combined with k-fold cross-validation.
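As a rough illustration of steps C, D and H above, the following sketch uses only the Python standard library. The function names (impute_mean, min_max_scale, train_test_split) and the toy rows are assumptions made for this example; they are not taken from the project's code or from the Kaggle dataset.

```python
import random


def impute_mean(rows):
    """Replace None entries with the mean of the observed values in that column."""
    means = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]


def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(rows) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    pick = lambda seq, ids: [seq[i] for i in ids]
    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical toy data: two numeric features, one missing value.
raw = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [4.0, 40.0]]
labels = [0, 0, 1, 1]
clean = min_max_scale(impute_mean(raw))
X_train, X_test, y_train, y_test = train_test_split(clean, labels)
```

For brevity the sketch imputes and scales before splitting. In a real evaluation the column means and min/max values would normally be computed from the training subset only and then applied to the test subset, so that no information leaks from the held-out data.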
I) Algorithms
In this project we use the following eight techniques:
a. Exploratory Data Analysis & visualization
b. Linear Regression
c. Logistic Regression
d. Polynomial Regression
e. KNN Classification
f. CNN Model
g. Support Vector Machine
h. K means Clustering
Exploratory Data Analysis (EDA) & Visualization:
Logistic Regression:
Despite its name, logistic regression is a classification algorithm used for binary
classification tasks.
Logistic regression models the probability that a given input belongs to a particular class.
It uses the logistic function (sigmoid function) to map predicted values to probabilities
between 0 and 1.
The logistic regression model estimates the coefficients (weights) for each feature using
maximum likelihood estimation.
It's widely used in various fields such as medicine, finance, and social sciences for binary
classification tasks.
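The points above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming batch gradient descent on the negative log-likelihood as the way to reach the maximum likelihood estimate; the function names and the one-feature toy data are hypothetical and are not the project's implementation.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights by gradient descent on the mean negative log-likelihood."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # derivative of the loss w.r.t. z
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Made-up data: one feature (e.g. a tumour size score), label 1 = malignant.
X = [[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
```

After training, predict_proba(w, b, [2.0]) falls below 0.5 and predict_proba(w, b, [7.0]) rises above it, so thresholding the probability at 0.5 yields the binary class label.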
Polynomial Regression:
Polynomial regression extends linear regression by allowing the relationship between the
independent and dependent variables to be modeled as an nth degree polynomial.
The polynomial regression equation has the form: y = β0 + β1*x + β2*x^2 + ... + βn*x^n.
It can capture nonlinear relationships between variables better than linear regression.
However, polynomial regression can suffer from overfitting, especially with higher
degrees of polynomials, and it may not generalize well to unseen data.
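To make the equation concrete, the sketch below fits the coefficients β0..βn by ordinary least squares, solving the normal equations with Gaussian elimination. The helper names fit_polynomial and evaluate_polynomial and the sample points are assumptions chosen for illustration only.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit of y = b0 + b1*x + ... + bn*x^n (n = degree)."""
    n = degree + 1
    # Normal equations (X^T X) beta = X^T y, with X[i][j] = xs[i] ** j.
    a = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(a[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (b[r] - known) / a[r][r]
    return beta


def evaluate_polynomial(beta, x):
    """Evaluate b0 + b1*x + ... + bn*x^n at a single point."""
    return sum(coef * x ** power for power, coef in enumerate(beta))


# Points generated from y = 1 + 2x + 3x^2, so a degree-2 fit should recover it.
xs = [-2, -1, 0, 1, 2]
ys = [9, 2, 1, 6, 17]
beta = fit_polynomial(xs, ys, degree=2)
```

Raising degree gives the curve more freedom to follow the training points, which is where the overfitting risk mentioned above comes from; comparing the error on held-out data across degrees is one common way to choose it.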
K-Nearest Neighbors (KNN) Classification:
KNN is a non-parametric, lazy learning algorithm used for classification and regression
tasks.
Classification using KNN involves assigning the majority class label among the k nearest
neighbors of a data point.
KNN's decision boundary is determined by the training data distribution.
It's computationally expensive during prediction as it requires calculating distances
between the query point and all training points.
KNN's performance can be sensitive to the choice of k and the distance metric used.
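A minimal sketch of the majority-vote rule described above, using Euclidean distance. The function name knn_predict, the labels and the sample points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # "Lazy" learning: nothing is fitted in advance, every prediction
    # measures the distance from the query to all training points.
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up two-feature samples.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (5.0, 5.0), (6.0, 5.5), (5.5, 6.0)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

prediction = knn_predict(train_X, train_y, query=(1.2, 1.1), k=3)
```

Because the distance is computed on raw feature values, features with large numeric ranges dominate the result unless the data is scaled first, which is one reason the choice of distance metric matters.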
Convolutional Neural Network (CNN):
CNN is a deep learning algorithm commonly used for image recognition, classification,
and computer vision tasks.
It's inspired by the visual cortex's organization and uses convolutional layers to
automatically extract features from input images.
CNNs consist of convolutional layers, pooling layers, and fully connected layers.
Convolutional layers apply convolutional filters (kernels) to input images to extract
spatial hierarchies of features.
Pooling layers downsample the feature maps to reduce dimensionality and computation.
CNNs are trained using backpropagation with techniques like gradient descent and can
learn hierarchical representations of features directly from pixel values.
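The two core operations, convolution and pooling, can be illustrated without a deep learning framework. The sketch below is an assumption-laden toy: the kernel is fixed by hand rather than learned, there is a single channel, and no padding, activation function or training is shown. The names conv2d and max_pool2d are chosen for this example.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    As in most CNN libraries, this is technically cross-correlation:
    the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# 5x5 toy "image": dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = conv2d(image, vertical_edge)  # 4x4 map, non-zero only at the edge
pooled = max_pool2d(feature_map)            # 2x2 map after downsampling
```

The feature map is non-zero only in the column where the intensity changes, and pooling halves each spatial dimension while keeping that response. In a trained CNN the kernel values are learned by backpropagation instead of being written by hand.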
Support Vector Machine (SVM):
SVM is a supervised learning algorithm used for classification and regression tasks.
It works by finding the hyperplane that best separates the classes in the feature space.
SVM aims to maximize the margin between the classes, which helps improve
generalization to unseen data and reduce overfitting.
SVM can use different kernel functions (e.g., linear, polynomial, radial basis function) to
handle nonlinear decision boundaries.
It's effective for high-dimensional data and works well with small to medium-sized
datasets.
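The margin idea can be sketched for the linear case with stochastic sub-gradient descent on the regularised hinge loss. This is one simple way to train a linear SVM and is an assumption of this example; library solvers typically use other optimisers, and kernels are not shown. The names train_linear_svm and svm_predict, the hyperparameters and the data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Minimise lam/2 * ||w||^2 + hinge loss. Labels in y must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violates = yi * (dot(w, xi) + b) < 1
            # The regulariser shrinks w, which widens the margin 2 / ||w||.
            w = [wj * (1 - lr * lam) for wj in w]
            if violates:
                # Points inside the margin or misclassified move the hyperplane.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def svm_predict(w, b, x):
    """Classify by the side of the hyperplane w.x + b = 0 that x falls on."""
    return 1 if dot(w, x) + b >= 0 else -1


# Made-up, linearly separable points: +1 and -1 classes.
X = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
```

Only points with y * (w.x + b) < 1 change the solution, which mirrors the role of support vectors: samples far from the boundary have no influence on where the hyperplane ends up.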
K-Means Clustering:
K-Means is an unsupervised learning algorithm used for clustering data into k clusters
based on similarity.
It iteratively partitions data points into clusters by minimizing the within-cluster variance
(sum of squared distances from each point to the centroid).
The algorithm initializes cluster centroids randomly and assigns each data point to the
nearest centroid.
It then updates the centroids based on the mean of the data points in each cluster and
repeats the process until convergence.
K-Means clustering may converge to local optima depending on the initial centroid
positions and is sensitive to outliers.
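The assign-and-update loop described above can be sketched as follows. The function kmeans, its init and seed parameters, and the sample points are assumptions made for this illustration; init is included only so that a run can be repeated with known starting centroids.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Lloyd's algorithm. Starts from `init` if given, else k random points."""
    centroids = list(init) if init else random.Random(seed).sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]  # keep the old centroid if a cluster is empty
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:  # converged: assignments can no longer change
            break
        centroids = updated
    return centroids, clusters


# Two made-up, well-separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
centroids, clusters = kmeans(points, k=2, init=[(1, 1), (9, 9)])
```

Calling kmeans(points, k=2) without init uses random starting points. Because the outcome depends on that choice, a common remedy is to run the algorithm several times with different seeds and keep the result with the lowest within-cluster variance.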
Comparative Study
Algorithm                 Accuracy Rate
Linear Regression         0.85
Logistic Regression       0.78
Polynomial Regression     0.82
KNN Classification        0.75
SVM Classification        0.80
CNN Classification        0.88
K-Means Clustering        -
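The table reports accuracy only. Given the class imbalance noted in the problem statement, accuracy is often read together with precision and recall. The sketch below shows how the three are computed from predicted and true labels; the helper classification_metrics and the label vectors are hypothetical and are not the project's results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 1 = malignant, 0 = benign.
# The "model" here always predicts benign.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10
metrics = classification_metrics(y_true, y_pred)
```

In this constructed case the always-benign predictor reaches an accuracy of 0.8 while its recall on malignant cases is 0, which is why accuracy alone can hide a model that misses every positive case.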
Conclusion:
In conclusion, breast cancer prediction is a complex yet critical task in healthcare, aiming to
improve early detection and treatment outcomes for individuals at risk of this disease.
Machine learning techniques offer promising solutions to enhance prediction accuracy and
efficiency, leveraging diverse datasets encompassing clinical, demographic, genetic, and
imaging information.
Through this review, we have explored the various aspects of breast cancer prediction using
machine learning, including dataset characteristics, data pre-processing techniques,
exploratory data analysis (EDA), and model development. We have highlighted the
challenges and opportunities associated with this domain, such as data imbalance, feature
selection, interpretability, and generalization.
Addressing these challenges requires interdisciplinary collaboration and innovative
approaches to develop robust and interpretable machine learning models for breast cancer
prediction. By leveraging advanced algorithms, integrating diverse data sources, and
employing rigorous evaluation methodologies, we can advance the field of breast cancer
prediction and contribute to improved patient outcomes.
Moving forward, future research efforts should focus on validating machine learning models
in diverse patient populations, integrating them into clinical workflows, and translating
research findings into real-world applications. By harnessing the power of machine learning,
we can make significant strides towards early detection, personalized treatment, and
ultimately, the prevention of breast cancer.
Future Scope:
Integration of Multi-Omics Data: Incorporating diverse molecular data types to
improve predictive accuracy and uncover novel biomarkers.
Development of Explainable AI Models: Creating transparent models to gain trust
from clinicians and patients.
Incorporation of Real-Time Data: Leveraging continuous monitoring for early
detection and proactive interventions.
Personalized Risk Assessment: Tailoring screening and treatment approaches based
on individual characteristics.
Collaboration and Data Sharing: Promoting partnerships and open-access datasets
to accelerate progress.
Ethical and Regulatory Considerations: Addressing privacy, fairness, and
accountability in algorithm development and deployment.