
Breast Cancer Diagnosis Using Machine Learning Alg

The document describes a breast cancer diagnosis dataset containing patient information like age, tumor characteristics, and genetic markers. It then discusses code to preprocess, analyze, and visualize the data using machine learning algorithms to build models that can accurately diagnose breast cancer.

Uploaded by

Azmeraw Zenaw

ARBA MINCH UNIVERSITY

FACULTY OF COMPUTING AND SOFTWARE ENGINEERING


DEPARTMENT OF INFORMATION TECHNOLOGY

Artificial Intelligence assignment


Group members ID No

1 Amanuel Asale…………………………………………………NSR/2899/13
2 Abenezer Asefa………………………………………………..NSR/101/13
3 Dagim Syum…………………………………………………...NSR/694/13
4 Fitsum Eerena ………………………………………………..NSR/1045/13
5 Ermiyas G/Hiwot………………………………………………NSR/2972/13
6 Abel Tesema……………………………………………………NSR/081/13
7 Nihal Mussa…………………………………………………….NSR/1910/13
8 Amira Neri……………………………………………………..NSR/291/13
9 Eden Yazachew…………………………………………………NSR/802/13
10 Mulusew Aynalem……………………………………………NSR/1809/13
Breast Cancer Diagnosis using Machine Learning Algorithms
Introduction
Breast cancer remains a formidable health challenge worldwide, constituting a
significant cause of morbidity and mortality, particularly among women. Timely and
accurate diagnosis is paramount for effective treatment planning and improving
patient outcomes. The integration of machine learning (ML) algorithms into breast
cancer diagnosis offers a promising approach to harnessing complex patient data for
enhanced prognostication and therapeutic decision-making.

Problem Definition

The core objective revolves around the development of robust ML models capable of
accurately distinguishing between benign and malignant breast tumors based on
multifaceted clinical and pathological features. By leveraging diverse datasets
encompassing patient demographics, tumor characteristics, histopathological findings,
and genetic markers, the aim is to empower clinicians with sophisticated tools for risk
stratification and treatment guidance.

Dataset Description

Age:

Description: Age of the patient at the time of diagnosis.

Data Type: Continuous numerical.

Race:

Description: Ethnicity or racial background of the patient.

Data Type: Categorical (e.g., White, Black, Asian, Hispanic, etc.).

Marital Status:

Description: Marital status of the patient at the time of diagnosis.

Data Type: Categorical (e.g., Single, Married, Divorced, Widowed, etc.).

T_Stage:

Description: Tumor stage, indicating the size and extent of the primary tumor.

Data Type: Categorical or ordinal (e.g., T1, T2, T3, T4).

N Stage:

Description: Lymph node stage, indicating the extent of regional lymph node
involvement.

Data Type: Categorical or ordinal (e.g., N0, N1, N2, N3).

6th Stage:

Description: Cancer stage according to the 6th edition of the TNM staging system,
which incorporates tumor size (T), lymph node status (N), and metastasis (M).

Data Type: Categorical or ordinal (e.g., Stage I, Stage II, Stage III, Stage IV).

Differentiate:

Description: Histological grade or degree of tumor differentiation, indicating how
closely the tumor resembles normal tissue.

Data Type: Categorical or ordinal (e.g., Well-differentiated, Moderately-differentiated,
Poorly-differentiated).

Grade:

Description: Histological grade of the tumor, reflecting the aggressiveness and
abnormality of tumor cells.

Data Type: Categorical or ordinal (e.g., Grade 1, Grade 2, Grade 3).

A Stage:

Description: Cancer stage according to the American Joint Committee on Cancer
(AJCC) staging system, incorporating tumor size, lymph node status, and metastasis.

Data Type: Categorical or ordinal (e.g., Stage I, Stage II, Stage III, Stage IV).

Tumor Size:

Description: Size of the primary tumor, typically measured in millimeters.

Data Type: Continuous numerical.

Estrogen Status:

Description: Estrogen receptor (ER) status of the tumor, indicating whether the tumor
cells have receptors for estrogen hormone.

Data Type: Categorical (e.g., Positive, Negative, Unknown).

Progesterone Status:

Description: Progesterone receptor (PR) status of the tumor, indicating whether the
tumor cells have receptors for progesterone hormone.

Data Type: Categorical (e.g., Positive, Negative, Unknown).

Regional Node Examined:

Description: Number of regional lymph nodes examined during surgery or biopsy.

Data Type: Discrete numerical (integer count).

Regional Node Positive:

Description: Number of regional lymph nodes positive for cancer cells.

Data Type: Discrete numerical (integer count).

Survival Months:

Description: Duration of survival in months following the diagnosis of breast cancer.

Data Type: Continuous numerical.

Status:

Description: Survival status of the patient at the end of the observation period.

Data Type: Categorical (e.g., Alive, Dead).

Code Explanation

Importing libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
import seaborn as sns
import warnings
warnings.simplefilter("ignore")

The code above imports the libraries used in the rest of the analysis:

numpy: used for numerical computing in Python.

pandas: used for data manipulation and analysis in Python.

matplotlib: used for creating static, animated, and interactive visualizations in
Python.

train_test_split: imported from the model_selection module of scikit-learn; it splits
data into training and testing sets for machine learning models.

seaborn: a data visualization library built on top of Matplotlib that provides a
high-level interface for drawing attractive and informative statistical graphics.

warnings: the standard-library module for handling warning messages. The call
warnings.simplefilter("ignore") suppresses all warnings for the rest of the session.

data_f=pd.read_csv("Breast_Cancer.csv")
data_f.tail(10)
pd.read_csv("Breast_Cancer.csv"): reads the CSV file named "Breast_Cancer.csv" into a
pandas DataFrame, a tabular data structure, and stores it in data_f.

data_f.tail(10): returns the last 10 rows of the DataFrame, which is a quick way to
inspect the end of the dataset.

Here is the output

data_f['Status'].value_counts()
The code data_f['Status'].value_counts() simply counts the occurrences of each unique
value in the 'Status' column of the DataFrame data_f, providing a summary of the
distribution of different statuses in the dataset.

Here is the output

Status
Alive 3408
Dead 616
Name: count, dtype: int64
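
The counts above show that the two classes are unbalanced. As a minimal sketch, the class shares and the accuracy of a trivial "always predict the majority class" model can be worked out from those two counts alone. The variable names status_counts, status_share and majority_baseline are illustrative and do not appear in the original code.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
status_counts = pd.Series({"Alive": 3408, "Dead": 616}, name="count")

# Share of each class; on the real data this matches
# data_f['Status'].value_counts(normalize=True).
status_share = status_counts / status_counts.sum()
print(status_share.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = status_share.max()
print("Majority-class baseline accuracy: {:.2f}%".format(majority_baseline * 100))
```

Any accuracy reported for the models later in the document is best read against this baseline of roughly 85%.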

data_f.dtypes
The code data_f.dtypes retrieves the data types of each column in the DataFrame
data_f. It returns a Series where the index contains the column names and the values
contain the corresponding data types of each column. This helps in understanding the
data types of different variables in the dataset, which is essential for data manipulation
and analysis.
Here is the output
Age                        int64
Race                      object
Marital Status            object
T_Stage                   object
N Stage                   object
6th Stage                 object
differentiate             object
Grade                     object
A Stage                   object
Tumor Size                 int64
Estrogen Status           object
Progesterone Status       object
Regional Node Examined     int64
Reginol Node Positive      int64
Survival Months            int64
Status                    object
dtype: object

data_f.info()
The data_f.info() function provides a concise summary of the DataFrame data_f,
including information about the index dtype and column dtypes, non-null values, and
memory usage. This method is useful for quickly understanding the structure of the
DataFrame, the number of non-null values in each column, and the memory usage,
which can be helpful for data cleaning and optimization.

Here is the output


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Age                     4024 non-null   int64
 1   Race                    4024 non-null   object
 2   Marital Status          4024 non-null   object
 3   T_Stage                 4024 non-null   object
 4   N Stage                 4024 non-null   object
 5   6th Stage               4024 non-null   object
 6   differentiate           4024 non-null   object
 7   Grade                   4024 non-null   object
 8   A Stage                 4024 non-null   object
 9   Tumor Size              4024 non-null   int64
 10  Estrogen Status         4024 non-null   object
 11  Progesterone Status     4024 non-null   object
 12  Regional Node Examined  4024 non-null   int64
 13  Reginol Node Positive   4024 non-null   int64
 14  Survival Months         4024 non-null   int64
 15  Status                  4024 non-null   object
dtypes: int64(5), object(11)
memory usage: 503.1+ KB

The code data_f.isnull().sum() quickly calculates the total number of missing values
in each column of the DataFrame data_f.
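
As a small illustration of what isnull().sum() returns, the sketch below runs it on a hypothetical three-row frame (toy) that deliberately contains gaps. The real data_f should report zero for every column, since info() lists 4024 non-null values in each of them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with deliberate gaps; not rows from the real dataset.
toy = pd.DataFrame({
    "Age": [43, np.nan, 58],
    "Status": ["Alive", "Dead", None],
    "Tumor Size": [40, 35, 63],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```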

In short, data_f is the DataFrame that holds the dataset. Printing or displaying data_f
shows its rows and columns for visual inspection; for a frame of this size pandas
abbreviates the display by default rather than printing every row.

Visualizing

The above code splits the dataset x and target y into training and testing sets using
train_test_split. Then, it fits a logistic regression model (model1) to the training data
and makes predictions on the test data. The predicted labels are stored in y_pred.
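
The split-and-fit step described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the feature matrix x and target y are synthetic stand-ins, and the test_size, random_state, stratify and max_iter values are guesses, not settings taken from the original code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features x and the Status target y.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
y = np.where(x[:, 0] + 0.5 * x[:, 1] > 0, "Alive", "Dead")

# Hold out 20% of the rows for testing (assumed split ratio);
# stratify=y keeps the class proportions similar in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

# Fit logistic regression on the training part and predict the test part.
model1 = LogisticRegression(max_iter=1000)
model1.fit(x_train, y_train)
y_pred = model1.predict(x_test)
print(y_pred[:10])
```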

The following code preprocesses a dataset x containing non-numeric data by:
• Identifying non-numeric columns.
• Label encoding non-numeric columns using LabelEncoder.
• Scaling the entire dataset using MinMaxScaler.
This prepares the data for use with machine learning algorithms that require numeric
input.
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

# x is assumed to be a pandas DataFrame that still contains non-numeric columns

# Identify non-numeric columns
non_numeric_columns = x.select_dtypes(exclude=['number']).columns

# Label encode each non-numeric column, keeping the fitted encoder per column
label_encoders = {}
for col in non_numeric_columns:
    label_encoders[col] = LabelEncoder()
    x[col] = label_encoders[col].fit_transform(x[col])

# All columns are now numeric, so MinMaxScaler can rescale each one to [0, 1]
scaler = MinMaxScaler()
x_encod = scaler.fit_transform(x)
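
The code above fits the scaler on the whole dataset. One possible refinement, which is not part of the original code, is to fit MinMaxScaler on the training rows only and reuse it on the test rows, so that no information from the test set influences the scaling. The sketch below uses a hypothetical six-row frame (toy) and an assumed split of two test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric features; not rows from the real dataset.
toy = pd.DataFrame({
    "Age": [30, 45, 50, 62, 70, 38],
    "Tumor Size": [10, 25, 40, 8, 60, 33],
})

# Split first (two rows held out), then scale.
train, test = train_test_split(toy, test_size=2, random_state=0)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and max from training rows only
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled)
print(test_scaled)
```

With this approach the scaled test values may fall slightly outside [0, 1], which is expected because the minimum and maximum come from the training rows.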

The following code evaluates the performance of a classification model (e.g.,
logistic regression) by calculating and printing various metrics:
• It computes and prints the confusion matrix, summarizing the model's predictions.
• It prints a classification report, including precision, recall, F1-score, and support
for each class.
• It calculates and prints the accuracy of the model.
• It calculates and prints the misclassification error.
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

conf = confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
print(conf)

accuracy = accuracy_score(y_test, y_pred)
print('accuracy of LR IS: {:.2f}%'.format(accuracy*100))

# Calculate misclassification error
misclassification_error = 1 - accuracy
print('MCE of LR IS: {:.2f}%'.format(misclassification_error*100))
Output:

In summary, the code above evaluates the logistic regression model in four steps:
• It computes the confusion matrix, providing a summary of the model's predictions.
• It prints a classification report, including precision, recall, F1-score, and support
for each class.
• It calculates the accuracy of the model.
• It computes the misclassification error as 1 minus the accuracy.
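
To make the link between these metrics concrete, the sketch below derives accuracy and misclassification error directly from a confusion matrix. The ten labels are hypothetical and chosen only for the arithmetic; they are not results from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only: 8 true "Alive", 2 true "Dead".
y_test = np.array(["Alive"] * 8 + ["Dead"] * 2)
y_pred = np.array(["Alive"] * 7 + ["Dead"] + ["Dead", "Alive"])

# Rows are true classes, columns are predicted classes.
conf = confusion_matrix(y_test, y_pred, labels=["Alive", "Dead"])
print(conf)

# Correct predictions sit on the diagonal, so
# accuracy = diagonal sum / total count.
accuracy_from_matrix = np.trace(conf) / conf.sum()
misclassification_error = 1 - accuracy_from_matrix

print("accuracy: {:.2f}%".format(accuracy_from_matrix * 100))
print("MCE: {:.2f}%".format(misclassification_error * 100))
```

Here 8 of the 10 predictions are correct, so the accuracy is 80% and the misclassification error is 20%, which agrees with accuracy_score on the same labels.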

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

modelsvm = SVC(kernel='linear', random_state=0)  # Initialize the SVM classifier
modelsvm.fit(x_train, y_train)  # Fit the SVM model to the training data
y_pred = modelsvm.predict(x_test)  # Predict class labels for the test data
print(y_pred)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

accuracysvm = accuracy_score(y_test, y_pred)
print('accuracy of SVM IS: {:.2f}%'.format(accuracysvm*100))

# Calculate misclassification error
misclassification_error = 1 - accuracysvm
print('MCE of SVM IS: {:.2f}%'.format(misclassification_error*100))

Output:

A Gaussian Naive Bayes classifier (NBclassifier1) is trained on the training data
(x_train, y_train) and evaluated on the test data (x_test, y_test) using the following
steps:
• It initializes and trains the Gaussian Naive Bayes classifier (NBclassifier1) using
the training data.
• It predicts the class labels for the test data using the trained classifier and stores
the predictions in y_pred.
• It prints a classification report, which includes precision, recall, F1-score, and
support for each class, based on the actual and predicted labels (y_test, y_pred).
• It prints the confusion matrix, providing a summary of the model's predictions.
• It calculates and prints the accuracy of the Naive Bayes model.
• It calculates and prints the misclassification error.
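
The Naive Bayes steps listed above can be sketched as follows. The name NBclassifier1 comes from the text; everything else is an assumption made for illustration, including the synthetic data, the split settings and the variable name accuracynb.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the encoded features and the Status target.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 4))
y = np.where(x[:, 0] - x[:, 2] > 0, "Alive", "Dead")
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Initialize and train the Gaussian Naive Bayes classifier.
NBclassifier1 = GaussianNB()
NBclassifier1.fit(x_train, y_train)

# Predict the class labels for the test data.
y_pred = NBclassifier1.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

accuracynb = accuracy_score(y_test, y_pred)
print('accuracy of NB IS: {:.2f}%'.format(accuracynb * 100))

# Misclassification error is the complement of accuracy.
misclassification_error = 1 - accuracynb
print('MCE of NB IS: {:.2f}%'.format(misclassification_error * 100))
```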

The SVM code shown earlier trains a Support Vector Machine (SVM) classifier with a
linear kernel on the training data, makes predictions on the test data, and evaluates
its performance using the following steps:
• It initializes and trains the SVM classifier with a linear kernel on the training
data.
• It predicts the class labels for the test data using the trained SVM model.
• It prints the predicted class labels.
• It prints a classification report, confusion matrix, accuracy, and misclassification
error to evaluate the performance of the SVM model on the test data.
