Lab 14
• Understanding practical issues such as class imbalance, lack of labeled data, and ethical considerations (e.g., data privacy and bias).

f) Impact on Personalized Medicine

• Machine learning’s role in tailoring diagnostics and treatment plans to individual patients based on imaging biomarkers and classifier predictions.
• The importance of ensuring model explainability, transparency, and compliance with medical regulations to foster trust and adoption in clinical practice.
II. MATERIAL & METHOD
A. Material:

Google Colab is a cloud-based platform that provides a Jupyter notebook environment for writing and executing Python code. It is popular for data analysis, machine learning, and medical image processing due to its free access to computational resources like GPUs and TPUs. Colab simplifies the process of working on Python projects, especially when dealing with large datasets or computationally intensive tasks. Its integration with Google Drive allows seamless storage and sharing of data, making it a valuable tool for collaborative research and development.
B. METHOD:
Step 1: Import Libraries
This step involves bringing in essential Python libraries that enable various functionalities:

• Numpy: Handles numerical operations and data arrays.
• Matplotlib: Used for visualizations, such as plotting results.
• Scikit-learn: Provides tools for dataset loading, preprocessing, model building, and evaluation.
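A minimal sketch of these imports, including the scikit-learn modules used in the later steps (the exact list in your notebook may differ):

    import numpy as np                                      # numerical operations and data arrays
    import matplotlib.pyplot as plt                         # visualizations, such as plotting results
    from sklearn.datasets import load_breast_cancer         # dataset loading
    from sklearn.model_selection import train_test_split    # train/test splitting
    from sklearn.preprocessing import StandardScaler        # feature standardization
    from sklearn.neighbors import KNeighborsClassifier      # k-NN classifier
    from sklearn.svm import SVC                             # SVM classifier
    from sklearn.ensemble import RandomForestClassifier     # Random Forest classifier
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report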
Step 3: Split the Dataset

The dataset is divided into two subsets:

• Training Set: Used to train the machine learning classifiers.
• Testing Set: Used to evaluate the model's performance.

Typically, 80% of the data is used for training, and 20% is used for testing. The split ensures that the model is evaluated on data it has never seen before, simulating real-world scenarios.
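A sketch of the 80/20 split, assuming scikit-learn's built-in breast cancer dataset (the dataset the Conclusions refer to); the fixed random_state simply makes the split reproducible:

    # Load the breast cancer dataset (assumed here; see the Conclusions)
    data = load_breast_cancer()
    X, y = data.data, data.target

    # 80% of samples for training, 20% held out for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )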
Step 4: Standardize the Features

Medical data often involves attributes on different scales (e.g., image intensity values vs. feature size). Standardization ensures that all features have a mean of 0 and a standard deviation of 1. This step is crucial for algorithms like SVM and k-NN that are sensitive to feature scaling.
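A sketch of the standardization step; the scaler is fitted on the training set only, so no information from the test set leaks into training:

    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)  # learn mean/std from training data, then transform it
    X_test_std = scaler.transform(X_test)        # apply the same training-set statistics to the test data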
Step 5: Train the k-Nearest Neighbors (k-NN) Classifier

The k-NN algorithm is a simple yet effective method where:

• The classifier identifies the 'k' nearest data points in the training set to a new test point.
• It assigns the most common label among the neighbors to the test point.

The model is trained on the standardized training data and then used to make predictions on the test set. The choice of 'k' (e.g., 5) determines the number of neighbors considered.
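A sketch of this step with k = 5, continuing from the variables defined above:

    knn = KNeighborsClassifier(n_neighbors=5)  # consider the 5 nearest neighbors
    knn.fit(X_train_std, y_train)              # "training" stores the standardized training points
    y_pred_knn = knn.predict(X_test_std)       # majority vote among each test point's neighbors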
Step 6: Train the Support Vector Machine (SVM) Classifier

SVM is a more sophisticated algorithm that works by finding the optimal hyperplane to separate classes in the feature space:

• With a linear kernel, the model assumes that the classes can be separated by a straight line or hyperplane.
• The algorithm finds the hyperplane that maximizes the margin (distance) between classes, making it robust to outliers.

The SVM model is trained and tested similarly to the k-NN model.
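A sketch of the SVM step with a linear kernel, mirroring the k-NN workflow:

    svm = SVC(kernel='linear')            # linear kernel: a separating hyperplane in feature space
    svm.fit(X_train_std, y_train)         # find the maximum-margin hyperplane
    y_pred_svm = svm.predict(X_test_std)  # classify test points by the side of the hyperplane they fall on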
Step 7: Train the Random Forest (RF) Classifier

Random Forest is an ensemble learning method:

• It constructs multiple decision trees during training and combines their predictions (majority voting for classification) to improve accuracy and prevent overfitting.
• Each tree is trained on a random subset of data, making the model robust and less prone to bias from individual features.
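A sketch of this step; the tree count of 100 is an assumed (and scikit-learn default) value:

    rf = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 trees, reproducible runs
    rf.fit(X_train_std, y_train)        # each tree is fitted on a bootstrap sample of the training data
    y_pred_rf = rf.predict(X_test_std)  # prediction by majority vote across the trees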
Step 8: Make Predictions

1. Use the trained model to predict the labels of the test set.
2. Compare the predicted labels with the actual labels in the test set to evaluate performance.
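A sketch of both items, looping over the three models trained in the previous steps:

    # 1. Predict test-set labels; 2. compare them with the actual labels
    for name, model in [('k-NN', knn), ('SVM', svm), ('Random Forest', rf)]:
        y_pred = model.predict(X_test_std)
        correct = (y_pred == y_test).sum()
        print(f"{name}: {correct}/{len(y_test)} test samples classified correctly")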
Step 9: Evaluate Performance

The performance of each classifier is evaluated using:

1. Accuracy: Measures the proportion of correct predictions.
2. Confusion Matrix: Provides insights into true positives, true negatives, false positives, and false negatives.
3. Classification Report: Includes metrics like precision, recall, and F1-score for each class.
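A sketch of the evaluation, printing all three metrics for each classifier (target_names maps the numeric labels to 'malignant'/'benign' in the assumed dataset):

    for name, y_pred in [('k-NN', y_pred_knn), ('SVM', y_pred_svm), ('Random Forest', y_pred_rf)]:
        print(f"=== {name} ===")
        print("Accuracy:", accuracy_score(y_test, y_pred))
        print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
        print(classification_report(y_test, y_pred, target_names=data.target_names))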
IV. CONCLUSIONS

1. k-Nearest Neighbors (k-NN)

• Results:
o Accuracy: Moderate to high, depending on the dataset structure.
o Confusion Matrix: Indicates how well the classifier distinguished between benign and malignant cases.
o Classification Report: Precision and recall may vary, especially if the classes are imbalanced.

2. Support Vector Machine (SVM)

• Strengths: Finds an optimal hyperplane to maximize the margin between classes, ensuring robustness and generalization.
• Weaknesses: May struggle with very large datasets or overlapping classes without kernel tuning.
• Results:
o Accuracy: Generally high, especially with a linear kernel for linearly separable data like the breast cancer dataset.
o Confusion Matrix: Often exhibits minimal false positives and false negatives, indicating reliable classification.
o Classification Report: Precision and recall are typically balanced, showing good overall performance.

V. REFERENCES