CH 15
Machine Learning
Prediction Capabilities
ML is widely used to make accurate predictions in various domains:
⋄ Weather Forecasting: Improves accuracy to minimize damage and save lives.
⋄ Healthcare: Enhances cancer diagnosis and treatment.
⋄ Business Forecasting: Helps maximize profits and secure jobs.
⋄ Fraud Detection: Identifies fraudulent credit card transactions and insurance claims.
⋄ Customer Churn Prediction: Anticipates which customers are likely to leave, supporting retention and business growth.
⋄ Real Estate Pricing: Predicts house prices based on market trends.
⋄ Entertainment & Sports: Forecasts movie ticket sales and game-winning strategies.
Machine Learning Applications
Some key applications of ML include:
⋄ Anomaly Detection: Identifying unusual patterns in data
⋄ Chatbots: Automated customer support
⋄ Email Classification: Spam detection
⋄ News Classification: Categorizing articles (sports, politics, etc.)
⋄ Computer Vision: Image recognition and classification
⋄ Fraud Detection: Identifying credit card and insurance fraud
⋄ Customer Churn Prediction: Detecting potential customer dropouts
⋄ Data Mining: Extracting insights from social media
⋄ Object Detection: Identifying objects in images and videos
⋄ Pattern Recognition: Finding trends in data
⋄ Medical Diagnostics: Assisting in disease detection
⋄ Facial Recognition: Identity verification
⋄ Network Intrusion Detection: Preventing cyber threats
⋄ Handwriting Recognition: Digitizing handwritten text
⋄ Marketing Analytics: Customer segmentation for targeted ads
⋄ Language Translation: Translating text between languages
⋄ Mortgage Loan Prediction: Assessing loan default risk
Scikit-learn
• The Iris Dataset: A famous machine-learning dataset containing 150 samples of iris flowers, divided into three species:
• Setosa
• Versicolor
• Virginica
• Each sample has four features:
1. Sepal length
2. Sepal width
3. Petal length
4. Petal width
• Since K-Means is an unsupervised method, it does not use the labels
during clustering. Instead, it attempts to group the data into three
natural clusters based on feature similarity.
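A minimal sketch of that clustering step (assuming scikit-learn is installed; the random_state seed is an arbitrary choice):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

# three clusters to match the three species; the target labels are NOT used
kmeans = KMeans(n_clusters=3, random_state=11)
kmeans.fit(iris.data)

print(kmeans.labels_[:10])   # cluster assignment of the first 10 samples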
Dimensionality Reduction using PCA
• Since the Iris dataset has four dimensions (features), it is difficult to visualize.
• To simplify visualization, we use Principal Component Analysis (PCA),
a dimensionality reduction technique.
• PCA reduces the four features to two principal components while preserving most of the variance in the data.
• This allows us to plot the data in 2D and observe how K-Means clusters
the samples.
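A sketch of that visualization, assuming the iris data and the fitted kmeans object from the previous sketch:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2, random_state=11)
iris_pca = pca.fit_transform(iris.data)   # shape (150, 4) -> (150, 2)

# color each 2D point by its K-Means cluster label
plt.scatter(iris_pca[:, 0], iris_pca[:, 1], c=kmeans.labels_)
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()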
digits = load_digits()

OUTPUT:
[(9, 7), (7, 2), (9, 5)]
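The load_digits() call and the tuple list above come from the Digits classification case study; the pairs appear to be (predicted, expected) values for misclassified test samples. A sketch of the kind of code that produces such a list (the train/test split and default kNN settings here are assumptions):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11)

knn = KNeighborsClassifier()
knn.fit(X=X_train, y=y_train)

predicted = knn.predict(X=X_test)
expected = y_test

# keep only the (predicted, expected) pairs that disagree
wrong = [(p, e) for (p, e) in zip(predicted, expected) if p != e]
print(wrong)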
15.3 Case Study: Classification with k-Nearest Neighbors and the Digits Dataset, Part 2
[Figure: 10-fold cross-validation on folds F0 through F9. In each of the 10 iterations, a different fold serves as the test set while the remaining nine folds are used for training.]
• Choosing the Best Model: It’s hard to determine the best machine learning model in advance.
• Model Performance: Some models may perform better than others on a
given dataset.
• Scikit-learn’s Flexibility: Provides multiple models for quick training and
testing.
• Encouragement to Experiment: Running multiple models helps find the
best one.
• Comparing Models: Evaluating KNeighborsClassifier, SVC, and
GaussianNB.
• Ease of Testing: Scikit-learn allows easy testing of models with default
settings.
• Let’s use the techniques from the preceding section to compare several
classification estimators—KNeighborsClassifier, SVC and GaussianNB
(there are more).
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
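A sketch of that comparison, assuming the digits Bunch from load_digits() and default estimator settings; the 10-fold setup mirrors the loop shown later in this section:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

estimators = {
    'KNeighborsClassifier': KNeighborsClassifier(),
    'SVC': SVC(),
    'GaussianNB': GaussianNB()}

for estimator_name, estimator_object in estimators.items():
    kfold = KFold(n_splits=10, random_state=1, shuffle=True)
    scores = cross_val_score(estimator=estimator_object,
                             X=digits.data, y=digits.target, cv=kfold)
    print(f'{estimator_name:>22}: mean accuracy={scores.mean():.2%}; '
          f'standard deviation={scores.std():.2%}')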
• Based on the results, it appears that we can get better accuracy from the KNeighborsClassifier and SVC estimators—at least when using each estimator’s default settings.
k-Nearest Neighbors (kNN) and Hyperparameter Tuning
• Hyperparameters are parameters set before training a model. In the k-
nearest neighbors (kNN) algorithm, k is a hyperparameter that determines
the number of nearest neighbors used for classification.
• The best value of k is determined through hyperparameter tuning: testing different values and evaluating their performance helps select the best k.
• A common approach for evaluating different values of k is k-fold cross-validation. In this process, the dataset is divided into k folds (a separate k, unrelated to the kNN hyperparameter), and the model is trained and tested multiple times using a different fold as the test set each time.
• In practice, odd values of k are preferred to avoid ties. For the Digits dataset, the highest accuracy (98.83%) was observed at a small k (k = 3 in the run shown below), and accuracy tended to decrease as k increased.
• A higher value of k smooths decision boundaries, making the model less sensitive to noise but potentially reducing accuracy.
• kNN is computationally expensive at prediction time because distances to the training samples must be computed for every query; this cost grows with the size of the training set (and, to a lesser extent, with k). Efficient data handling and computational resources are necessary for large datasets.
• The cross_validate function can be used to perform cross-validation while also measuring execution time, providing insight into both accuracy and computational efficiency (a sketch follows the output below).
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# evaluate odd values of k from 1 through 19 with 10-fold cross-validation
for k in range(1, 20, 2):
    kfold = KFold(n_splits=10, random_state=1, shuffle=True)
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(estimator=knn, X=digits.data,
                             y=digits.target, cv=kfold)
    print(f'Mean Accuracy: {scores.mean():.2%} Accuracy SD: {scores.std():.2%}')
# ###################################################################
OUTPUT:
Mean Accuracy: 98.72% Accuracy SD: 0.70%
Mean Accuracy: 98.83% Accuracy SD: 0.80%
Mean Accuracy: 98.78% Accuracy SD: 0.82%
Mean Accuracy: 98.50% Accuracy SD: 0.86%
Mean Accuracy: 98.27% Accuracy SD: 1.01%
Mean Accuracy: 98.39% Accuracy SD: 0.88%
Mean Accuracy: 98.27% Accuracy SD: 0.98%
Mean Accuracy: 98.05% Accuracy SD: 1.12%
Mean Accuracy: 97.77% Accuracy SD: 1.14%
Mean Accuracy: 97.55% Accuracy SD: 1.15%
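As noted above, cross_validate reports fit and score times as well as accuracy; a minimal sketch using the same digits data and the kfold object from the loop above:

from sklearn.model_selection import cross_validate

knn = KNeighborsClassifier(n_neighbors=5)
results = cross_validate(estimator=knn, X=digits.data, y=digits.target,
                         cv=kfold, scoring='accuracy')

print(f"Mean fit time:   {results['fit_time'].mean():.4f}s")
print(f"Mean score time: {results['score_time'].mean():.4f}s")
print(f"Mean accuracy:   {results['test_score'].mean():.2%}")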
Case Study: Time Series and Simple Linear Regression
In [13]: california_df
Out[13]: [DataFrame display not shown in this excerpt]
• Next, we’ll use Matplotlib and Seaborn to display scatter plots of each of
the eight features.
In [16]: import matplotlib.pyplot as plt
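A sketch of those plots; the california Bunch and the 'MedHouseValue' column name in california_df are assumptions about how the DataFrame was built earlier in this case study:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

# one scatter plot per feature, each against the median house value target
for feature in california.feature_names:    # the eight feature names
    plt.figure(figsize=(8, 4.5))
    sns.scatterplot(data=california_df, x=feature, y='MedHouseValue',
                    hue='MedHouseValue', palette='cool', legend=False)
plt.show()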
[Figure: photographs of the three Iris species: (a) Iris setosa, (b) Iris versicolor, (c) Iris virginica]
species
0 setosa
1 setosa
2 setosa
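The species column shown above comes from a pandas DataFrame built from the Iris Bunch; a minimal sketch of one way iris_df might be constructed:

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
# map each numeric target (0, 1, 2) to its species name
iris_df['species'] = [iris.target_names[t] for t in iris.target]

print(iris_df['species'].head(3))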
• print(iris_df.describe())
# ###############################################################
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count             150.00            150.00             150.00            150.00
mean                5.84              3.06               3.76              1.20
std                 0.83              0.44               1.77              0.76
min                 4.30              2.00               1.00              0.10
25%                 5.10              2.80               1.60              0.30
50%                 5.80              3.00               4.35              1.30
75%                 6.40              3.30               5.10              1.80
max                 7.90              4.40               6.90              2.50
Comparing the Computed Cluster Labels to the Iris Dataset’s Target Values
• print(kmeans.labels_[0:50])
# ########################################################################
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1]
import numpy as np

# 'estimators' maps names to clustering estimators
# (a sketch of this dictionary follows the loop below)
for name, estimator in estimators.items():
    estimator.fit(iris.data)          # cluster the Iris samples (unsupervised)
    print(f'\n{name}:')
    for i in range(0, 101, 50):       # samples 0-49, 50-99, 100-149
        labels, counts = np.unique(
            estimator.labels_[i:i+50], return_counts=True)
        print(f'{i}-{i+50}:')
        for label, count in zip(labels, counts):
            print(f'   label={label}, count={count}')
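The estimators dictionary used in the loop above is not defined in this excerpt; a sketch of the kind of mapping it might contain (the specific clustering estimators chosen here are assumptions):

from sklearn.cluster import (KMeans, DBSCAN, MeanShift,
                             SpectralClustering, AgglomerativeClustering)

estimators = {
    'KMeans': KMeans(n_clusters=3, random_state=11),
    'DBSCAN': DBSCAN(),
    'MeanShift': MeanShift(),
    'SpectralClustering': SpectralClustering(n_clusters=3),
    'AgglomerativeClustering': AgglomerativeClustering(n_clusters=3)}

All of these estimators expose a labels_ attribute after fitting, so they work interchangeably in the comparison loop shown above.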