
CH-15 Machine Learning: Classification, Regression and Clustering


15.1 Introduction

• Machine learning (ML) is a subfield of artificial intelligence (AI) that enables computers to learn from data.
• It allows solving complex problems that were previously difficult for traditional programming.
• The goal is to provide a hands-on introduction to ML techniques.
What is Machine Learning?
⋄ ML enables computers to learn from data instead of being
explicitly programmed.
⋄ It relies on large datasets and statistical algorithms to improve performance.
⋄ Python is commonly used for building ML models.
Machine Learning as a Subset of AI

Diagram: Machine Learning shown nested inside the broader field of Artificial Intelligence (AI).
Prediction Capabilities
ML is widely used to make accurate predictions in various domains:
⋄ Weather Forecasting: Improves accuracy to minimize damage and save lives.
⋄ Healthcare: Enhances cancer diagnosis and treatment.
⋄ Business Forecasting: Helps in maximizing profits and securing jobs.
⋄ Fraud Detection: Identifies fraudulent credit card transactions and insurance claims.
⋄ Customer Churn Prediction: Anticipates which customers are likely to leave, supporting retention and business growth.
⋄ Real Estate Pricing: Predicts house prices based on market trends.
⋄ Entertainment & Sports: Forecasts movie ticket sales and game-winning strategies.
Machine Learning Applications
Some key applications of ML include:
⋄ Anomaly Detection: Identifying unusual patterns in data
⋄ Chatbots: Automated customer support
⋄ Email Classification: Spam detection
⋄ News Classification: Categorizing articles (sports, politics, etc.)
⋄ Computer Vision: Image recognition and classification
⋄ Fraud Detection: Identifying credit card and insurance fraud
⋄ Customer Churn Prediction: Detecting potential customer dropouts
⋄ Data Mining: Extracting insights from social media
⋄ Object Detection: Identifying objects in images and videos
⋄ Pattern Recognition: Finding trends in data
⋄ Medical Diagnostics: Assisting in disease detection
⋄ Facial Recognition: Identity verification
⋄ Network Intrusion Detection: Preventing cyber threats
⋄ Handwriting Recognition: Digitizing handwritten text
⋄ Marketing Analytics: Customer segmentation for targeted ads
⋄ Language Translation: Translating text between languages
⋄ Mortgage Loan Prediction: Assessing loan default risk
Scikit-learn

• Scikit-learn (or sklearn) is a popular Python library for machine learning.
• It provides efficient implementations of various machine learning
algorithms, called estimators.
• These estimators encapsulate complex mathematical operations,
making ML more accessible.
• Like using a car or a smartphone without knowing its internal
mechanics, sklearn allows you to build models without deep
mathematical knowledge.
Benefits of Scikit-Learn:
• Requires only a small amount of Python code to build
powerful models.
• Helps in data analysis, extraction of information, and making predictions.
• Automates model training and testing.
• Provides default parameters that often yield good results,
but allows customization for optimization.
• auto-sklearn can further automate many ML tasks.
Choosing the Right Scikit-Learn Estimator:
• No single model works best for all datasets, so multiple
models should be tested.
• scikit-learn makes it easy to try different models and
compare their performance.
• The best model is selected based on evaluation metrics.
• Experience helps in choosing models, but experimentation
is often necessary.
• Creating and using models requires only a few lines of code
in scikit-learn.
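The sketch below (not from the original notes) shows how little code this takes; the use of the Iris dataset, KNeighborsClassifier, and DecisionTreeClassifier here is purely illustrative.

# A minimal sketch of trying two estimators on the same data and comparing test accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=1)

for model in (KNeighborsClassifier(), DecisionTreeClassifier(random_state=1)):
    model.fit(X_train, y_train)                               # train with default settings
    print(type(model).__name__, model.score(X_test, y_test))  # compare test accuracy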
Types of Machine Learning
1. Supervised Machine Learning
• Works with labeled data (e.g., cat images labeled as “cat”).
• Trains on known input-output pairs to make predictions on new data.
• More data improves accuracy.
• Used in applications like email spam detection, disease prediction,
and fraud detection.
Supervised Learning Categories:
• Classification: Predicts discrete labels (e.g., spam vs. not spam).
• Regression: Predicts continuous values (e.g., housing prices, temperature forecasts).
2. Unsupervised Machine Learning
• Works with unlabeled data (e.g., customer shopping behavior).
• Finds patterns or structures without predefined labels.
• Used in recommendation systems, anomaly detection, and market
segmentation.
Unsupervised Learning Techniques:
• Clustering: Groups similar data points (e.g., K-Means for customer
segmentation).
• Dimensionality Reduction: Compresses data while preserving important characteristics (e.g., PCA, t-SNE).
K-Means Clustering and the Iris Dataset
• We’ll present the simplest unsupervised machine-learning algorithm, k-means
clustering, and use it on the Iris dataset that’s also bundled with scikit-learn.
• We’ll use dimensionality reduction (with scikit-learn’s PCA estimator) to compress the Iris dataset’s four features to two for visualization purposes.
• K-Means clustering is one of the simplest and most widely used unsupervised
machine learning algorithms.
• It is primarily used for clustering or grouping similar data points together without
requiring labeled data.
• The main objective of K-Means is to divide a given dataset into k clusters, where
k is a user-defined parameter.
• K-Means follows an iterative approach to form clusters by minimizing the distance
between data points and their assigned cluster centers.
1. Choose k: Select the number of clusters (k) to group the data into.
2. Initialize Centroids: Randomly place k initial centroids (the center points
of each cluster).
3. Assign Data Points: Each data point is assigned to the nearest centroid
based on Euclidean distance (or another distance metric).
4. Recalculate Centroids: The centroids are updated by computing the mean
of all points assigned to each cluster.
5. Repeat: Steps 3 and 4 are repeated until the centroids converge (i.e., they
do not change significantly between iterations).
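A minimal NumPy sketch of these five steps, assuming Euclidean distance and random sample initialization; it is illustrative only and ignores edge cases such as empty clusters.

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids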
Applying K-Means to the Iris Dataset

• The Iris Dataset: The Iris dataset is a famous dataset in machine learning
that contains 150 samples of iris flowers, divided into three species:
• Setosa
• Versicolor
• Virginica
• Each sample has four features:
1. Sepal length
2. Sepal width
3. Petal length
4. Petal width
• Since K-Means is an unsupervised method, it does not use the labels
during clustering. Instead, it attempts to group the data into three
natural clusters based on feature similarity.
Dimensionality Reduction using PCA
• Since the Iris dataset has four dimensions (features), it is difficult to visualize.
• To simplify visualization, we use Principal Component Analysis (PCA), a dimensionality reduction technique.
• PCA reduces the four features to two principal components while preserving most of the variance in the data.
• This allows us to plot the data in 2D and observe how K-Means clusters
the samples.

Performing Clustering and Visualizing Results


• Once PCA has reduced the dataset to two dimensions, we apply K-Means
clustering:
• We set k = 3 since we know there are three species.
• K-Means groups the data into three clusters.
• We plot the cluster centroids (the average positions of points in each
cluster).
• We compare K-Means clusters with the actual species labels to check
the clustering accuracy.
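A hedged sketch of this workflow using scikit-learn's PCA and KMeans estimators; the parameter choices (random_state, n_init) are illustrative, not prescribed by the notes.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

iris = load_iris()
X_2d = PCA(n_components=2, random_state=1).fit_transform(iris.data)  # 4 features -> 2
kmeans = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X_2d)   # k = 3 (three species)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=kmeans.labels_, cmap='viridis', s=20)
plt.scatter(*kmeans.cluster_centers_.T, c='black', marker='X', s=100)  # cluster centroids
plt.title('k-means clusters of the PCA-reduced Iris data')
plt.show()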
Customer Segmentation Using k-Means Clustering
A shopping mall wants to segment its customers based on their purchasing behavior. The dataset consists of customers’ Annual Income (in $1000s) and Spending Score (1–100).

Customer   Annual Income ($1000s)   Spending Score (1–100)
A 15 39
B 16 81
C 30 60
D 45 55
E 48 50
F 55 25
G 65 90
H 75 80
I 85 15
J 90 40
Using k-Means clustering with k = 3, manually assign customers into clusters based on Annual Income and Spending Score.
Approach
• Identify 3 initial cluster centers.
• Assign each customer to the nearest cluster.
• Adjust the cluster centers iteratively.
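One way to check a manual solution is to run scikit-learn's KMeans on the same ten points; the sketch below assumes the table's values and k = 3, and the cluster numbering (and possibly the grouping) may differ from a manual pass depending on the chosen initial centers.

import numpy as np
from sklearn.cluster import KMeans

customers = list('ABCDEFGHIJ')
data = np.array([[15, 39], [16, 81], [30, 60], [45, 55], [48, 50],
                 [55, 25], [65, 90], [75, 80], [85, 15], [90, 40]])  # [income, spending score]

kmeans = KMeans(n_clusters=3, random_state=1, n_init=10).fit(data)
for customer, label in zip(customers, kmeans.labels_):
    print(f'Customer {customer}: cluster {label}')
print('Cluster centers:\n', kmeans.cluster_centers_)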
Big Data and Big Computer Processing Power

The scale of data is growing exponentially, surpassing historical records.


• By some estimates, the data produced in recent years equals all the data produced previously since the dawn of civilization.
• Traditional concerns: “I’m drowning in data and I don’t know what to do
with it.”
• Machine learning shift: “Flood me with big data so I can extract insights
and make predictions.”
• Computing power, memory, and storage are expanding while costs are decreasing.
• These advancements allow us to rethink solution approaches.
• Now, we can program computers to learn from vast amounts of data.
• The focus has shifted to data-driven predictions.
Datasets Bundled with Scikit-Learn
• ”Toy” datasets:
⋄ Boston house prices
⋄ Iris plants
⋄ Diabetes
⋄ Optical recognition of handwritten digits
⋄ Linnerrud
⋄ Wine recognition
⋄ Breast cancer Wisconsin (diagnostic)
• Real-world datasets:
⋄ Olivetti faces
⋄ 20 newsgroups text
⋄ Labeled Faces in the Wild face recognition
⋄ Forest cover types
⋄ RCV1
⋄ Kddcup 99
⋄ California Housing
• It also provides capabilities for loading datasets from other sources, such
as the 20,000+ datasets available at openml.org.
Row Type         Name                          Notation
Without target   Feature vector, Instance      X
With target      Labeled instance, Example     (X, y)
Steps in a Typical Data Science Study


• Load dataset (e.g., from scikit-learn or openml.org).
• Explore data using pandas and visualizations.
• Transform data (convert non-numeric to numeric for models).
• Split data into training and testing sets.
• Create model (select appropriate ML algorithm).
• Train and test model for accuracy evaluation.
• Tune model for better performance.
• Make predictions on new, unseen data.
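The short sketch below (not from the notes) walks through these steps end to end on the Digits dataset; it is a simplified illustration, and the final "unseen data" step simply reuses a few held-out test samples.

import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()                                    # 1. load dataset
print(pd.DataFrame(digits.data).describe().iloc[:, :4])   # 2. explore (first 4 features)
# 3. the pixel data is already numeric, so no transformation is needed here
X_train, X_test, y_train, y_test = train_test_split(      # 4. split into train/test sets
    digits.data, digits.target, random_state=1)
knn = KNeighborsClassifier()                              # 5. create model
knn.fit(X_train, y_train)                                 # 6. train
print('accuracy:', knn.score(X_test, y_test))             #    ... and test
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)  # 7. tune (try another k)
print('predictions:', knn.predict(X_test[:5]))            # 8. predict on held-out samples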
15.2 Case Study: Classification with k-Nearest Neighbors
and the Digits Dataset, Part 1

Automating mail processing requires recognizing handwritten text accurately.
• Postal services use computers to scan and interpret handwritten names,
addresses, and zip codes.
• Machine learning techniques, like k-Nearest Neighbors (k-NN), help solve
these classification problems.
• Scikit-learn simplifies such tasks, making machine learning accessible even
to beginners.
• Future advancements in deep learning (e.g., convolutional neural
networks) further enhance computer vision.
Classification Problems
Supervised machine learning aims to predict the class of a given sample.
• Example: Classifying images as “dog” or “cat” (binary classification).
• The Digits dataset consists of 8×8 pixel images representing 1797 handwritten
digits (0-9).
• Since there are 10 possible classes, this is a multi-class classification problem.
• Training data is labeled—we know each digit’s class beforehand.
• The k-nearest neighbors (k-NN) algorithm will be used for classification.
• The dataset is a subset of the UCI ML handwritten digits data and is similar in spirit to the well-known MNIST database.
Steps in Our Approach:
• Decide the data for training.
• Load and explore the data.
• Split data into training and testing sets.
• Select and build the model.
• Train the model.
• Make predictions.
Upcoming Steps:
• Evaluate the results.
• Tune the model.
• Compare multiple classification models.
• Visualize data with Matplotlib and Seaborn.
Command to launch IPython with Matplotlib support:
• ipython --matplotlib
k-Nearest Neighbors Algorithm
The k-NN algorithm predicts a sample’s class based on its k nearest neighbors.
• Given a test sample, k-NN finds the k training samples closest to it.
• The class with the most votes among the k neighbors is assigned to the test
sample.
• Example:

• Sample X’s three nearest neighbors are all purple → X is classified as purple.
• Sample Y’s three nearest neighbors are all green → Y is classified as green.
• Sample Z has two red neighbors and one green neighbor → Z is classified
as red.
• Choosing an odd k value helps prevent ties in classification (with two classes, an odd k guarantees there is no tie).
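A tiny sketch of the voting step itself, using made-up two-dimensional points, Euclidean distance, and k = 3:

import numpy as np
from collections import Counter

train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]])
train_y = np.array(['purple', 'purple', 'purple', 'green', 'green'])
test_sample = np.array([1.1, 1.0])

distances = np.linalg.norm(train_X - test_sample, axis=1)  # distance to every training sample
nearest = train_y[np.argsort(distances)[:3]]               # labels of the 3 closest samples
print(Counter(nearest).most_common(1)[0][0])               # majority vote -> 'purple'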
Hyperparameters and Hyperparameter Tuning
Machine learning models have two types of parameters:
• Parameters learned from training data.
• Hyperparameters set before training (e.g., k in k-NN).
Hyperparameter tuning optimizes model performance.
• Experimenting with different k values improves classification
accuracy.
• Scikit-learn provides automated hyperparameter tuning
capabilities.
• Later, we’ll use hyperparameter tuning to find the best k for
the Digits dataset.
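One common way to automate this search is scikit-learn's GridSearchCV; the sketch below (a possible setup, not the one used later in the notes) tries a grid of odd k values with 10-fold cross-validation on the Digits dataset.

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': list(range(1, 20, 2))},  # odd k values 1..19
                    cv=10)                                              # 10-fold cross-validation
grid.fit(digits.data, digits.target)
print(grid.best_params_, f'{grid.best_score_:.2%}')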
Handwritten Digits Dataset (Digits Dataset in scikit-learn)
Loading the Dataset
• The load_digits function from sklearn.datasets loads the dataset.
• It returns a Bunch object, a dictionary-like structure with dataset metadata.
• Bunch is a subclass of dict that has additional attributes for interacting
with the dataset.

!pip install scikit-learn

from sklearn.datasets import load_digits
digits = load_digits()

Displaying the Description


• The Digits dataset bundled with scikit-learn is a subset of the UCI (University of California Irvine) ML hand-written digits dataset at:
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
• A Bunch’s DESCR attribute contains a description of the dataset.

print(digits.DESCR)


Dataset Description
• A subset of the UCI ML Handwritten Digits Dataset.
• Contains 1797 samples, each an 8×8 image of a handwritten digit (0–9).
• Each sample has 64 features (pixel values from 0 to 16).
• No missing values in the dataset.
• Original dataset had 5620 samples, but scikit-learn version only contains
test samples (1797 images).
Checking Dataset Structure
• digits.data: NumPy array containing 1797 samples × 64 features.
• digits.target: NumPy array with corresponding digit labels (0–9).
• Data shape:
digits.data.shape    # Output: (1797, 64)
digits.target.shape  # Output: (1797,)

• Checking sample labels:

print(digits.target[::100])

Output: [0 4 1 7 4 8 2 2 4 4 1 9 7 3 2 1 2 5]
Visualizing the Data
A Sample Digit Image
• Each image is two-dimensional—it has a width and a height in pixels.
• The Bunch object returned by load_digits contains an images attribute—an array in which each element is a two-dimensional 8-by-8 array representing a digit image’s pixel intensities.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

# Load the Digits dataset
digits = load_digits()

# Select image at index 13
image = digits.images[13]

# Display the image
plt.imshow(image, cmap='gray')  # Use grayscale colormap
plt.colorbar()                  # Show pixel intensity scale
plt.title("Handwritten Digit at Index 13")
plt.show()
Creating the Diagram
• You should always familiarize yourself with your data. This process is called data exploration.
• Handwritten digit recognition is a difficult problem due to variations among the images.
• Use plt.subplots(nrows=4, ncols=6, figsize=(6, 4)) to create a 6-by-4-inch figure with 4 rows and 6 columns of subplots.
• subplots returns the Figure and the Axes objects in a 2D NumPy array.
• Loop through the 24 subplots, digit images, and target labels using zip().
• Extract the Axes object, image, and target from each tuple.
• Display each image using axes.imshow(image, cmap=plt.cm.gray_r).
• Remove tick marks using axes.set_xticks([]) and axes.set_yticks([]).
• Set the title using axes.set_title(target) to display the digit label.
• Adjust the layout using plt.tight_layout().
• axes.ravel(): Converts the 2D array of Axes into 1D.
• zip(): Iterates over multiple sequences in parallel.
• cmap='gray_r': Displays reversed grayscale images (0 = white, 16 = black).
• set_xticks([]), set_yticks([]): Hide tick marks.
• set_title(target): Shows the actual digit label.
from sklearn.datasets import load_digits

digits = load_digits()

import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=4, ncols=6, figsize=(6, 4))

for item in zip(axes.ravel(), digits.images, digits.target):
    axes, image, target = item
    axes.imshow(image, cmap=plt.cm.gray_r)  # or cmap='gray_r'
    axes.set_xticks([])
    axes.set_yticks([])
    axes.set_title(target)
plt.tight_layout()
Splitting the Data for Training and Testing
• We first break the data into a training set and a testing set to prepare to
train and test the model.
• The function train_test_split from the sklearn.model_selection module shuffles the data to randomize it, then splits the samples in the data array and the target values in the target array into training and testing sets.
• The shuffling and splitting is performed conveniently for you by a ShuffleSplit object from the sklearn.model_selection module.
• Function train_test_split returns a tuple of four elements: the first two are the samples split into training and testing sets, and the last two are the corresponding target values split into training and testing sets.
• The random_state parameter controls the randomization. If you specify a fixed integer (like 1 in this case), the split remains the same every time you run the code.
• test_size=0.25 specifies that 25% of the data is for testing, so train_size is inferred to be 0.75.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=1, test_size=0.25)
Labeled Training and Testing Data Representation

The samples form a feature matrix X (one row per sample, one column per feature) together with a target vector y (one label per row). train_test_split partitions the rows: one portion of the rows of X and y becomes the training data, and the remaining rows become the testing data.
Training and Testing Set Sizes
print(X_train.shape)  # (1347, 64)
print(X_test.shape)   # (450, 64)
print(y_train.shape)  # (1347,)
print(y_test.shape)   # (450,)

Creating the Model


• The KNeighborsClassifier estimator (module sklearn.neighbors) implements the k-nearest neighbors algorithm.
# Creating the Model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

Training the Model


• We invoke the KNeighborsClassifier object’s fit method, which loads the sample training set (X_train) and target training set (y_train) into the estimator.
# Training the Model
knn.fit(X=X_train, y=y_train)
Predicting Digit Classes
• Calling the estimator’s predict method with X test as an argument
returns an array containing the predicted class of each test image.
# Predicting Digit Classes
predicted = knn.predict(X=X_test)
expected = y_test
predicted[:100]
expected[:100]
wrong = [(int(p), int(e)) for p, e in zip(predicted, expected) if p != e]
print(wrong)
print(len(wrong))

OUTPUT:
[(9, 7), (7, 2), (9, 5)]
3
15.3 Case Study: Classification with k-Nearest Neighbors
and the Digits Dataset, Part 2

• Evaluate the k-NN classification estimator’s accuracy.
• Execute multiple estimators and compare their results so you can choose the best one(s).
• Show how to tune k-NN’s hyperparameter k to get the best performance out of a KNeighborsClassifier.
Metrics for Model Accuracy
• Once you’ve trained and tested a model, you’ll want to measure its accuracy.
Estimator Method score
⋄ Each estimator has a score method that returns an indication of how well the estimator performs for the test data you pass as arguments.
⋄ For classification estimators, this method returns the prediction accuracy for the test data.

print(f'{knn.score(X_test, y_test):.2f}')  # OUTPUT: 0.99

⋄ The KNeighborsClassifier with its default k (that is, n_neighbors=5) achieved 97.78% prediction accuracy with a different train/test split; with the split used here (random_state=1), it achieves roughly 99%.
Confusion Matrix
⋄ Shows the correct and incorrect predicted values.
⋄ Also known as the hits and misses.
⋄ Call the function confusion_matrix from the sklearn.metrics module, passing the expected classes and the predicted classes as arguments.
from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(y_true=expected, y_pred=predicted)
print(confusion)
##################################################################
OUTPUT:
[[53  0  0  0  0  0  0  0  0  0]
 [ 0 42  0  0  0  0  0  0  0  0]
 [ 0  0 40  0  0  0  0  1  0  0]
 [ 0  0  0 52  0  0  0  0  0  0]
 [ 0  0  0  0 47  0  0  0  0  0]
 [ 0  0  0  0  0 38  0  0  0  1]
 [ 0  0  0  0  0  0 43  0  0  0]
 [ 0  0  0  0  0  0  0 47  0  1]
 [ 0  0  0  0  0  0  0  0 37  0]
 [ 0  0  0  0  0  0  0  0  0 48]]

⋄ The 1 in the row for digit 2 (column index 7) indicates that one 2 was incorrectly classified as a 7.
⋄ The 1 in the row for digit 5 (column index 9) indicates that one 5 was incorrectly classified as a 9.
⋄ The 1 in the row for digit 7 (column index 9) indicates that one 7 was incorrectly classified as a 9.
Classification Report
⋄ The sklearn.metrics module also provides the function classification_report, which produces a table of classification metrics based on the expected and predicted values.
from sklearn.metrics import classification_report
names = [str(name) for name in digits.target_names]
print(classification_report(expected, predicted, target_names=names))
###############################################################
OUTPUT:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        53
           1       1.00      1.00      1.00        42
           2       1.00      0.98      0.99        41
           3       1.00      1.00      1.00        52
           4       1.00      1.00      1.00        47
           5       1.00      0.97      0.99        39
           6       1.00      1.00      1.00        43
           7       0.98      0.98      0.98        48
           8       1.00      1.00      1.00        37
           9       0.96      1.00      0.98        48

    accuracy                           0.99       450
   macro avg       0.99      0.99      0.99       450
weighted avg       0.99      0.99      0.99       450
Confusion Matrix for Classification

                    Predicted: Positive     Predicted: Negative
Actual: Positive    True Positive (TP)      False Negative (FN)
Actual: Negative    False Positive (FP)     True Negative (TN)
Precision (P): Measures the proportion of correctly predicted positive instances out of all instances predicted as positive.

P = TP / (TP + FP)    (1)

Significance: High precision means fewer false positives, useful in spam detection.

Recall (R): Measures the proportion of correctly predicted positive instances out of all actual positive instances.

R = TP / (TP + FN)    (2)

Significance: High recall ensures fewer false negatives, critical in medical diagnosis.
F1-Score: Harmonic mean of precision and recall, balancing both metrics.

F1 = 2 × (P × R) / (P + R)    (3)

F1 ∈ [0, 1]. A high F1 score means the model balances precision and recall well, while a low F1 score means it struggles with one or both.
Significance: Especially useful when:
• Class distribution is imbalanced
• Both false positives and false negatives are important

Support: The number of actual occurrences of each class in the dataset.
Significance: Helps in understanding class distribution and ensures fair evaluation across all classes.
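A small arithmetic sketch of these formulas, using assumed (hypothetical) counts rather than values from the Digits run:

TP, FP, FN = 45, 5, 3                               # hypothetical counts for one class
precision = TP / (TP + FP)                          # P = TP / (TP + FP)
recall = TP / (TP + FN)                             # R = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
print(f'P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}')  # P=0.90  R=0.94  F1=0.92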
Visualizing the Confusion Matrix
• A heat map displays values as colors, often with values of higher magnitude
displayed as more intense colors.
• Seaborn’s graphing functions work with two-dimensional data. When using
a pandas DataFrame as the data source, Seaborn automatically labels its
visualizations using the column names and row indices.
import seaborn as sns
import pandas as pd
confusion_df = pd.DataFrame(confusion, index=range(10), columns=range(10))
axes = sns.heatmap(confusion_df, annot=True, cmap='nipy_spectral_r')
• annot=True: puts the count (e.g., 12, 0, 3, ...) inside each cell.
• cmap='nipy_spectral_r': the Matplotlib color map used for the heat map's cells and color bar.
• Matplotlib and Seaborn Colormaps:
⋄ In Matplotlib, colormaps are used to map numerical data values to
colors in plots like heatmaps, scatterplots, or image plots.
⋄ Seaborn also supports colormaps, especially in plots like heatmaps
(sns.heatmap) or kdeplots (sns.kdeplot).
• Common Color Maps (cmap) in Matplotlib
⋄ Sequential colormaps (e.g., viridis, plasma) are used for ordered data.
⋄ Diverging colormaps (e.g., coolwarm, PiYG) are for datasets that
have a meaningful midpoint (e.g., zero).
⋄ Qualitative colormaps (e.g., Set1, Pastel1) work well for categorical
data.
• Common Color Maps (cmap) in Seaborn
⋄ Blues: Shades of blue, where darker blue indicates higher values.
⋄ coolwarm: Blue for low values, red for high values.
⋄ viridis: A perceptually uniform color gradient (dark blue to yellow).
⋄ Greens or Reds: Single-color intensity changes.
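A quick, self-contained sketch (using a toy 3-by-3 matrix rather than the Digits confusion matrix) showing how switching cmap changes the same heat map:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

toy = pd.DataFrame(np.array([[50, 2, 0], [1, 47, 3], [0, 4, 43]]))
for cmap in ('Blues', 'coolwarm', 'viridis'):   # sequential, diverging, perceptually uniform
    sns.heatmap(toy, annot=True, cmap=cmap)     # same data, different color mapping
    plt.title(f'cmap={cmap!r}')
    plt.show()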
K-Fold Cross-Validation
⋄ K-Fold Cross-Validation is a technique to assess model performance by
repeatedly splitting data into training and testing sets.
⋄ Dataset is split into k equal-sized folds.
⋄ The model is trained k times, each time using a different fold as the
validation set, and the remaining folds for training.
⋄ The final score is the average of the k validation results.
⋄ For k=10, the dataset is split into ten folds F0–F9. In iteration 1, F0 serves as the validation set and F1–F9 form the training set; in iteration 2, F1 is the validation set, and so on, until each fold has been used exactly once for validation.


Using Scikit-Learn:
• Step 1: Import necessary libraries
from sklearn.model_selection import KFold, cross_val_score

• Step 2: Define K-Fold cross-validation


kfold = KFold(n_splits=10, random_state=1, shuffle=True)

• Step 3: Perform cross-validation


scores = cross_val_score(estimator=knn, X=digits.data, y=digits.target, cv=kfold)
print(scores)
##############################################################
OUTPUT:
[1.         0.98888889 0.98888889 0.96666667 0.98333333 0.98888889
 0.98888889 0.98882682 0.98882682 0.99441341]

• Step 4: Compute model performance


print(f'Mean accuracy: {scores.mean():.2%}')
print(f'Accuracy standard deviation: {scores.std():.2%}')
###########################################################
OUTPUT:
Mean accuracy: 98.78%
Accuracy standard deviation: 0.82%
Running Multiple Models to Find the Best One

• Choosing the Best Model: It’s hard to determine the best machine learning model in advance.
• Model Performance: Some models may perform better than others on a
given dataset.
• Scikit-learn’s Flexibility: Provides multiple models for quick training and
testing.
• Encouragement to Experiment: Running multiple models helps find the
best one.
• Comparing Models: Evaluating KNeighborsClassifier, SVC, and
GaussianNB.
• Ease of Testing: Scikit-learn allows easy testing of models with default
settings.
• Let’s use the techniques from the preceding section to compare several
classification estimators—KNeighborsClassifier, SVC and GaussianNB
(there are more).
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

estimators = {'KNeighborsClassifier': knn, 'SVC': SVC(), 'GaussianNB': GaussianNB()}

for estimator_name, estimator_object in estimators.items():
    kfold = KFold(n_splits=10, random_state=1, shuffle=True)
    scores = cross_val_score(estimator=estimator_object, X=digits.data,
                             y=digits.target, cv=kfold)
    print(f'{estimator_name:>20}: ' +
          f'Mean Accuracy: {scores.mean():.2f} ' +
          f'Accuracy SD: {scores.std():.2f}')
#############################################################
OUTPUT:
KNeighborsClassifier: Mean Accuracy: 0.99 Accuracy SD: 0.01
                 SVC: Mean Accuracy: 0.99 Accuracy SD: 0.01
          GaussianNB: Mean Accuracy: 0.84 Accuracy SD: 0.02

• Based on the results, it appears that we can get better accuracy from the KNeighborsClassifier and SVC estimators—at least when using the estimators’ default settings.
k-Nearest Neighbors (kNN) and Hyperparameter Tuning
• Hyperparameters are parameters set before training a model. In the k-
nearest neighbors (kNN) algorithm, k is a hyperparameter that determines
the number of nearest neighbors used for classification.
• The best value of k is determined through hyperparameter tuning. Testing
different values and evaluating their performance helps in selecting the
most optimal k.
• A common approach for evaluating different values of k is k-fold cross-
validation. In this process, the dataset is divided into k subsets, and the
model is trained and tested multiple times using different subsets.
• In practice, odd values of k are preferred to avoid ties. For the Digits dataset with the 10-fold cross-validation shown below, the highest mean accuracy (98.83%) was observed at k = 3, and accuracy tended to decrease as k increased further.
• A higher value of k smoothens decision boundaries, making the model less
sensitive to noise but potentially reducing accuracy.
• The computational cost of kNN increases with higher k, as more distances
need to be calculated to find the nearest neighbors. Efficient data handling
and computational resources are necessary for large datasets.
• The cross_validate function can be used to perform cross-validation while also measuring execution time, providing insights into both accuracy and computational efficiency (a short sketch appears after the output below).
from sklearn.neighbors import KNeighborsClassifier

for k in range(1, 20, 2):
    kfold = KFold(n_splits=10, random_state=1, shuffle=True)
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(estimator=knn, X=digits.data, y=digits.target, cv=kfold)
    print(f'k={k:2d}: Mean Accuracy: {scores.mean():.2%} Accuracy SD: {scores.std():.2%}')
####################################################################
OUTPUT:
k= 1: Mean Accuracy: 98.72% Accuracy SD: 0.70%
k= 3: Mean Accuracy: 98.83% Accuracy SD: 0.80%
k= 5: Mean Accuracy: 98.78% Accuracy SD: 0.82%
k= 7: Mean Accuracy: 98.50% Accuracy SD: 0.86%
k= 9: Mean Accuracy: 98.27% Accuracy SD: 1.01%
k=11: Mean Accuracy: 98.39% Accuracy SD: 0.88%
k=13: Mean Accuracy: 98.27% Accuracy SD: 0.98%
k=15: Mean Accuracy: 98.05% Accuracy SD: 1.12%
k=17: Mean Accuracy: 97.77% Accuracy SD: 1.14%
k=19: Mean Accuracy: 97.55% Accuracy SD: 1.15%
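As mentioned above, cross_validate reports fit and score times alongside the test scores; a hedged sketch follows (the choice of n_neighbors=3 here is illustrative):

from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
kfold = KFold(n_splits=10, random_state=1, shuffle=True)
results = cross_validate(KNeighborsClassifier(n_neighbors=3),
                         X=digits.data, y=digits.target, cv=kfold)
print(f"mean accuracy: {results['test_score'].mean():.2%}")
print(f"mean fit time: {results['fit_time'].mean():.4f}s, "
      f"mean score time: {results['score_time'].mean():.4f}s")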
Case Study: Time Series and Simple Linear Regression

• Simple linear regression models the relationship between an independent and dependent variable using a straight line.
• Previously, a regression model was applied to New York City’s January high
temperatures (1895–2018).
• The regression line was obtained using scipy.stats.linregress, and
predictions were made for future and past temperatures.
• Using Scikit-Learn for Simple Linear Regression, we re-implement the previous regression model using a Scikit-Learn estimator.
• We will visualize the data using seaborn.scatterplot for data points and
matplotlib.pyplot.plot for the regression line.
• We will make predictions using the regression model’s coefficient and
intercept.
Dataset Information
• Data file: ave_hi_nyc_jan_1895-2018.csv (located in the ch15 examples folder).
• Preprocessing Steps:
⋄ Load data into a Pandas DataFrame.
⋄ Rename the ’Value’ column to ’Temperature’.
⋄ Remove the trailing 01 (the month) from date values such as 189501 using floordiv(100), leaving just the year.
Code Snippet for Data Loading
import pandas as pd
nyc = pd.read_csv('ave_hi_nyc_jan_1895-2018.csv')
nyc.columns = ['Date', 'Temperature', 'Anomaly']
nyc.Date = nyc.Date.floordiv(100)
print(nyc)
################################################
OUTPUT:
     Date  Temperature  Anomaly
0    1895         34.2     -3.2
1    1896         34.7     -2.7
2    1897         35.5     -1.9
3    1898         39.6      2.2
4    1899         36.4     -1.0
..    ...          ...      ...
119  2014         35.5     -1.9
120  2015         36.1     -1.3
121  2016         40.8      3.4
122  2017         42.8      5.4
123  2018         38.7      1.3

[124 rows x 3 columns]


Splitting the Data for Training and Testing

• We use Scikit-Learn’s LinearRegression estimator for simple linear regression.
• Since scikit-learn estimators require two-dimensional input, we reshape
the Date column into a two-dimensional array.
• The Date column is the independent variable (X).
• The Temperature column is the dependent variable (y).
• Convert the Date column into a NumPy array.
• Reshape it to (n,1) format using .reshape(-1,1).
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    nyc.Date.values.reshape(-1, 1),
    nyc.Temperature.values, random_state=11)

print(X_train.shape)  # Output: (93, 1)
print(X_test.shape)   # Output: (31, 1)
Training the Model

• LinearRegression is used since simple linear regression is a special case of multiple linear regression.
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
linear_regression.fit(X=X_train, y=y_train)

• We can get the slope and intercept used in the y = mx + b calculation to make predictions.
• The slope is stored in the estimator’s coef_ attribute (m in the equation) and the intercept is stored in the estimator’s intercept_ attribute (b in the equation).
print(linear_regression.coef_)
print(linear_regression.intercept_)
######################
OUTPUT:
[0.02379319]
-8.877948351191982
Testing the Model

predicted = linear_regression.predict(X=X_test)

expected = y_test
print(f"{'Pred':>5}\t{'Esti':>5}")
for p, e in zip(predicted[::5], expected[::5]):
    print(f'{p:>5.2f}\t{e:>5.2f}')
#############################################
OUTPUT:
 Pred   Esti
37.66  36.20
38.52  40.90
36.26  35.50
36.23  34.70
38.14  32.60
37.19  34.30
36.69  38.90

Predicting Future Temperatures and Estimating Past Temperatures


predict = lambda x: linear_regression.coef_ * x + linear_regression.intercept_
print(predict(2025))  # [39.30326887]
print(predict(1825))  # [34.54463013]
Visualizing the Dataset with the Regression Line

import seaborn as sns

axes = sns.scatterplot(data=nyc, x='Date', y='Temperature', hue='Temperature',
                       palette='winter', legend=False)
axes.set_ylim(10, 60)

import numpy as np
x = np.array([min(nyc.Date.values), max(nyc.Date.values)])
y = predict(x)

import matplotlib.pyplot as plt
line = plt.plot(x, y)
Scatterplot Keyword Arguments:
• data: Specifies the DataFrame (e.g., nyc) containing the data to display.
• x and y: Indicate the column names for the x-axis and y-axis, respectively.
⋄ Example: x=’Date’, y=’Temperature’.
⋄ The corresponding values form x–y coordinate pairs for plotting.
• hue: Determines dot colors based on values in the specified column.
⋄ Example: hue=’Temperature’.
⋄ Adds visual interest, though not essential in this case.
• palette: Specifies the Matplotlib color map used to color the dots.
• legend=False: Omits the legend from the graph (default is True).
⋄ Not required in this example.
Evaluate Model Performance:
# 1. Mean Squared Error (MSE)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(expected, predicted)
print(f'MSE: {mse:0.2f}')
#############################
OUTPUT:
MSE: 17.55  # a lower MSE indicates more accurate predictions
Overfitting vs Underfitting
When creating a model, a key goal is to ensure that it is capable of making
accurate predictions for data it has not yet seen. Two common problems that
prevent accurate predictions are overfitting and underfitting:
Underfitting:
• Underfitting occurs when a model is too simple to make accurate predictions, based on its training data.
• For example, you may use a linear model, such as simple linear regression (y = β0 + β1x + ϵ), when in fact the problem really requires a non-linear model (y = β0 e^(β1x) + ϵ).
• For example, temperatures vary significantly throughout the four seasons.
If you’re trying to create a general model that can predict temperatures
year-round, a simple linear regression model will underfit the data.
Overfitting:
• Overfitting occurs when your model is too complex.
• That may be acceptable if your new data looks exactly like your training
data, but ordinarily that’s not the case.
• When you make predictions with an overfit model, new data that matches
the training data will produce perfect predictions, but the model will not
know what to do with data it has never seen.
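An illustrative sketch (not from the text): fitting the same noisy sine-shaped data with a degree-1 model that underfits and a degree-15 model that tends to overfit, then comparing training and test error.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = np.linspace(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)  # non-linear signal + noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for degree in (1, 15):  # degree 1 underfits the sine curve; degree 15 tends to overfit the noise
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f'degree {degree:>2}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}')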
15.5 Case Study: Multiple Linear Regression with the
California Housing Dataset

• We will be using the California Housing dataset (20,640 samples, 8 features) to perform multiple linear regression using all features for better housing price predictions.
• This approach is expected to yield more meaningful results than using
individual features.
• Visualization will be done with Matplotlib and Seaborn, and the user is
instructed to launch IPython with Matplotlib support.
ipython --matplotlib

Loading the Dataset


• Source: Derived from the 1990 U.S. Census.
• Sample Size: 20,640 samples, each representing a census block group
(600–3,000 people).
• Features (8 total):
• Median income (in tens of thousands; e.g., 8.37 = $83,700)
• Median house age (max = 52)
• Average number of rooms
• Average number of bedrooms
• Block population
• Average house occupancy
• Latitude
• Longitude
• Target (Output Variable): Median house value (in hundreds of thousands;
e.g., 3.55 = $355,000)
• Expectation:
• Features like more rooms, more bedrooms, or higher income likely
indicate higher house value.
• Combining all features allows for more accurate predictions through
multiple linear regression.
• Loading the Data
In [1]: from sklearn.datasets import fetch_california_housing
In [2]: california = fetch_california_housing()

• Displaying the Dataset’s Description


In [3]: print(california.DESCR)

In [4]: california.data.shape
Out[4]: (20640, 8)

In [5]: california.target.shape
Out[5]: (20640,)

In [6]: california.feature_names
Out[6]:
['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']
Exploring the Data with Pandas
• Import pandas and set some options:
In [7]: import pandas as pd
In [8]: pd.set_option('display.precision', 4)
In [9]: pd.set_option('display.max_columns', 9)
In [10]: pd.set_option('display.width', None)

• ’display.precision’ is the maximum number of digits to display to the right of each decimal point.
• ’display.max_columns’ is the maximum number of columns to display when you output the DataFrame’s string representation. We’ll have nine columns in the DataFrame—the eight dataset features in california.data and an additional column for the target median house values (california.target).
• ’display.width’ specifies the width in characters of your Command Prompt (Windows), Terminal (macOS/Linux) or shell (Linux). The value None tells pandas to auto-detect the display width when formatting string representations of Series and DataFrames, allowing pandas to use the full width of the terminal or screen when printing DataFrames.
• In [11]: california_df = pd.DataFrame(california.data,
      ...:                              columns=california.feature_names)

In [12]: california_df['MedHouseValue'] = pd.Series(california.target)

In [13]: california_df
Out[13]: (DataFrame display omitted)

In [14]: california_df.describe()
Out[14]: (summary statistics display omitted)
Visualizing the Features
• It’s helpful to visualize your data by plotting the target value against each
feature.
• To make our visualizations clearer, let’s use DataFrame method sample to
randomly select 10% of the 20,640 samples for graphing purposes:
In [15]: sample_df = california_df.sample(frac=0.1, random_state=10)

• Next, we’ll use Matplotlib and Seaborn to display scatter plots of each of
the eight features.
In [16]: import matplotlib.pyplot as plt

In [17]: import seaborn as sns

In [18]: sns.set(font_scale=2)

In [19]: sns.set_style('whitegrid')

In [20]: for feature in california.feature_names:
    ...:     plt.figure(figsize=(16, 9))
    ...:     sns.scatterplot(data=sample_df, x=feature,
    ...:                     y='MedHouseValue', hue='MedHouseValue',
    ...:                     palette='cool', legend=False)
Splitting the Data for Training and Testing
• In [21]: from sklearn.model_selection import train_test_split

In [22]: X_train, X_test, y_train, y_test = train_test_split(
    ...:     california.data, california.target, random_state=11)
    ...:

In [23]: X_train.shape
Out[23]: (15480, 8)

In [24]: X_test.shape
Out[24]: (5160, 8)

Training the Model


• In [25]: from sklearn.linear_model import LinearRegression

In [26]: linear_regression = LinearRegression()

In [27]: linear_regression.fit(X=X_train, y=y_train)
Out[27]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [28]: for i, name in enumerate(california.feature_names):
    ...:     print(f'{name:>10}: {linear_regression.coef_[i]}')
    ...:
    MedInc: 0.4377030215382206
  HouseAge: 0.009216834565797713
  AveRooms: -0.10732526637360985
 AveBedrms: 0.611713307391811
Population: -5.756822009298454e-06
  AveOccup: -0.0033845664657163703
  Latitude: -0.419481860964907
 Longitude: -0.4337713349874016

In [29]: linear_regression.intercept_
Out[29]: -36.88295065605547
Testing the Model
• In [30]: predicted = linear_regression.predict(X_test)

In [31]: expected = y_test

In [32]: predicted[:5]
Out[32]: array([1.25396876, 2.34693107, 2.03794745, 1.8701254 , 2.53608339])

In [33]: expected[:5]
Out[33]: array([0.762, 1.732, 1.125, 1.37 , 1.856])

Visualizing the Expected vs. Predicted Prices


• In [34]: df = pd.DataFrame()

In [35]: df['Expected'] = pd.Series(expected)

In [36]: df['Predicted'] = pd.Series(predicted)

In [37]: figure = plt.figure(figsize=(9, 9))

In [38]: axes = sns.scatterplot(data=df, x='Expected', y='Predicted',
    ...:                        hue='Predicted', palette='cool', legend=False)

In [39]: start = min(expected.min(), predicted.min())

In [40]: end = max(expected.max(), predicted.max())

In [41]: axes.set_xlim(start, end)
Out[41]: (-0.6830978604144491, 7.155719818496834)

In [42]: axes.set_ylim(start, end)
Out[42]: (-0.6830978604144491, 7.155719818496834)

In [43]: line = plt.plot([start, end], [start, end], 'k--')
Regression Model Metrics
• In [44]: from sklearn import metrics

In [45]: metrics.r2_score(expected, predicted)
Out[45]: 0.6008983115964333

In [46]: metrics.mean_squared_error(expected, predicted)
Out[46]: 0.5350149774449119
Choosing the Best Model
• In [47]: from sklearn.linear_model import ElasticNet, Lasso, Ridge

In [48]: estimators = {
    ...:     'LinearRegression': linear_regression,
    ...:     'ElasticNet': ElasticNet(),
    ...:     'Lasso': Lasso(),
    ...:     'Ridge': Ridge()
    ...: }

In [49]: from sklearn.model_selection import KFold, cross_val_score

In [50]: for estimator_name, estimator_object in estimators.items():
    ...:     kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    ...:     scores = cross_val_score(estimator=estimator_object,
    ...:         X=california.data, y=california.target, cv=kfold,
    ...:         scoring='r2')
    ...:     print(f'{estimator_name:>16}: ' +
    ...:           f'mean of r2 scores={scores.mean():.3f}')
    ...:
LinearRegression: mean of r2 scores=0.599
      ElasticNet: mean of r2 scores=0.423
           Lasso: mean of r2 scores=0.285
           Ridge: mean of r2 scores=0.599
15.6 Case Study: Unsupervised Machine Learning, Part
1—Dimensionality Reduction
• Unsupervised learning helps discover patterns in unlabeled data.
• Visualization is easy for low-dimensional data (e.g., date vs. temperature)
• For 3 variables, use 3D plots with libraries like Matplotlib or Seaborn.
• For high-dimensional data (e.g., the 64 features in the Digits dataset), direct visualization is not feasible.
• Use dimensionality reduction to project data into 2 or 3 dimensions.
• This can reveal clusters or patterns suggesting use of classification algorithms.
• Cluster analysis can help infer class labels by examining grouped samples.
• Reducing dimensions can also speed up model training and improve performance.
• Dimensionality reduction may also reduce model accuracy.
• The curse of dimensionality refers to the challenges of working with high-dimensional data.
• Correlated features can be removed to simplify models.
• The Digits dataset can be visualized in 2D by ignoring labels and applying
dimensionality reduction.
The Digits dataset has 10 labeled classes (digits 0–9). Ignoring these labels, we can apply dimensionality reduction to project the data into two dimensions for visualization.
Loading the Digits Dataset
%matplotlib inline

from sklearn.datasets import load_digits

digits = load_digits()

Creating a TSNE Estimator for Dimensionality Reduction


• We’ll use the TSNE estimator (from the sklearn.manifold module) to perform dimensionality reduction.
• This estimator uses an algorithm called t-distributed Stochastic Neighbor Embedding (t-SNE) to analyze a dataset’s features and reduce them to the specified number of dimensions.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=10)
Transforming the Digits Dataset’s Features into Two Dimensions
• In scikit-learn, dimensionality reduction involves training the estimator and transforming the data—either in two steps (fit and transform) or in one with fit_transform.

reduced_data = tsne.fit_transform(digits.data)
reduced_data.shape
# (1797, 2)

Visualizing the Reduced Data


• Now reduced to two dimensions, we’ll use Matplotlib’s scatter (not
Seaborn’s) to plot, as it returns a collection useful for later reuse.
import matplotlib.pyplot as plt

plt.scatter(reduced_data[:, 0], reduced_data[:, 1], color='black')


Visualizing the Reduced Data with Different Colors for Each Digit
• Clusters are visible, but it’s unclear if they represent the same digit.
• Use c=digits.target in plt.scatter to color dots by digit labels.
• Use cmap=plt.get_cmap('nipy_spectral_r', 10) for 10 distinct colors.
• Add plt.colorbar(dots) to show which color represents which digit.

dots = plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=digits.target,
                   cmap=plt.get_cmap('nipy_spectral_r', 10))
colorbar = plt.colorbar(dots)
• 10 clear clusters appear, suggesting digits are separable.
• This supports using supervised learning like k-nearest neighbors.
15.7 Case Study: Unsupervised Machine Learning, Part
2—k-Means Clustering

• k-means is a simple unsupervised learning algorithm for clustering unlabeled data.
• k = number of clusters specified in advance.
• Uses distance-based grouping, like k-nearest neighbors.
• Each cluster has a centroid (center point).
• Initial step: k centroids are randomly selected from the data.
• Samples are assigned to the nearest centroid.
• Centroids are iteratively updated, and samples are re-assigned until convergence.
• The algorithm’s results are:
⋄ A 1D array of labels indicating the cluster to which each sample
belongs and
⋄ A two-dimensional array of centroids representing the center of each
cluster.
Iris Dataset
• We use the Iris dataset from scikit-learn, often used for classification and
clustering.
• Though labeled, we’ll ignore labels to demonstrate unsupervised clustering.
• Later, we’ll compare clusters to true labels to check accuracy.
• It’s a toy dataset with:
• 150 samples
• 4 features: sepal length, sepal width, petal length, petal width (in
cm)
• 3 species: setosa, versicolor, virginica

Figure: The three Iris flower species: (a) Iris setosa, (b) Iris versicolor, (c) Iris virginica.


Loading the Iris Dataset
• %matplotlib inline

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.DESCR)

Checking the Numbers of Samples, Features and Targets


• iris.data.shape      # (150, 4)
iris.target.shape      # (150,)
iris.target_names      # array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
iris.feature_names
# ['sepal length (cm)',
#  'sepal width (cm)',
#  'petal length (cm)',
#  'petal width (cm)']
Exploring the Iris Dataset: Descriptive Statistics with Pandas
• import pandas as pd
pd.set_option('display.max_columns', 5)
pd.set_option('display.width', None)
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Series([iris.target_names[i] for i in iris.target])
print(iris_df)
##################################################################
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2

  species
0  setosa
1  setosa
2  setosa
• print(iris_df.describe())
################################################################
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count             150.00            150.00             150.00
mean                5.84              3.06               3.76
std                 0.83              0.44               1.77
min                 4.30              2.00               1.00
25%                 5.10              2.80               1.60
50%                 5.80              3.00               4.35
75%                 6.40              3.30               5.10
max                 7.90              4.40               6.90

       petal width (cm)
count            150.00
mean               1.20
std                0.76
min                0.10
25%                0.30
50%                1.30
75%                1.80
max                2.50

iris_df['species'].describe()
################################################################
count        150
unique         3
top       setosa
freq          50
Name: species, dtype: object
Visualizing the Dataset with a Seaborn pairplot
• Goal: Understand feature relationships by plotting feature pairs.
• Dataset: 4 features (sepal length, sepal width, petal length, petal width).
• Problem: Cannot plot 3+ features in 2D easily.
• Solution: Use pairplot to create a grid of pairwise plots.
• import seaborn as sns

sns.set(font_scale=1)
sns.set_style('whitegrid')
grid = sns.pairplot(data=iris_df, vars=iris_df.columns[0:4], hue='species')

• vars=iris_df.columns[0:4] → Selects the first four columns (the numerical features: sepal length, sepal width, petal length, petal width). These are used for pairwise comparisons.
• The graphs along the top-left-to-bottom-right diagonal show the distribution of just the feature plotted in that column, with the range of values (left-to-right) and the number of samples with those values (top-to-bottom).
Using a KMeans Estimator
Creating the Estimator

• from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=10)

Fitting the Model


• kmeans.fit(iris.data)

Comparing the Computer Cluster Labels to the Iris Dataset’s Target Values
• print(kmeans.labels_[0:50])
#########################################################################
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]

print(kmeans.labels_[50:100])
#########################################################################
[0 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2]

print(kmeans.labels_[100:150])
#########################################################################
[0 2 0 0 0 0 2 0 0 0 0 0 0 2 2 0 0 0 0 2 0 2 0 2 0 0 2 2 0 0 0 0 0 2 0 0 0
 0 2 0 0 0 2 0 0 0 2 0 0 2]

• When the training completes, the KMeans object contains:
⋄ A labels_ array with values from 0 to n_clusters - 1.
⋄ A cluster_centers_ array in which each row represents a centroid.
Dimensionality Reduction with Principal Component Analysis
• We’ll use the PCA estimator (from the sklearn.decomposition module)
to perform dimensionality reduction.
• This estimator uses an algorithm called principal component analysis to
analyze a dataset’s features and reduce them to the specified number of
dimensions.
Creating the PCA Object
• from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=10)

Transforming the Iris Dataset’s Features into Two Dimensions


• pca.fit(iris.data)
iris_pca = pca.transform(iris.data)
iris_pca.shape  # (150, 2)
Visualizing the Reduced Data
• iris_pca_df = pd.DataFrame(iris_pca, columns=['Component1', 'Component2'])
iris_pca_df['species'] = iris_df.species
axes = sns.scatterplot(data=iris_pca_df, x='Component1',
                       y='Component2', hue='species',
                       palette='cool', legend='brief')
iris_centers = pca.transform(kmeans.cluster_centers_)
import matplotlib.pyplot as plt
dots = plt.scatter(iris_centers[:, 0], iris_centers[:, 1],
                   s=100, c='k')
Choosing the Best Clustering Estimator
• from sklearn.cluster import DBSCAN, MeanShift, SpectralClustering, AgglomerativeClustering

estimators = {'KMeans': kmeans, 'DBSCAN': DBSCAN(), 'MeanShift': MeanShift(),
              'SpectralClustering': SpectralClustering(n_clusters=3),
              'AgglomerativeClustering': AgglomerativeClustering(n_clusters=3)}

import numpy as np
for name, estimator in estimators.items():
    estimator.fit(iris.data)
    print(f'\n{name}:')
    for i in range(0, 101, 50):
        labels, counts = np.unique(
            estimator.labels_[i:i+50], return_counts=True)
        print(f'{i}-{i+50}:')
        for label, count in zip(labels, counts):
            print(f'   label={label}, count={count}')
