Python Machine Learning 2

Index

Chapter 1 Introduction
1. Purpose
2. About the Execution Environment for Source Code
Chapter 2 For beginners
1. Simple Linear Regression with Synthetic Data to Predict Product
Pricing
2. Logistic Regression Visualization for Predicting Customer Purchase
3. K-Means Clustering: Customer Segmentation Problem
4. Decision Tree Classification for Customer Churn Prediction
5. Random Forest Feature Importance Analysis for Customer Churn
Prediction
6. Support Vector Machine Classification and Visualization Problem
7. Principal Component Analysis with Visualization for Customer Data
Analysis
8. Confusion Matrix Visualization in Model Evaluation
9. ROC Curve Analysis for Binary Classifier Performance
10. Time Series Data Generation and Basic Line Plot for Machine
Learning
11. Heatmap of Correlation Matrix for Customer Sales Data Analysis
12. Histogram of Feature Distributions for Machine Learning Model
Inputs
13. Box Plot Outlier Detection in Customer Sales Data
14. Data Normalization and Visual Comparison for Machine Learning
15. Creating and Visualizing Interaction Terms in Machine Learning
16. Visualizing Customer Preferences using t-SNE
17. Polynomial Feature Generation and Visualization in Machine
Learning
18. Custom Loss Function Visualization for a Manufacturing
Optimization Problem
19. Cross-Validation Results Visualization in Machine Learning
20. Visualizing Decision Boundaries of Machine Learning Classifiers
21. Gradient Descent Optimization Path Visualization in Machine
Learning
22. Clustering Evaluation Using Silhouette Score and Visualization
23. Comparison of Min-Max Scaling and Standardization in Machine
Learning
24. Visualizing Model Learning Curves in Python
25. Class Distribution Visualization Before and After Applying SMOTE
26. Visualizing Feature Importance in Machine Learning Models
27. Understanding Overfitting and Underfitting with Learning Curves in
Machine Learning
Chapter 3 For advanced
1. DBSCAN Clustering and Visualization for Customer Segmentation
2. Visualizing the Effectiveness of Hyperparameter Tuning in Machine
Learning Models
3. Bias-Variance Tradeoff in Predictive Modeling
4. Visualizing the Impact of Data Augmentation on Image Classification
5. Visualizing Multicollinearity Using VIF in a Marketing Dataset
6. Feature Selection Techniques and Their Impact on Model
Performance
7. Time Series Decomposition for Sales Forecasting
8. Building and Visualizing a Movie Recommendation System using
Collaborative Filtering
9. Exploratory Data Analysis Using Facet Grids for Customer
Segmentation
10. Creating Synthetic Data for Predicting House Prices
11. Understanding the Impact of Evaluation Metrics on Model Selection
12. Outlier Detection Techniques in Customer Data
13. Visualizing Cross-Validation Folds in Machine Learning
14. Visualizing Clustering Results with Centroids in Python
15. Visualizing Data Drift in Machine Learning Models with Feature
Comparison
16. Visualizing the Impact of Feature Engineering on Classification
Model Performance
17. Creating a Custom Metric and Visualizing Model Performance
18. Visualizing Class Imbalance in Binary Classification with Python
19. Visualizing Predictions vs Actual Values Using Linear Regression
20. Visualizing Feature Scaling Techniques for Machine Learning
21. Creating a Simple Recommender System Visualization Based on
Customer Preferences
22. Evaluating Clustering Performance Using Davies-Bouldin Index
23. Visualizing Data Pipeline Workflow with Machine Learning
Integration
24. Visualizing Feature Interactions in a Predictive Model
25. Visualizing the Impact of Regularization Techniques on Overfitting
in a Machine Learning Model
26. Ensemble Learning: Visualizing Decision Boundaries in Bagging
and Boosting Methods
27. Visualizing ROC AUC for Multi-Class Classification Using Python
28. Visualizing Word Embeddings Using PCA
29. Visualizing Sequential Data using Recurrent Neural Networks
(RNNs)
30. Visualizing the Impact of Activation Functions on Neural Network
Output
31. Visualizing Forecasting Models Using Linear Regression and Time
Series Data
32. Visualizing the Impact of Dropout in a Neural Network for Customer
Churn Prediction
33. Visualizing Error Distributions in Machine Learning Models
34. Visualizing Feature Distributions After Scaling Transformation
35. Visualizing Customer Decision-Making with Decision Trees
36. Evaluating Regression Models with Residual Plots to Improve
Predictions
37. PCA and Biplots for Customer Data Analysis
38. Visualizing Neural Network Training History Using Matplotlib
39. Building and Visualizing a Multi-Output Regression Model for
Predicting House Prices and Rental Rates
40. Comparative Performance of Machine Learning Algorithms on a
Sales Prediction Dataset
41. Evaluating Clustering Algorithm Performance on Customer
Segmentation
42. Visualizing Decision Thresholds in Binary Classification
43. Visualizing Feature Importances in Decision Tree Models
44. Visualizing Confusion Matrices for Multi-Class Classification
Problems
45. Visualizing the Distribution of Customer Product Preferences
46. Visualizing Steps in a Machine Learning Data Pipeline
47. Comparing Model Performance Metrics: A Practical Approach
48. Visualizing the Effect of Noise on Model Performance in Machine
Learning
49. Creating a Classifier Using Ensemble Methods with Visualization
50. Visualizing Data Generation for Binary Classification Problems
51. Visualizing Time Series Model Performance Using Python
52. Visualizing Text Classification Results Using a Confusion Matrix
53. Impact of Sample Size on Model Accuracy in Predicting Customer
Purchases
54. Visualizing Learning Curves for Machine Learning Models
55. Visualizing the Sensitivity of Machine Learning Model Predictions
56. Visualizing the Impact of Feature Engineering on Classification
Performance
57. Creating and Visualizing a Neural Network for Customer Image
Classification
58. Visualizing Aggregated Predictions from Multiple Machine Learning
Models
59. Visualizing the Relationship Between Features and a Target in a
Customer Churn Problem
60. Visualizing Model Training Accuracy and Loss Over Time
61. Visualizing Residuals in a Regression Model for Business Sales
Prediction
62. Visualizing K-Means Clustering with Customer Purchase Data
63. Visualizing Time Series Predictions Using Machine Learning
64. Visualizing Model Predictions for House Price Prediction
65. Visualizing Outliers with Boxplots in Machine Learning Data
66. Visualizing Feature Importance in a Customer Churn Prediction
Model
67. Visualizing Reinforcement Learning Agent's Training Performance
in a Maze
68. Visualizing the Impact of Different Kernel Functions in SVM
69. Visualizing Data Preprocessing in Machine Learning
70. Visualizing the Impact of Hyperparameters on a Decision Tree
Classifier
71. Detecting Anomalies in Time Series Data Using Machine Learning
72. Visualizing the Voting Classifier for Customer Churn Prediction
73. Visualizing the Impact of Train-Test Splits in Machine Learning
74. Using Violin Plots to Visualize Categorical Data in Customer
Satisfaction
75. Visualizing Feature Selection Process in Machine Learning
76. Visualizing Hyperparameter Optimization for a Customer
Classification Model
77. Visualizing Model Performance Across Data Subsets in a Sales
Prediction Scenario
78. Visualizing the Impact of Neural Network Layers on Model
Accuracy
79. Visualizing the Gradient Boosting Process for Customer Churn
Prediction
80. Visualizing the Outputs of a Simple Recurrent Neural Network
(RNN) for Time-Series Forecasting
81. Visualizing Data Transformation in Machine Learning
82. Visualizing the Impact of Feature Scaling in Machine Learning
83. Visualizing Cross-Validation Score Distributions in Machine
Learning Models
84. Visualizing Algorithmic Complexity in Machine Learning
85. Visualizing the Impact of Hyperparameter Tuning on Model
Accuracy
86. Visualizing a Decision Tree for Customer Churn Prediction
87. Visualizing Forecasting Errors Using Machine Learning
88. Visualizing the Results of a Machine Learning Classification Task
Chapter 4 Request for review evaluation
Appendix: Execution Environment
Chapter 1 Introduction
1. Purpose

This book is designed for those who already have a basic understanding of
programming and want to dive into Python-based machine learning through
hands-on practice.
With 100 targeted exercises, it provides a structured approach to developing
and refining your skills. Each exercise includes clear source code and visual
output, making it easier to grasp complex concepts.
Detailed explanations accompany every solution, helping you to see not only
how the code works but also why it works. Whether you're on your
commute or have a few spare moments, simply reading through the
exercises can expand your knowledge.
Running the code yourself will deepen your understanding further.
This format is ideal for anyone looking to strengthen their grasp of machine
learning by actively working through problems and solutions. Enjoy the
journey as you explore Python and machine learning through practical,
visual examples.
2. About the Execution Environment for Source
Code
For information on the execution environment used for the source code in
this book, please refer to the appendix at the end of the book.
Chapter 2 For beginners
1. Simple Linear Regression with Synthetic Data to
Predict Product Pricing
Importance★★★★★
Difficulty★★★☆☆
A company is analyzing the relationship between advertising costs and
product prices to determine if there is a correlation that could help optimize
pricing strategies. Your task is to build a simple linear regression model to
predict product prices based on advertising spending. Using the synthetic
data provided in the code, you need to:
‣ Create synthetic data representing advertising costs as the independent
variable (X) and product prices as the dependent variable (y).
‣ Fit a simple linear regression model.
‣ Visualize the data points and the linear regression line.
【Data Generation Code Example】
import numpy as np
import matplotlib.pyplot as plt

# Create synthetic data
X = np.linspace(100, 1000, 50)  # Advertising costs
y = 5 * X + np.random.randn(50) * 200  # Product prices with noise

【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate synthetic data
X = np.linspace(100, 1000, 50)  # Advertising costs
y = 5 * X + np.random.randn(50) * 200  # Product prices with noise

# Reshape X into a 2D column, as scikit-learn expects
X = X.reshape(-1, 1)

# Fit a linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict values
y_pred = model.predict(X)

# Plot the data points and regression line
plt.scatter(X, y, label='Actual Data')
plt.plot(X, y_pred, color='red', label='Regression Line')
plt.title('Advertising Costs vs. Product Prices')
plt.xlabel('Advertising Costs ($)')
plt.ylabel('Product Prices ($)')
plt.legend()
plt.show()

In this exercise, we perform a simple linear regression to model the
relationship between advertising costs (independent variable) and product
prices (dependent variable). Linear regression predicts the value of the
dependent variable from one or more independent variables by fitting a
straight line through the data. The steps in this process are:
‣ First, synthetic data is created using NumPy. We generate advertising
costs in the range of 100 to 1000 using np.linspace, which returns 50 evenly
spaced values.
‣ We then create the product prices as a linear function of advertising costs
(slope 5) with added random noise (np.random.randn(50) * 200). The noise
ensures that the data isn't perfectly linear, simulating real-world
variability. Because no random seed is set, the exact points differ from run
to run.
‣ The next step is to reshape the independent variable X, because
scikit-learn expects input variables in a 2D array format, even if there is
just a single feature. We do this using .reshape(-1, 1).
‣ The LinearRegression model from sklearn.linear_model is then fitted to the
data using the .fit() method. This trains the model by finding the slope and
intercept that minimize the sum of squared differences between predicted and
actual product prices.
‣ After fitting, we use .predict() to generate the predicted product prices
(y_pred) based on the model. This gives us the values for the regression
line.
‣ Finally, we visualize the results. We use plt.scatter() to plot the actual
data points and plt.plot() to draw the regression line. The labels, title,
and legend help explain the plot.
This task reinforces concepts of supervised learning, model fitting, and data
visualization using synthetic data, an important first step in understanding
linear models.
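To make the fitting step more concrete, the sketch below computes the slope and intercept with the closed-form least-squares formulas and compares them with what LinearRegression finds. It is a minimal illustration rather than part of the original answer: the fixed seed and the variable names slope and intercept are assumptions added here for repeatability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fixed seed so the run is repeatable (an assumption; the exercise sets no seed)
rng = np.random.default_rng(0)
X = np.linspace(100, 1000, 50)               # Advertising costs
y = 5 * X + rng.standard_normal(50) * 200    # Product prices with noise

# Closed-form ordinary least squares for a single feature:
# slope = cov(X, y) / var(X), intercept = mean(y) - slope * mean(X)
slope = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
intercept = y.mean() - slope * X.mean()

# The same fit with scikit-learn
model = LinearRegression().fit(X.reshape(-1, 1), y)

print(f"manual : slope={slope:.4f}, intercept={intercept:.4f}")
print(f"sklearn: slope={model.coef_[0]:.4f}, intercept={model.intercept_:.4f}")
print(f"R^2 on the training data: {model.score(X.reshape(-1, 1), y):.4f}")
```

The two pairs of numbers should agree up to floating-point error, and the fitted slope should land near the true value of 5 used to generate the data.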
【Trivia】
Did you know? The term "regression" dates back to the 19th century, when
Sir Francis Galton studied how the heights of children relate to the heights
of their parents and observed "regression toward the mean". The
least-squares method used to fit the line is older still: it was published by
Legendre in 1805 and by Gauss in 1809. Linear regression is now one of the
foundational techniques in machine learning!
2. Logistic Regression Visualization for Predicting
Customer Purchase
Importance★★★★☆
Difficulty★★★☆☆
A company wants to predict whether a customer will purchase a product
based on the number of visits they make to the company website. You are
tasked with building a logistic regression model to help them visualize the
probability of a customer making a purchase depending on the number of
visits.
Use logistic regression to predict and plot the probability that a customer
makes a purchase (binary classification: purchase = 1, no purchase = 0)
based on the number of website visits.
Generate a dataset where the number of website visits varies between 1 and
10, and create a logistic regression model to predict the likelihood of
purchase. Plot the decision boundary and the probability curve using
Matplotlib.
【Data Generation Code Example】
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Generate sample data
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
【Diagram Answer】

【Code Answer】
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# Generate sample data
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

# Initialize and train logistic regression model
model = LogisticRegression()
model.fit(X, y)

# Predict probabilities on a fine grid of visit counts
X_test = np.linspace(1, 10, 100).reshape(-1, 1)
y_prob = model.predict_proba(X_test)[:, 1]

# Plot the data and probability curve
plt.scatter(X, y, color='red', label='Data Points')
plt.plot(X_test, y_prob, label='Logistic Regression Curve')
plt.xlabel('Number of Website Visits')
plt.ylabel('Probability of Purchase')
plt.title('Customer Purchase Probability Based on Website Visits')
plt.legend()
plt.show()

Logistic regression is a supervised learning algorithm that predicts the
probability of a binary outcome. In this example, the binary outcome is
whether a customer makes a purchase (1) or not (0), and the input feature is
the number of website visits.
The logistic regression model maps the input (number of visits) to a
probability between 0 and 1 using the sigmoid function, making it ideal for
binary classification problems.
In this exercise, the data consists of a set of customer visit numbers
(ranging from 1 to 10) and whether they made a purchase. The logistic
regression model is fitted to this data using the fit() function, which
adjusts the model coefficients to best describe the relationship between
website visits and the likelihood of a purchase.
Once trained, the model can be used to predict the probability that a
customer will make a purchase for any given number of website visits. This
prediction is made using the predict_proba() function, which returns the
probability of each class (in this case, "no purchase" and "purchase"); the
code keeps column 1, the probability of a purchase.
The results are then visualized using Matplotlib. The red scatter plot points
show the original data points, while the smooth curve represents the logistic
regression's prediction of the probability of purchase based on the number
of visits. The curve demonstrates how logistic regression maps the number
of website visits to a probability between 0 and 1, providing a clear visual
understanding of how the model works.
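The mapping through the sigmoid function can be checked by hand. The sketch below rebuilds the probability curve from the fitted coefficient and intercept and locates the decision boundary, the number of visits at which the predicted probability crosses 0.5. The helper sigmoid and the names w, b, and boundary are illustrative additions, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
model = LogisticRegression().fit(X, y)

w = model.coef_[0, 0]     # weight on the number of visits
b = model.intercept_[0]   # bias term

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Recompute P(purchase) by hand and compare with predict_proba
X_test = np.linspace(1, 10, 100).reshape(-1, 1)
manual_prob = sigmoid(w * X_test.ravel() + b)
sklearn_prob = model.predict_proba(X_test)[:, 1]
print("max difference:", np.max(np.abs(manual_prob - sklearn_prob)))

# The decision boundary is where w * x + b = 0, i.e. P(purchase) = 0.5
boundary = -b / w
print(f"decision boundary at about {boundary:.2f} visits")
```

Adding plt.axvline(boundary, linestyle='--') to the plot from the answer would draw this boundary as a vertical line on the probability curve.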
【Trivia】
Logistic regression is often used not just in marketing and customer
behavior prediction but also in medical fields to predict the likelihood of
diseases, in finance to detect fraudulent activities, and in many other
domains where binary decisions need to be modeled.
3. K-Means Clustering: Customer Segmentation
Problem
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to understand the behavior of its customers based
on their purchase patterns. You are tasked with segmenting the customers
into different groups based on their annual spending on two product
categories: 'Electronics' and 'Furniture'.
Using K-Means clustering, create a model to group these customers into 3
clusters. Visualize the clusters to gain insights into different customer
groups.
The data should be generated within the code itself, containing 2 features
('Electronics' spending and 'Furniture' spending).
【Data Generation Code Example】
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 2) * 10000  # Generate spending data


【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Create random data
np.random.seed(42)
X = np.random.rand(100, 2) * 10000  # 100 samples with two features

# K-Means clustering
# n_init=10 runs ten initializations and keeps the best one;
# random_state makes the result repeatable
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

# Visualization
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red', marker='x', label='Centroids')
plt.title('Customer Segmentation using K-Means')
plt.xlabel('Annual Spending on Electronics ($)')
plt.ylabel('Annual Spending on Furniture ($)')
plt.legend()
plt.show()

The objective of this exercise is to apply K-Means clustering to group
customers based on their spending patterns in two product categories:
Electronics and Furniture.
In this example, we generate a dataset using np.random.rand, which creates
100 random data points scaled to represent annual spending amounts in two
categories (Electronics and Furniture).
The KMeans algorithm from the sklearn.cluster module is used to divide the
customers into three clusters. K-Means minimizes the within-cluster sum of
squared distances by alternating between two steps: assigning each point to
its nearest cluster center, and moving each center to the mean of the points
assigned to it.
Calling fit_predict fits the model on the generated data and returns the
cluster label (0, 1, or 2) assigned to each customer.
The results are visualized using matplotlib. The scatter plot shows customers
as points in the two-dimensional space of spending on Electronics and
Furniture, and each cluster is color-coded. The red 'x' markers represent the
cluster centroids, which are the average locations of customers in each
group.
This visualization helps us understand customer segmentation by identifying
patterns in their spending behavior. The K-Means algorithm is widely used in
customer segmentation tasks to target marketing strategies, personalize
offers, or optimize services.
【Trivia】
‣ K-Means is sensitive to the initialization of cluster centroids, which can
lead to different results depending on the random seed.
‣ One common method to avoid this is to run K-Means multiple times with
different initializations and choose the best clustering solution based on
inertia.
‣ The "Elbow Method" is often used to determine the optimal number of
clusters by plotting the sum of squared distances from each point to its
cluster center.
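The Elbow Method from the last point can be sketched as follows. This is an illustrative addition: the range of k values (1 to 8) and the n_init and random_state settings are assumptions chosen here, not values taken from the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(42)
X = np.random.rand(100, 2) * 10000  # Same spending data as in the exercise

# Fit K-Means for several cluster counts and record the inertia
# (sum of squared distances from each point to its cluster center)
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 9), inertias):
    print(f"k={k}: inertia={inertia:,.0f}")
```

Plotting inertias against k with plt.plot(range(1, 9), inertias, marker='o') gives the elbow chart. Because the data here is uniformly random, there is no natural grouping, so the curve will probably bend only gently; on data with real clusters the bend is usually easier to spot.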
4. Decision Tree Classification for Customer Churn
Prediction
Importance★★★★★
Difficulty★★★☆☆
You are working for a telecommunications company that is facing issues
with customer churn. Churn occurs when customers stop using the
company's service. The company wants to predict which customers are
likely to churn so that they can take proactive steps to retain them.
Your task is to build a Decision Tree classification model to predict
customer churn based on features like age, monthly charges, and contract
duration. You should generate a simple dataset, train the model, and
visualize the decision tree to understand how the classifier works. Use the
provided data structure to train and evaluate the model.
【Data Generation Code Example】
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Generate synthetic dataset with age, monthly charges, contract
# duration, and churn outcome
data = pd.DataFrame({
    'age': np.random.randint(18, 70, 100),
    'monthly_charges': np.random.uniform(20, 150, 100),
    'contract_duration': np.random.randint(1, 36, 100),
    'churn': np.random.choice([0, 1], 100)
})
X = data[['age', 'monthly_charges', 'contract_duration']]
y = data['churn']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
【Diagram Answer】

【Code Answer】
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Generate synthetic dataset for customer churn analysis
data = pd.DataFrame({
    'age': np.random.randint(18, 70, 100),
    'monthly_charges': np.random.uniform(20, 150, 100),
    'contract_duration': np.random.randint(1, 36, 100),
    'churn': np.random.choice([0, 1], 100)
})
X = data[['age', 'monthly_charges', 'contract_duration']]
y = data['churn']

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Initialize and train a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Plot the decision tree
plt.figure(figsize=(12, 8))
plot_tree(clf,
          feature_names=['age', 'monthly_charges', 'contract_duration'],
          class_names=['No Churn', 'Churn'], filled=True)
plt.show()

This exercise focuses on using a Decision Tree classifier to solve a
customer churn prediction problem. The Decision Tree algorithm is a
supervised learning method used for classification tasks. In this case, the
features (age, monthly_charges, and contract_duration) are used to predict
whether a customer will churn (0 for no churn, 1 for churn).
The data is generated randomly with specified ranges:
‣ age is a random integer from 18 to 69 (np.random.randint excludes the
upper bound of 70).
‣ monthly_charges is a floating-point number between 20 and 150.
‣ contract_duration is an integer from 1 to 35 months (the upper bound of
36 is excluded).
‣ The target variable churn is randomly assigned as 0 or 1, indicating
whether the customer churned.
After generating the dataset, we split it into training and testing sets. The
model is trained using the training set, and then we use the plot_tree
function to visualize the decision tree structure.
The tree visualization helps us interpret how the model makes decisions. The
decision tree splits the data based on the features to classify whether a
customer is likely to churn or not. Because the churn labels here are
assigned at random, the splits fit noise rather than a real pattern; the
point of the exercise is the mechanics of training and plotting the tree.
The tree is plotted with plot_tree, where the feature names (age,
monthly_charges, and contract_duration) and class names (No Churn, Churn)
are used to make the tree interpretable. The filled=True argument colors the
nodes according to the majority class in that node, making it easy to
understand the classification boundaries.
By visualizing the tree, it becomes clear how the algorithm splits the data at
each level and how it arrives at the classification decision based on the
features.
【Trivia】
The Decision Tree algorithm can handle both categorical and continuous
features, making it versatile for various classification tasks (scikit-learn's
implementation expects categorical features to be numerically encoded
first). It uses a greedy algorithm to make splits that reduce the impurity
(often measured by Gini impurity or entropy) of the resulting subsets.
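The two impurity measures mentioned above are short enough to compute by hand. The sketch below defines them and checks the Gini value against the impurity scikit-learn stores for the root node of a fitted tree. The helper names gini, entropy, and weighted_gini and the tiny eight-sample dataset are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child sizes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Tiny illustrative dataset: one feature, perfectly separable at index 3
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

print("Gini of all labels    :", gini(y))
print("Entropy of all labels :", entropy(y))
print("Gini after ideal split:", weighted_gini(y[:3], y[3:]))

# scikit-learn stores each node's impurity; node 0 is the root
clf = DecisionTreeClassifier(random_state=42).fit(X, y)
root_impurity = clf.tree_.impurity[0]
print("Root impurity in tree :", root_impurity)
```

A greedy tree builder evaluates candidate splits and keeps the one with the lowest weighted impurity, which is why the ideal split in this toy example drives the value to zero.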
5. Random Forest Feature Importance Analysis for
Customer Churn Prediction
Importance★★★★☆
Difficulty★★★☆☆
You are working as a data scientist at a telecommunications company, and
your task is to predict customer churn (whether a customer will leave the
service) using various customer data. The company wants to understand
which features (e.g., age, monthly charges, contract type) are the most
important in predicting churn.
Use a Random Forest Classifier to fit the customer data and create a feature
importance plot that visually shows the most influential features in
predicting customer churn.
To help you get started, you are provided with a synthetic dataset that
contains features such as customer age, monthly charges, contract type, and
churn (0 for staying, 1 for leaving). Generate the dataset and perform the
analysis.
【Data Generation Code Example】
import numpy as np
import pandas as pd

np.random.seed(42)

# Create synthetic customer data
n_customers = 1000
age = np.random.randint(18, 70, n_customers)
monthly_charges = np.random.uniform(20, 120, n_customers)
contract_type = np.random.choice(
    ['Month-to-month', 'One year', 'Two year'], n_customers)
churn = np.random.choice([0, 1], n_customers, p=[0.8, 0.2])

# Convert contract type to numerical
contract_type_num = [0 if c == 'Month-to-month' else 1 if c == 'One year'
                     else 2 for c in contract_type]

# Create DataFrame
data = pd.DataFrame({'Age': age, 'MonthlyCharges': monthly_charges,
                     'ContractType': contract_type_num, 'Churn': churn})
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

np.random.seed(42)

# Generate synthetic customer data
n_customers = 1000
age = np.random.randint(18, 70, n_customers)
monthly_charges = np.random.uniform(20, 120, n_customers)
contract_type = np.random.choice(
    ['Month-to-month', 'One year', 'Two year'], n_customers)
churn = np.random.choice([0, 1], n_customers, p=[0.8, 0.2])
contract_type_num = [0 if c == 'Month-to-month' else 1 if c == 'One year'
                     else 2 for c in contract_type]
data = pd.DataFrame({'Age': age, 'MonthlyCharges': monthly_charges,
                     'ContractType': contract_type_num, 'Churn': churn})

# Define features and target
X = data[['Age', 'MonthlyCharges', 'ContractType']]
y = data['Churn']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Get feature importances
importances = clf.feature_importances_
features = X.columns

# Plot feature importances
plt.figure(figsize=(8, 6))
plt.barh(features, importances, color='skyblue')
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance for Customer Churn Prediction')
plt.show()

In this exercise, we use a Random Forest Classifier to identify the most
important features that influence customer churn.
We begin by creating a synthetic dataset that simulates customer data.
The dataset includes customer age, monthly charges, and contract type
(converted to numeric values). The target variable is whether the customer
has churned or not (binary classification).
Once the dataset is ready, the features (X) and the target variable (y) are
separated. We then split the data into training and testing sets using
train_test_split, reserving 30% of the data for testing the model's accuracy.
Next, we train a Random Forest Classifier. Random Forest is an ensemble
machine learning algorithm that builds multiple decision trees and merges
their outputs for more accurate predictions. After fitting the model to the
training data, we extract the feature importances.
Feature importance indicates how useful or valuable each feature is in
constructing the decision trees within the random forest. In this case, it
helps us understand which features contribute the most to predicting
customer churn. The feature importances are then plotted as a horizontal bar
chart, making it easy to visualize and interpret which features play the most
significant role.
This exercise helps reinforce the use of Random Forest for classification
problems and the importance of visualizing feature importance to gain
insights from the model.
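The extracted importances can also be inspected as numbers before plotting. The sketch below ranks them in a pandas Series and confirms that they sum to 1. It is a simplified variant of the answer code, not a copy: it uses np.random.default_rng and draws the contract type directly as an integer code, both of which are assumptions made to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic customers, similar in spirit to the exercise data
rng = np.random.default_rng(42)
n_customers = 1000
X = pd.DataFrame({
    'Age': rng.integers(18, 70, n_customers),
    'MonthlyCharges': rng.uniform(20, 120, n_customers),
    'ContractType': rng.integers(0, 3, n_customers),  # 0, 1, or 2
})
y = rng.choice([0, 1], n_customers, p=[0.8, 0.2])

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Label each importance with its feature name and sort, largest first
ranked = pd.Series(clf.feature_importances_, index=X.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
print("sum of importances:", ranked.sum())
```

One caution when reading the ranking: churn is generated independently of every feature here, so no feature is truly predictive. Impurity-based importances tend to favor features with many distinct values, which means a continuous column such as MonthlyCharges may still come out on top.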
【Trivia】
Random Forest is less prone to overfitting compared to decision trees
because it averages multiple decision trees to make predictions. The
concept of bagging (bootstrap aggregating) helps in reducing variance,
leading to more stable and reliable models.
6. Support Vector Machine Classification and
Visualization Problem
Importance★★★★★
Difficulty★★★☆☆
A retail company wants to predict whether a customer will purchase a
premium product based on specific customer attributes such as age, income,
and spending behavior. The company has gathered data, and now your task
is to build a Support Vector Machine (SVM) model to classify potential
customers into two categories: "Premium Purchaser" and "Non-Premium
Purchaser." You need to:
‣ Create a dataset representing customer information with age and income
as features.
‣ Use an SVM classifier to separate the two classes visually.
‣ Display a 2D plot showing the decision boundary generated by the SVM
and the customer data points.
【Data Generation Code Example】
import numpy as np
from sklearn.datasets import make_blobs

# Create customer data using blobs to simulate "Premium Purchasers"
# and "Non-Premium Purchasers"
X, y = make_blobs(n_samples=100, centers=2, random_state=6,
                  cluster_std=2.5)
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Create synthetic customer data
X, y = make_blobs(n_samples=100, centers=2, random_state=6,
                  cluster_std=2.5)

# Initialize and train a linear SVM model
model = SVC(kernel='linear')
model.fit(X, y)

# Define function to plot decision boundary
def plot_decision_boundary(X, y, model):
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='autumn')
    ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # Create grid to evaluate the model
    xx = np.linspace(xlim[0], xlim[1], 30)
    yy = np.linspace(ylim[0], ylim[1], 30)
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Z = model.decision_function(xy).reshape(XX.shape)

    # Plot decision boundary and margins
    ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])

    # Plot support vectors
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
               s=100, linewidth=1, facecolors='none', edgecolors='k')

    plt.title('SVM Classification Decision Boundary')
    plt.xlabel('Customer Age')
    plt.ylabel('Customer Income')
    plt.show()

# Call the plot function
plot_decision_boundary(X, y, model)

This task focuses on building a simple Support Vector Machine (SVM)
classification model to separate two classes of data. In this case, the classes
represent two types of customers: those likely to purchase premium
products and those who are not. The SVM algorithm works by finding the
optimal hyperplane (a line in this 2D case) that separates the two classes
with the largest possible margin. This separation is represented visually as a
decision boundary.
The code begins by generating synthetic data using make_blobs, which
creates two distinct clusters of data points, simulating customer attributes
(age and income). The labels (y) represent whether each customer belongs
to the premium purchaser or non-premium purchaser class. The SVC() class
from the sklearn.svm module is then used to create a linear SVM model.
The fit() method is used to train the SVM model with the generated
customer data (X and y).
The plot_decision_boundary function creates a grid of points and calculates
the decision function values, which determine the boundary between the
two classes. The contours are drawn to show the decision boundary (solid
line) and the margins (dashed lines), and the support vectors (critical points
that influence the boundary) are highlighted with hollow circles.
The final plot displays:
Decision Boundary: The line separating the two classes.
Support Vectors: Points that lie closest to the boundary and define the
margins of the separation.
Customer Data: Plotted points in the 2D feature space (age and income).
This exercise helps beginners understand how an SVM works and how to
visualize its decision-making process.
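The link between the fitted model and the plotted lines can also be checked numerically. The following is a minimal sketch, assuming the same make_blobs data as in the answer code: for a linear kernel the decision function is w . x + b, the solid line is where it equals 0, the dashed lines are where it equals -1 and +1, and the distance between the dashed lines is 2 / ||w||. The names w, b, manual_scores, and margin_width are illustrative and do not appear in the original code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Same synthetic customer data as in the answer code
X, y = make_blobs(n_samples=100, centers=2, random_state=6, cluster_std=2.5)
model = SVC(kernel='linear').fit(X, y)

# For a linear kernel, the learned hyperplane is w . x + b = 0
w = model.coef_[0]
b = model.intercept_[0]

# Recompute the decision function by hand and compare with scikit-learn
manual_scores = X @ w + b
library_scores = model.decision_function(X)
print("Scores match:", np.allclose(manual_scores, library_scores))

# Distance between the two dashed margin lines (decision function = -1 and +1)
margin_width = 2.0 / np.linalg.norm(w)
print("Margin width:", margin_width)
print("Number of support vectors:", len(model.support_vectors_))
```

A wider margin generally means the two customer groups are easier to separate; overlapping clusters produce a narrower margin and more support vectors.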
【Trivia】
Support Vector Machines are particularly effective in high-dimensional
spaces and are commonly used in text classification tasks such as spam
detection.
7. Principal Component Analysis with Visualization
for Customer Data Analysis
Importance★★★★★
Difficulty★★★☆☆
A marketing company has gathered customer data from various surveys.
This data contains information on customer preferences for different
products. Your goal is to reduce the dimensionality of this data using
Principal Component Analysis (PCA) to help identify key trends that could
inform future marketing strategies. After reducing the data's dimensionality,
visualize the first two principal components in a scatter plot. Your scatter
plot should clearly show how the customers group based on their
preferences. Use the data provided in the code to perform the PCA.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_classification

# Generate a dataset with customer preferences
np.random.seed(0)
X, _ = make_classification(n_samples=100, n_features=5,
                           n_informative=3, n_classes=3)
【Diagram Answer】

【Code Answer】
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Generate a dataset with customer preferences
np.random.seed(0)
X, _ = make_classification(n_samples=100, n_features=5,
                           n_informative=3, n_classes=3)

# Perform PCA to reduce dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Create a scatter plot of the PCA result
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title('PCA Scatter Plot: Customer Preferences')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

Principal Component Analysis (PCA) is a dimensionality reduction
technique used to simplify large datasets while preserving important
information. In machine learning, this is useful when dealing with many
features, as it reduces complexity and helps visualize the data in fewer
dimensions.
Here, we first generate synthetic customer data using make_classification, a
function that creates a dataset for classification tasks with informative
features. We then use the PCA algorithm from the sklearn.decomposition
module, which reduces the dataset's dimensionality to two principal
components. These components are linear combinations of the original
features that capture the maximum variance in the data. The first principal
component explains the most variance, and the second explains the next
most variance.
Finally, we use matplotlib.pyplot to visualize the reduced dataset, plotting
the two principal components on a scatter plot. Customers with similar
preferences appear close together in this plot, which could reveal potential
trends or clusters within the data. This visualization is useful for identifying
patterns that can inform marketing strategies.
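How much information the two components retain can be read from the fitted PCA object. The sketch below assumes the same dataset shape as the exercise, but passes random_state=0 directly to make_classification (an assumption made here for reproducibility, in place of np.random.seed(0)). It prints the share of variance explained by each component and the weights that define them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic customer-preference data (random_state=0 is an illustrative choice)
X, _ = make_classification(n_samples=100, n_features=5,
                           n_informative=3, n_classes=3, random_state=0)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Fraction of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_
print("Explained variance ratio:", ratios)
print("Total variance retained:", ratios.sum())

# Each row holds the weights of the 5 original features in one component
print("Component weights shape:", pca.components_.shape)
```

If the retained variance is low, the 2D scatter plot hides much of the structure in the data, and more components may be worth inspecting.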
【Trivia】
PCA was invented by Karl Pearson in 1901 as a method of data
compression and interpretation. It is widely used in areas like image
compression, finance, and even bioinformatics for analyzing large datasets.
8. Confusion Matrix Visualization in Model
Evaluation
Importance★★★★★
Difficulty★★★☆☆
A client working for an e-commerce company has developed a machine
learning model to predict whether customers will buy a product based on
their online behavior. The goal is to evaluate the model's performance by
analyzing the predictions. The client has requested that you visualize the
confusion matrix to better understand the number of correct and incorrect
predictions (true positives, false positives, true negatives, and false
negatives). Using a simple classification model, such as a decision tree, train
a model with generated data, predict the outcomes, and plot the confusion
matrix.
Your task is:
Create synthetic data for this classification problem, using features like
'time spent on the website' and 'number of items viewed' as inputs and a
binary target variable 'purchase' (0 for no purchase, 1 for purchase).
Train a model on this data.
Predict using the model.
Plot and display the confusion matrix.
【Data Generation Code Example】
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2,
n_informative=2, n_redundant=0, random_state=42)
【Diagram Answer】

【Code Answer】
# Import necessary libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Create data
X, y = make_classification(n_samples=100, n_features=2,
                           n_informative=2, n_redundant=0, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train a decision tree classifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.title('Confusion Matrix')
plt.show()

The goal of this task is to provide a deeper understanding of how model
evaluation works using a confusion matrix, which is an essential tool in
classification problems. In machine learning, confusion matrices provide a
detailed breakdown of how a classification model performs by showing the
number of true positives (correctly predicted 1s), true negatives (correctly
predicted 0s), false positives (incorrectly predicted 1s), and false negatives
(incorrectly predicted 0s). This breakdown is more informative than just
accuracy, which might not capture the full picture, especially with
imbalanced data.
The process begins with generating synthetic data. We use the
make_classification() function to simulate a binary classification problem,
producing two features. This allows us to have a realistic dataset without
requiring external files.
Next, we split this dataset into training and testing subsets using
train_test_split(), so the model can learn from one part of the data and be
evaluated on another.
A simple decision tree classifier (DecisionTreeClassifier) is then trained
using the training data (X_train, y_train). This classifier learns to predict
whether a customer will make a purchase based on the provided features.
After training the model, we predict the target labels for the testing set
(X_test) using the predict() method.
To evaluate the model's performance, we use the confusion_matrix()
function to calculate the confusion matrix, which tells us how many
predictions were correct and how many were wrong.
Finally, we visualize this confusion matrix using ConfusionMatrixDisplay()
and matplotlib to generate a clear, easy-to-understand plot of the matrix.
This exercise not only teaches how to visualize confusion matrices but also
covers the end-to-end process of data creation, model training, prediction,
and evaluation, which are all critical skills in machine learning.
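The four cells of the matrix can also be unpacked and turned into summary metrics. The following is a small sketch using hand-written labels (y_true and y_pred are made-up values for illustration, not output from the decision tree above), so the counts are easy to verify by eye.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for ten customers
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, scikit-learn orders the cells as [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted purchases, how many were real
recall = tp / (tp + fn)      # of real purchases, how many were found

print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall)
```

Precision and recall separate the two kinds of error that accuracy merges into one number, which is why the matrix is more informative on imbalanced data.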
【Trivia】
The confusion matrix is a special case of the contingency table, a concept
introduced by the British statistician Karl Pearson in 1904.
9. ROC Curve Analysis for Binary Classifier
Performance
Importance★★★★☆
Difficulty★★★☆☆
You work as a data scientist for an e-commerce company that wants to
improve its fraud detection system. The current system classifies
transactions as either "fraudulent" or "non-fraudulent". You have developed
a new machine learning model and now need to evaluate its performance by
comparing it with the previous model using a Receiver Operating
Characteristic (ROC) curve. Your task is to visualize and compare the ROC
curves of two classifiers: the old model and the new model, using synthetic
data for simplicity. Generate synthetic data representing the true labels and
predicted probabilities for both models. Then, plot the ROC curve for each
model and compute the Area Under the Curve (AUC).
【Data Generation Code Example】
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Old model: strongly regularized (C=0.001 is an illustrative choice) so that
# it differs from the new model. With the default lbfgs solver, changing
# random_state alone would produce two identical models.
clf_old = LogisticRegression(C=0.001, random_state=42)
clf_old.fit(X_train, y_train)
y_prob_old = clf_old.predict_proba(X_test)[:, 1]

# New model: default regularization
clf_new = LogisticRegression(random_state=24)
clf_new.fit(X_train, y_train)
y_prob_new = clf_new.predict_proba(X_test)[:, 1]
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Generate synthetic classification data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                           random_state=42)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Train the old model. It is strongly regularized (C=0.001 is an illustrative
# choice) so that it differs from the new model; with the default lbfgs
# solver, changing random_state alone would produce two identical models.
clf_old = LogisticRegression(C=0.001, random_state=42)
clf_old.fit(X_train, y_train)

# Get the predicted probabilities from the old model
y_prob_old = clf_old.predict_proba(X_test)[:, 1]

# Train the new model (default regularization)
clf_new = LogisticRegression(random_state=24)
clf_new.fit(X_train, y_train)

# Get the predicted probabilities from the new model
y_prob_new = clf_new.predict_proba(X_test)[:, 1]

# Compute ROC curve and AUC for the old model
fpr_old, tpr_old, _ = roc_curve(y_test, y_prob_old)
auc_old = roc_auc_score(y_test, y_prob_old)

# Compute ROC curve and AUC for the new model
fpr_new, tpr_new, _ = roc_curve(y_test, y_prob_new)
auc_new = roc_auc_score(y_test, y_prob_new)

# Plot both ROC curves
plt.figure(figsize=(8, 6))
plt.plot(fpr_old, tpr_old, label=f'Old Model (AUC = {auc_old:.2f})')
plt.plot(fpr_new, tpr_new, label=f'New Model (AUC = {auc_new:.2f})')

# Plot diagonal line representing random guessing
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

This problem involves evaluating and comparing two machine learning
classifiers' performance using the ROC curve. A ROC curve is a graphical
representation of a model's performance, particularly for binary
classification tasks. It plots the true positive rate (TPR) against the false
positive rate (FPR) at various threshold settings.
Here, the goal is to compare the old fraud detection model and the new
model to determine which performs better based on the ROC curve and the
Area Under the Curve (AUC) score.
In the code, we start by generating synthetic binary classification data using
the make_classification function. The data represents 1000 samples with 20
features, where the target variable has two possible classes (fraudulent and
non-fraudulent).
The dataset is then split into training and testing sets using train_test_split.
We train two logistic regression models on the training data, one
representing the old model and the other representing the new model.
The predict_proba method of the models returns the predicted probabilities
for each class; we keep the column for the positive class (fraudulent
transactions) and store it in y_prob_old and y_prob_new.
The roc_curve function computes the FPR and TPR values needed to plot
the ROC curve for both models, while the roc_auc_score function computes
the AUC values, which summarize the overall performance of each model.
An AUC score of 1 indicates perfect classification, while a score of 0.5
implies no predictive power (random guessing).
In the plot, we visualize both the old and new models' ROC curves, and
include a dashed diagonal line representing a random classifier. The closer
the ROC curve is to the top-left corner, the better the model is at
distinguishing between the two classes.
By comparing the curves and AUC values, we can assess which model
performs better in detecting fraudulent transactions.
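To make the idea of "various threshold settings" concrete, the sketch below works through one point of a ROC curve by hand. The four labels and scores are made-up values chosen for illustration; the threshold of 0.5 is likewise an arbitrary choice.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels (1 = fraudulent) and predicted fraud probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One threshold gives one (FPR, TPR) point on the curve
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))
fn = np.sum((y_pred == 0) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
tpr_at_threshold = tp / (tp + fn)
fpr_at_threshold = fp / (fp + tn)
print("TPR:", tpr_at_threshold, "FPR:", fpr_at_threshold)

# roc_curve repeats this for every distinct score used as a threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print("AUC:", auc)
```

The AUC here equals the fraction of (fraudulent, non-fraudulent) pairs in which the fraudulent transaction receives the higher score: three of the four pairs.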
【Trivia】
The ROC curve was initially developed during World War II to detect
enemy aircraft using radar. Today, it is widely used in machine learning to
evaluate classifier performance across a range of decision thresholds.
10. Time Series Data Generation and Basic Line
Plot for Machine Learning
Importance★★★★☆
Difficulty★★☆☆☆
A company wants to forecast its monthly sales performance based on
historical data. They want to analyze trends over the past 12 months and
visualize this data. Generate random monthly sales data for a full year (12
months), plot the data, and display the trend using a line plot. The company
expects a simple visualization but needs to consider this task as part of their
larger machine learning pipeline for trend analysis. You will generate
synthetic data for sales, ranging between 100 and 1000 units for each
month. Plot this data and ensure the graph displays the months on the x-axis
and sales units on the y-axis.
【Data Generation Code Example】
import numpy as np

import pandas as pd

# Generating 12 months of sales data between 100 and 1000

months = np.arange(1, 13)

sales = np.random.randint(100, 1001, size=12)


【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

# Generating 12 months of sales data between 100 and 1000

months = np.arange(1, 13)

sales = np.random.randint(100, 1001, size=12)

# Plotting the sales data


plt.plot(months, sales, marker='o', linestyle='-', color='b', label='Sales')

plt.xlabel('Months') # X-axis label

plt.ylabel('Sales (Units)') # Y-axis label

plt.title('Monthly Sales Trend') # Title of the graph

plt.xticks(months) # Ensuring all months are shown on x-axis

plt.legend() # Displaying legend

plt.grid(True) # Adding a grid

plt.show() # Displaying the plot

This task involves generating random time-series data for monthly sales and
visualizing it using a line plot.
First, numpy is used to create an array of 12 months and randomly generate
sales data for each month between 100 and 1000. This synthetic data stands
in for real-world sales figures.
We then use the matplotlib library to plot the data. The plot() function
creates the line chart, where months are plotted on the x-axis and sales on
the y-axis. The marker='o' option adds circles at each data point, making it
easier to see individual values. The label argument provides a label for the
data line, which is later shown using plt.legend().
Additionally, the xlabel() and ylabel() functions add meaningful axis labels
to the chart, and the title() function adds a title. The xticks() call ensures that
all months from 1 to 12 are displayed on the x-axis. The grid(True) call adds
a grid to the plot, making it easier to interpret the data visually. Finally,
show() is used to display the resulting chart.
In machine learning, such visualizations are crucial for understanding the
underlying patterns in time-series data before model training. Trends,
seasonal variations, and outliers can be identified easily with such plots.
This step often precedes more complex analyses, like training predictive
models to forecast future sales.
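One common way to make a trend easier to see is to smooth the series with a moving average. The sketch below is an optional extension, not part of the original answer: it assumes a 3-month window and uses a seeded NumPy generator (both illustrative choices) together with pandas, which the exercise imports but does not otherwise use.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (seed chosen arbitrarily)
rng = np.random.default_rng(0)
months = np.arange(1, 13)

# Monthly sales between 100 and 1000 units, indexed by month number
sales = pd.Series(rng.integers(100, 1001, size=12), index=months, name="Sales")

# 3-month moving average; the first two months have no full window yet
rolling_mean = sales.rolling(window=3).mean()

print(pd.DataFrame({"Sales": sales, "3-Month Average": rolling_mean}))
```

Passing rolling_mean to plt.plot alongside the raw series would overlay the smoothed trend on the line chart.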
【Trivia】
Line plots are commonly used in time-series analysis because they
effectively visualize changes over continuous intervals.
11. Heatmap of Correlation Matrix for Customer
Sales Data Analysis
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to understand the relationships between various
sales metrics such as total sales, customer visits, marketing spend, and
advertising channels. You are tasked with analyzing the correlations
between these metrics to help the company optimize its marketing
strategies. Create a heatmap to visualize the correlation matrix of these
metrics using Python, and provide insights into how different variables are
related to one another.
The dataset should include the following columns:
Total_Sales: Total revenue generated from sales.
Customer_Visits: Number of customers visiting the store.
Marketing_Spend: Amount of money spent on marketing campaigns.
Online_Ads: Spend on online advertising channels.
TV_Ads: Spend on TV advertising channels.
Analyze this dataset and generate a correlation heatmap.
【Data Generation Code Example】
import pandas as pd

import numpy as np

# # Creating a random dataset for the problem

data = {

'Total_Sales': np.random.randint(10000, 100000, 100),

'Customer_Visits': np.random.randint(200, 2000, 100),


'Marketing_Spend': np.random.randint(1000, 10000, 100),

'Online_Ads': np.random.randint(500, 5000, 100),

'TV_Ads': np.random.randint(500, 5000, 100)


}

df = pd.DataFrame(data)
【Diagram Answer】

【Code Answer】
# # Importing necessary libraries

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

# # Creating the dataset


data = {
    'Total_Sales': np.random.randint(10000, 100000, 100),
    'Customer_Visits': np.random.randint(200, 2000, 100),
    'Marketing_Spend': np.random.randint(1000, 10000, 100),
    'Online_Ads': np.random.randint(500, 5000, 100),
    'TV_Ads': np.random.randint(500, 5000, 100)
}

df = pd.DataFrame(data)

# # Calculating the correlation matrix


corr_matrix = df.corr()

# # Creating a heatmap to visualize the correlation matrix

plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm",
linewidths=0.5)

plt.title("Heatmap of Correlation Matrix")

plt.show()

In this task, we create a synthetic dataset of various sales and marketing
metrics for a retail company. The goal is to analyze the relationships
between different variables like total sales, customer visits, and advertising
spend on various channels. To achieve this, we first generate random data
for each column and store it in a pandas DataFrame.
Once the data is ready, we compute the correlation matrix using the .corr()
function from pandas. The correlation matrix provides insight into how each
variable is related to others. Values near +1 indicate a strong positive
correlation, while values near -1 indicate a strong negative correlation.
Values close to 0 suggest no linear relationship. Because each column here
is drawn independently at random, the off-diagonal values in this exercise
will be close to 0; real sales data would typically show stronger
relationships.
Next, we create a heatmap using the seaborn library's heatmap() function,
which visually represents the strength of the correlations between variables.
The colors in the heatmap make it easier to interpret the data by highlighting
high and low correlations with different color intensities. The annot=True
argument adds numerical values directly to the heatmap, providing more
precise insights.
This process is valuable in machine learning as correlation analysis helps to
identify multicollinearity, which can impact model performance.
Understanding the relationships between features also assists in selecting
the most relevant variables for building predictive models.
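To see what a strong correlation looks like in the matrix, the columns have to depend on each other. The sketch below is a hypothetical variation on the exercise data: Total_Sales is built as a multiple of Marketing_Spend plus noise (the factor 5 and the noise level are arbitrary assumptions), while TV_Ads stays independent.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

# Marketing spend drives sales in this made-up example; TV ads do not
marketing_spend = rng.integers(1000, 10000, size=200)
total_sales = 5 * marketing_spend + rng.normal(0, 5000, size=200)
tv_ads = rng.integers(500, 5000, size=200)

df = pd.DataFrame({
    'Total_Sales': total_sales,
    'Marketing_Spend': marketing_spend,
    'TV_Ads': tv_ads,
})

# Pearson correlation between every pair of columns
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

In the printed matrix, the Total_Sales / Marketing_Spend entry is close to +1, while the entries involving TV_Ads stay near 0. The same matrix can be passed to sns.heatmap as in the answer code.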
【Trivia】
Did you know that correlation doesn't always imply causation? Even if two
variables show a strong correlation, it doesn't mean that one causes the
other to change. Understanding this distinction is crucial in data analysis,
especially when dealing with real-world business problems.
12. Histogram of Feature Distributions for Machine
Learning Model Inputs
Importance★★★★☆
Difficulty★★★☆☆
A customer in the e-commerce industry wants to analyze the distribution of
customer-related features such as "age," "spending score," and "income
level" to improve their marketing strategy. To aid this, you are tasked with
visualizing the distributions of these features using histograms to gain
insights into the data patterns. Create a dataset of 500 customers with
random "age," "spending score," and "income level." Then, generate
histograms for each feature to examine how the features are distributed
across the customer base.
【Data Generation Code Example】
import numpy as np

import pandas as pd

# # Generate random data for age, spending score, and income level
age = np.random.randint(18, 70, 500)

spending_score = np.random.randint(1, 100, 500)

income_level = np.random.randint(30000, 150000, 500)

# # Create a dataframe

df = pd.DataFrame({'Age': age, 'SpendingScore': spending_score,


'IncomeLevel': income_level})
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

# # Generate random data for age, spending score, and income level

age = np.random.randint(18, 70, 500)

spending_score = np.random.randint(1, 100, 500)

income_level = np.random.randint(30000, 150000, 500)

# # Create a dataframe

df = pd.DataFrame({'Age': age, 'SpendingScore': spending_score,


'IncomeLevel': income_level})

# # Plot histograms for each feature

plt.figure(figsize=(15, 5))

# # Histogram for Age

plt.subplot(1, 3, 1)
plt.hist(df['Age'], bins=15, color='skyblue', edgecolor='black')

plt.title('Age Distribution')

plt.xlabel('Age')

plt.ylabel('Frequency')

# # Histogram for Spending Score

plt.subplot(1, 3, 2)

plt.hist(df['SpendingScore'], bins=15, color='lightgreen',


edgecolor='black')

plt.title('Spending Score Distribution')

plt.xlabel('Spending Score')
plt.ylabel('Frequency')

# # Histogram for Income Level

plt.subplot(1, 3, 3)
plt.hist(df['IncomeLevel'], bins=15, color='salmon', edgecolor='black')

plt.title('Income Level Distribution')

plt.xlabel('Income Level')

plt.ylabel('Frequency')

# # Show the plot

plt.tight_layout()

plt.show()
In this exercise, we are visualizing the distribution of features for a dataset
containing customer information. This process helps in analyzing the
underlying patterns in the data before applying any machine learning
model. Feature distribution is important because it can reveal trends,
outliers, and possible biases within the data, which can directly impact
model performance.
Here, we create random data for three customer features: "age," "spending
score," and "income level." The data is generated using the
np.random.randint function to simulate realistic values for these features.
Once the dataset is generated, we use histograms to visualize each feature.
Histograms are useful because they show how often data points fall into
specific ranges, which can be used to identify common values and the
spread of the data.
We use Matplotlib's plt.hist function to generate each histogram. The bins
parameter determines the number of bars (bins) that the histogram will
display, while the color and edgecolor are used to customize the appearance.
The titles and labels make it easier to interpret each plot, and
plt.tight_layout() is used to ensure that the plots do not overlap. Finally,
plt.show() renders the visualizations.
Understanding how features are distributed helps to preprocess data more
effectively, such as by normalizing features that are skewed or removing
outliers that could distort model performance.
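The numbers behind a histogram can be inspected without drawing it. The sketch below uses np.histogram on a hypothetical age column (generated with a seeded generator, an assumption made for reproducibility) to show what the bins parameter produces: 15 counts and 16 bin edges.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

# 500 customer ages between 18 and 69, as in the exercise
age = rng.integers(18, 70, size=500)

# Split the range of ages into 15 equal-width bins and count the values
counts, edges = np.histogram(age, bins=15)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Each printed row corresponds to one bar of the histogram. Because the ages are drawn uniformly, the counts are roughly equal; a skewed feature would show counts piling up at one end.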
【Trivia】
Histograms are one of the oldest statistical tools, and they were first
introduced by Karl Pearson in the late 19th century to represent frequency
distributions.
13. Box Plot Outlier Detection in Customer Sales
Data
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to analyze its sales data to detect potential outliers
that may affect decision-making. Outliers can indicate fraud, data entry
errors, or special events that may need further investigation. The sales data
includes 1000 records representing daily sales in dollars for a specific
product. Your task is to generate the data, create a box plot for visual outlier
detection, and use Python to detect and visualize the outliers. Please write
code that generates a box plot, identifies the outliers in the sales data, and
displays the results.
【Data Generation Code Example】
import numpy as np

import pandas as pd

# Generate random sales data with potential outliers


sales_data = np.random.normal(1000, 250, 1000).tolist() + [5000, 5500,
6000]

df = pd.DataFrame(sales_data, columns=["Daily Sales"])


【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

# Generate random sales data with potential outliers

sales_data = np.random.normal(1000, 250, 1000).tolist() + [5000, 5500,


6000]

df = pd.DataFrame(sales_data, columns=["Daily Sales"])

# Create the box plot


plt.boxplot(df["Daily Sales"])

plt.title("Box Plot for Outlier Detection in Daily Sales")

plt.ylabel("Sales in Dollars")

plt.show()

# Identify outliers using IQR method

Q1 = df["Daily Sales"].quantile(0.25)

Q3 = df["Daily Sales"].quantile(0.75)

IQR = Q3 - Q1

outliers = df[(df["Daily Sales"] < Q1 - 1.5 * IQR) | (df["Daily Sales"] >


Q3 + 1.5 * IQR)]
print("Detected Outliers:\n", outliers)

In this task, we are detecting outliers in sales data using a box plot, a
graphical tool often used in machine learning and data analysis.
First, we generate synthetic data that mimics real-world sales figures. We
simulate daily sales using a normal distribution with a mean of 1000 dollars
and a standard deviation of 250. We also intentionally add a few extreme
values (5000, 5500, and 6000 dollars) to serve as outliers for this analysis.
This type of data generation helps us create a realistic environment where
outliers are present.
Next, a box plot is drawn to visualize the spread of the data. A box plot
shows the median, quartiles (Q1, Q3), and any points that are far away from
the main body of data (potential outliers). By default, the whiskers extend to
the most extreme data points that lie within 1.5 times the interquartile range
(IQR) of the quartiles, and points beyond the whiskers are considered
outliers.
We compute Q1 (the 25th percentile) and Q3 (the 75th percentile) to
calculate the IQR (Q3 - Q1). Any sales figures that fall below Q1 - 1.5 *
IQR or above Q3 + 1.5 * IQR are considered outliers. These are printed
after detection for further analysis.
This approach is commonly used in machine learning to clean and
preprocess data before feeding it into models. Detecting and handling
outliers is critical to avoid misleading results, and the box plot is a simple
yet powerful tool for this purpose.
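The fence and whisker arithmetic is easiest to follow on a handful of numbers. The sketch below uses eight made-up sales values (not the exercise data) so each step can be checked by hand; the variable names are illustrative.

```python
import numpy as np

# Eight hypothetical daily sales figures, one of them extreme
sales = np.array([10, 12, 13, 14, 15, 16, 18, 100])

# Quartiles and interquartile range
q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1

# Fences: values beyond these limits are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inliers = sales[(sales >= lower_fence) & (sales <= upper_fence)]
outliers = sales[(sales < lower_fence) | (sales > upper_fence)]

# Whiskers stop at the most extreme values that are still inside the fences
lower_whisker, upper_whisker = inliers.min(), inliers.max()

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Whiskers:", lower_whisker, upper_whisker)
print("Outliers:", outliers)
```

Here Q1 is 12.75 and Q3 is 16.5, so the fences fall at 7.125 and 22.125. The whiskers end at 10 and 18, and only the value 100 is flagged.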
【Trivia】
Did you know that John Tukey, an American mathematician, popularized the
box plot in his 1977 book Exploratory Data Analysis? It's a key tool in
exploratory data analysis and is especially useful for highlighting outliers in
datasets.
14. Data Normalization and Visual Comparison for
Machine Learning
Importance★★★★★
Difficulty★★★☆☆
You are working with a retail company that is analyzing customer purchase
patterns. The company wants to build a machine learning model to predict
future customer spending. However, the raw data contains features with
different scales, such as the number of items purchased (a small number)
and the total spending amount (a large number). The company needs your
help to normalize the data for better machine learning model performance
and to visually compare the impact of normalization.
Write a Python script that:
Generates synthetic data for customer purchases, including the number of
items purchased and the total spending amount.
Normalizes the data using MinMaxScaler and StandardScaler.
Plots both the original data and the normalized data for visual comparison.
【Data Generation Code Example】
import numpy as np

# Create sample data for customer purchases
np.random.seed(42)  # for reproducibility
# Columns: items purchased, total spending amount
data = np.random.randint(1, 50, size=(100, 2))
【Diagram Answer】

【Code Answer】
# Import necessary libraries

import numpy as np

import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Create sample data
np.random.seed(42)
# Columns: items purchased, total spending amount
data = np.random.randint(1, 50, size=(100, 2))

# Normalize data using MinMaxScaler and StandardScaler

min_max_scaler = MinMaxScaler()

standard_scaler = StandardScaler()

# Apply MinMaxScaler and StandardScaler to the data


data_min_max_scaled = min_max_scaler.fit_transform(data)

data_standard_scaled = standard_scaler.fit_transform(data)
# Plot original and normalized data

plt.figure(figsize=(12, 6))

# Original data plot

plt.subplot(1, 3, 1)

plt.scatter(data[:, 0], data[:, 1], color='blue')

plt.title('Original Data')

plt.xlabel('Items Purchased')

plt.ylabel('Total Spending Amount')

# MinMax Scaled data plot

plt.subplot(1, 3, 2)

plt.scatter(data_min_max_scaled[:, 0], data_min_max_scaled[:, 1],


color='green')

plt.title('MinMax Scaled Data')


plt.xlabel('Items Purchased')

plt.ylabel('Total Spending Amount')

# Standard Scaled data plot

plt.subplot(1, 3, 3)

plt.scatter(data_standard_scaled[:, 0], data_standard_scaled[:, 1],


color='red')

plt.title('Standard Scaled Data')


plt.xlabel('Items Purchased')

plt.ylabel('Total Spending Amount')

plt.tight_layout()

plt.show()

In this exercise, the goal is to understand how to normalize data before
applying machine learning models. This is essential because many machine
learning algorithms, especially those based on distance calculations (e.g.,
K-nearest neighbors) or gradient-based training (e.g., neural networks),
perform better when features are on a similar scale. Without normalization,
features with larger values can dominate the model's predictions simply
because of their units.
The MinMaxScaler transforms the data by scaling each feature between 0
and 1, which is especially useful when you want all features to fit within a
fixed range. On the other hand, StandardScaler normalizes the data by
removing the mean and scaling it to unit variance (mean of 0 and variance
of 1). This method is beneficial when you expect your data to follow a
normal distribution.
We first generate synthetic data with features representing the number of
items purchased and total spending amounts. We then apply both scalers to
the data and plot three figures: the original data, the MinMax-scaled data,
and the Standard-scaled data.
The visual comparison highlights how these scalers change the range of
each feature while preserving the relative positions of the points. By
comparing these methods, you gain insight into when and how to apply
normalization techniques depending on the specific requirements of your
machine learning model.
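Both scalers apply a simple per-column formula, which can be reproduced with NumPy. The sketch below assumes a spending column on a much larger scale than the item counts (an illustrative choice that differs from the answer code, where both columns share the same range) and checks the hand-written formulas against scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)  # seed chosen arbitrarily

# Hypothetical data: small item counts next to large spending amounts
items = rng.integers(1, 50, size=100)
spending = rng.uniform(10, 5000, size=100)
data = np.column_stack([items, spending])

# Min-max scaling: (x - min) / (max - min), computed per column
col_min, col_max = data.min(axis=0), data.max(axis=0)
manual_min_max = (data - col_min) / (col_max - col_min)

# Standardization: (x - mean) / std, using the population standard deviation
manual_standard = (data - data.mean(axis=0)) / data.std(axis=0)

# Compare with scikit-learn's scalers
sk_min_max = MinMaxScaler().fit_transform(data)
sk_standard = StandardScaler().fit_transform(data)
print("MinMax matches:", np.allclose(manual_min_max, sk_min_max))
print("Standard matches:", np.allclose(manual_standard, sk_standard))
```

After either transformation the two columns occupy comparable ranges, so neither dominates a distance calculation merely because of its units.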
【Trivia】
Data normalization is particularly important in algorithms like K-means
clustering and Support Vector Machines (SVM) because they are sensitive
to the scales of input features. Even simple linear models may perform
better after normalization if the feature scales are vastly different.
15. Creating and Visualizing Interaction Terms in
Machine Learning
Importance★★★★☆
Difficulty★★★☆☆
A retail company is analyzing customer purchase patterns to optimize their
marketing strategies. They have data on two key customer attributes:
'Annual Income' and 'Age.' The company believes that the interaction
between 'Annual Income' and 'Age' can reveal valuable insights into
customer spending behavior. Your task is to create a synthetic dataset
containing these attributes and an additional 'Spending Score' that depends
on them. You need to engineer an interaction term between 'Annual Income'
and 'Age,' then visualize its relationship with 'Spending Score' using a
scatter plot. Generate the dataset within your code, then visualize the
interaction term's effect on the 'Spending Score.'
【Data Generation Code Example】
import numpy as np
import pandas as pd

## Create synthetic data
np.random.seed(42)
annual_income = np.random.normal(60000, 15000, 200)
age = np.random.normal(40, 12, 200)
spending_score = (0.5 * annual_income + 2 * age
                  + np.random.normal(0, 1000, 200))

## Combine into DataFrame
data = pd.DataFrame({'Annual_Income': annual_income, 'Age': age,
                     'Spending_Score': spending_score})
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Create synthetic data
np.random.seed(42)
annual_income = np.random.normal(60000, 15000, 200)
age = np.random.normal(40, 12, 200)
spending_score = (0.5 * annual_income + 2 * age
                  + np.random.normal(0, 1000, 200))

## Combine into DataFrame and create interaction term
data = pd.DataFrame({'Annual_Income': annual_income, 'Age': age,
                     'Spending_Score': spending_score})
data['Income_Age_Interaction'] = data['Annual_Income'] * data['Age']

## Plot the interaction term vs. Spending Score
plt.scatter(data['Income_Age_Interaction'], data['Spending_Score'])
plt.xlabel('Income-Age Interaction')
plt.ylabel('Spending Score')
plt.title('Interaction Term vs. Spending Score')
plt.show()

Feature engineering is essential in machine learning as it can significantly
enhance a model's performance. An interaction term is created by combining
two or more features, reflecting how they might influence the target
variable in a non-additive manner. In this scenario, the interaction term
between 'Annual Income' and 'Age' helps us understand if certain
age-income combinations correlate with higher or lower spending scores.
Data Generation: We use the numpy library to generate synthetic data for
'Annual Income' and 'Age' based on normal distributions. The 'Spending
Score' is calculated using a linear combination of these features, with
some added noise to simulate real-world data variability. Because this
synthetic score is purely additive, any trend in the plot comes from the
interaction term's correlation with income and age rather than from a true
interaction effect.
Creating the Interaction Term: The interaction term is calculated by
multiplying 'Annual Income' by 'Age'. This new feature captures the
combined effect of both attributes, highlighting potential synergies between
income levels and age in spending patterns.
Visualization: Using matplotlib.pyplot, we plot the interaction term
against the 'Spending Score'. This scatter plot helps visualize any potential
relationship, indicating whether certain combinations of income and age
align with high or low spending scores. Such engineered features can
sometimes reveal patterns that a linear model on the raw features alone
might miss. In practice, the interaction term could be used in various
models to predict spending behavior more accurately.
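To illustrate when an interaction term actually helps a model, the sketch below compares a linear regression with and without the product feature. It is a minimal example under stated assumptions: unlike the exercise's data, the target here is deliberately generated with a genuine interaction effect (the coefficient 0.05 and the rescaled income in thousands are hypothetical choices made for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
income = rng.normal(60, 15, n)   # hypothetical: income in thousands
age = rng.normal(40, 12, n)

# Assumed target with a genuine interaction effect (coefficient 0.05)
score = 0.5 * income + 2.0 * age + 0.05 * income * age + rng.normal(0, 5, n)

X_additive = np.column_stack([income, age])
X_interact = np.column_stack([income, age, income * age])

# In-sample R^2 for the model without and with the interaction feature
r2_additive = LinearRegression().fit(X_additive, score).score(X_additive, score)
model = LinearRegression().fit(X_interact, score)
r2_interact = model.score(X_interact, score)

print(f"R^2 without interaction: {r2_additive:.4f}")
print(f"R^2 with interaction:    {r2_interact:.4f}")
print(f"Estimated interaction coefficient: {model.coef_[2]:.3f}")
```

When the target really contains a product effect, the model with the extra column fits better and its third coefficient estimates the strength of that effect.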
【Trivia】
‣ Interaction terms are often vital in linear regression models as they allow
for the representation of nonlinear relationships without transforming the
entire model.
‣ They can also help detect moderation effects, where one variable modifies
the effect of another.
16. Visualizing Customer Preferences using t-SNE
Importance★★★★☆
Difficulty★★★☆☆
A retail company has conducted a survey to understand its customers'
preferences across multiple product categories. The survey results include
various factors like quality, price, brand reputation, and customer loyalty,
resulting in high-dimensional data. The company needs a way to visualize
this complex data to identify patterns and customer segments. Your task is
to apply t-SNE to reduce the dimensions of this data and create a 2D scatter
plot that visualizes the clusters of customer preferences. This visualization
will help the company gain insights into different customer groups based on
their preferences. Implement a solution in Python to achieve this. The data
should be generated within the code, and you should produce a scatter plot
that clearly visualizes the clusters.
【Data Generation Code Example】
import numpy as np
from sklearn.datasets import make_blobs

## Generating synthetic high-dimensional data representing customer preferences
data, labels = make_blobs(n_samples=300, centers=4, n_features=10,
                          random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

## Generating synthetic high-dimensional data
data, labels = make_blobs(n_samples=300, centers=4, n_features=10,
                          random_state=42)

## Applying t-SNE for dimensionality reduction
data_2d = TSNE(n_components=2, random_state=42).fit_transform(data)

## Plotting the t-SNE results in a 2D scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(data_2d[:, 0], data_2d[:, 1], c=labels, cmap='viridis', s=50,
            alpha=0.7)
plt.colorbar(label='Customer Cluster')
plt.title('Customer Preferences Visualization using t-SNE')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.show()

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a technique used
for reducing the dimensionality of high-dimensional data into a
lower-dimensional space, typically two or three dimensions, to make
visualization easier. It is particularly useful for visualizing data that has
complex underlying structures, such as clusters, by preserving local
similarities. In this task, we used t-SNE to visualize synthetic data that
simulates customer preferences across several factors.
First, we created synthetic high-dimensional data using make_blobs, which
generates clustered data, with each sample belonging to one of four
clusters (customer groups). This simulated the different preferences of
customers, represented in a 10-dimensional feature space.
Next, we applied the t-SNE algorithm from the sklearn.manifold module. By
setting n_components=2, we instructed t-SNE to reduce the data to two
dimensions. The random_state=42 parameter ensures reproducibility by
controlling the random number generator.
In the final step, we used Matplotlib to visualize the t-SNE output. Each
data point corresponds to a customer, and the color indicates which cluster
they belong to. The scatter plot provides an intuitive view of the
relationships between customer preferences. Points that land close together
have similar preferences; the distances between separate clusters, however,
should be read with caution, because t-SNE does not reliably preserve
global distances. This visualization can be useful for segmenting customers
and tailoring marketing strategies based on the identified segments.
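Because t-SNE emphasizes local neighborhoods, its output depends on the perplexity setting, which the exercise leaves at scikit-learn's default of 30. The sketch below is a minimal illustration, assuming a smaller sample than the exercise and two arbitrarily chosen perplexity values, of how to produce embeddings at different settings for side-by-side comparison.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Smaller sample than the exercise, to keep the sketch quick to run
data, labels = make_blobs(n_samples=120, centers=4, n_features=10,
                          random_state=42)

embeddings = {}
for perplexity in (5, 30):  # illustrative values; must be below n_samples
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = tsne.fit_transform(data)
    print(perplexity, embeddings[perplexity].shape)
```

Each entry of `embeddings` can be passed to the same scatter-plot code used in the answer; if the cluster layout changes noticeably between settings, that is a reminder not to over-interpret any single picture.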
【Trivia】
‣ The t-SNE algorithm was introduced by Geoffrey Hinton and Laurens
van der Maaten in 2008.
‣ t-SNE is computationally intensive; for large datasets, it can be quite
slow and may require techniques like approximate t-SNE or GPU acceleration.
‣ Although t-SNE is widely used for visualizing cluster structure, it is not
itself a clustering algorithm, and it doesn’t preserve global distances well,
meaning the relative distances between clusters may not reflect actual
dissimilarities accurately.
17. Polynomial Feature Generation and
Visualization in Machine Learning
Importance★★★★☆
Difficulty★★★☆☆
A real estate company is analyzing the relationship between the square
footage of houses and their selling price. However, they believe that just
using square footage may not capture all the variations in price and that
polynomial features could provide better insights. Your task is to generate
polynomial features of the house sizes and visualize how these features
improve the model's predictions. Use Python's PolynomialFeatures from
sklearn.preprocessing to create second-degree polynomial features and then
visualize the relationship between the house sizes and prices using a scatter
plot. Generate the input data for house sizes and prices programmatically,
and perform the necessary transformations and plotting to help the company
understand this relationship better.
【Data Generation Code Example】
import numpy as np

np.random.seed(0)

## Generate random house sizes between 500 and 3500 square feet
sizes = np.random.randint(500, 3500, 50)

## Simulate prices using a polynomial function of the sizes, adding noise
prices = (150 + 0.05 * sizes + 0.0001 * sizes**2
          + np.random.normal(0, 10000, 50))
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(0)

## Generate random house sizes between 500 and 3500 square feet
sizes = np.random.randint(500, 3500, 50)

## Simulate prices using a polynomial function of the sizes, adding noise
prices = (150 + 0.05 * sizes + 0.0001 * sizes**2
          + np.random.normal(0, 10000, 50))

## Reshape the data to fit the model
sizes = sizes.reshape(-1, 1)

## Create polynomial features
poly = PolynomialFeatures(degree=2)
sizes_poly = poly.fit_transform(sizes)

## Fit the linear regression model with polynomial features
model = LinearRegression().fit(sizes_poly, prices)
predicted_prices = model.predict(sizes_poly)

## Plot the original data and the model's predictions
## (sort by size so the prediction line is drawn left to right)
order = np.argsort(sizes[:, 0])
plt.scatter(sizes, prices, color='blue', label='Actual Prices')
plt.plot(sizes[order], predicted_prices[order], color='red',
         label='Predicted Prices')
plt.title('House Prices vs. Sizes with Polynomial Features')
plt.xlabel('Size (square feet)')
plt.ylabel('Price (USD)')
plt.legend()
plt.show()

In this problem, we aim to explore the use of polynomial features to capture
non-linear relationships between input data (house sizes) and the target
variable (house prices). Polynomial features allow machine learning models
to learn more complex patterns by transforming the original data into a
higher-dimensional space. In this case, we are transforming the house sizes
into second-degree polynomial features, which keeps the original feature
and adds its squared value. This helps in capturing the non-linear effects
of size on price, which may not be well-represented by just the size alone.
We first generate random house sizes using np.random.randint() and simulate
the corresponding prices using a second-degree polynomial equation, adding
some noise to make the data more realistic. Note that with the coefficients
used here the noise (standard deviation 10,000) is much larger than the
trend itself (at most about 1,550), so the scatter will look very noisy
around the fitted curve; a smaller noise scale makes the quadratic trend
easier to see. Then, we reshape the sizes data into the column format the
model expects.
Next, we use PolynomialFeatures(degree=2) from sklearn.preprocessing to
create polynomial features for the house sizes. This transformation converts
the single size column into a three-column array where the first column is
the constant term (bias), the second column is the original size, and the
third column is the squared size.
After transforming the data, we fit a linear regression model using
LinearRegression() from sklearn.linear_model. The model is then used to
predict house prices based on the polynomial features.
Finally, we visualize the relationship between the original house sizes and
the actual prices, as well as the predicted prices, using matplotlib. The
scatter plot displays the actual house prices, while the red line shows the
model's predictions based on the polynomial features. This visualization
helps demonstrate how polynomial features can improve predictions by better
fitting the underlying trend in the data.
【Trivia】
Polynomial regression is especially useful when the relationship between
the independent variable and the dependent variable is non-linear, but there
is no need to switch to a more complex machine learning model. It enhances
the flexibility of simple linear models while maintaining their
interpretability.
18. Custom Loss Function Visualization for a
Manufacturing Optimization Problem
Importance★★★★☆
Difficulty★★★☆☆
A company manufactures products with varying production times and costs.
The management wants to minimize both time and cost using a custom loss
function. Your task is to create a neural network model that predicts product
production time based on features like material quality, machine efficiency,
and worker skill. The loss function should penalize errors in predicting
production time and cost, with more weight on reducing production time.
Create a custom loss function that balances these two aspects and
visualize the learning process. The chart should show the training and
validation loss over time. Use a dataset of 100 products with random values
for material quality, machine efficiency, and worker skill as features.
【Data Generation Code Example】
import numpy as np

np.random.seed(42)

# Features: material quality, machine efficiency, worker skill
X = np.random.rand(100, 3)
y_time = np.random.rand(100) * 10   # Production time
y_cost = np.random.rand(100) * 500  # Production cost
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Generate synthetic data for the task
np.random.seed(42)
X = np.random.rand(100, 3)
y_time = np.random.rand(100) * 10   # Production time
y_cost = np.random.rand(100) * 500  # Production cost

# Stack both targets so the loss sees time (column 0) and cost (column 1)
y = np.column_stack([y_time, y_cost])

# Train-validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

# Custom loss function combining time and cost
def custom_loss(y_true, y_pred):
    y_true = tf.cast(y_true, y_pred.dtype)
    # Mean squared error for production time
    time_loss = tf.reduce_mean(tf.square(y_true[:, 0] - y_pred[:, 0]))
    # Mean squared error for production cost, given a lower weight below
    cost_loss = tf.reduce_mean(tf.square(y_true[:, 1] - y_pred[:, 1]))
    return time_loss + 0.1 * cost_loss

# Define the model
model = Sequential([
    Input(shape=(3,)),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(2)  # Output layer: predicted production time and cost
])

# Compile the model using the custom loss function
model.compile(optimizer='adam', loss=custom_loss)

# Train the model
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=50, verbose=0)

# Plot the training and validation loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Custom Loss Function - Training vs Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In this exercise, we define a custom loss function to balance two goals:
minimizing the error in predicted production time and in predicted
production cost. This is a realistic scenario for a manufacturing company,
where both time and cost efficiency are critical. However, more importance
is placed on time, so we assign a higher weight to the time error than to
the cost error.
The model is a simple neural network built using the TensorFlow/Keras
framework. It takes three input features (material quality, machine
efficiency, and worker skill) and outputs two values: the predicted
production time and the predicted production cost. Both targets are stacked
into one array so that the loss function receives them together; a loss
that only ever sees the time target cannot penalize cost errors at all.
We use mean squared error (MSE) for both time and cost but apply a smaller
weight to the cost part of the loss function (10% of the weight). Keep in
mind that cost values here range up to 500 while time values range up to
10, so the raw cost error can still dominate the total despite its smaller
weight; scaling both targets to a similar range before training makes the
weights easier to interpret.
The code trains the model over 50 epochs, during which the loss function is
optimized. Finally, a plot is generated to show how the training and
validation loss evolve over time, allowing us to visually assess the
model's learning process and whether it generalizes well to unseen data.
This setup teaches the reader how to create and apply custom loss
functions, a key skill in tailoring machine learning models to specific
business needs.
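To see what a weighted time-and-cost loss of this kind computes, the arithmetic can be worked through in plain NumPy. This is a minimal sketch with made-up numbers; the function name weighted_time_cost_loss and the sample values are assumptions for illustration and are not part of the exercise's Keras code.

```python
import numpy as np

def weighted_time_cost_loss(y_true, y_pred, cost_weight=0.1):
    """MSE on time (column 0) plus cost_weight * MSE on cost (column 1)."""
    time_loss = np.mean((y_true[:, 0] - y_pred[:, 0]) ** 2)
    cost_loss = np.mean((y_true[:, 1] - y_pred[:, 1]) ** 2)
    return time_loss + cost_weight * cost_loss

# Hypothetical targets and predictions: columns are (time, cost)
y_true = np.array([[2.0, 100.0], [4.0, 200.0]])
y_pred = np.array([[3.0, 110.0], [4.0, 180.0]])

# Time errors are 1 and 0, so the time MSE is 0.5.
# Cost errors are 10 and 20, so the cost MSE is 250.
loss = weighted_time_cost_loss(y_true, y_pred)
print(loss)  # 0.5 + 0.1 * 250
```

Even with a weight of 0.1, the cost term contributes 25 of the total here, which shows why the scale of each target matters as much as the weight assigned to it.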
【Trivia】
Custom loss functions are especially useful in machine learning when
standard metrics (like mean squared error or cross-entropy) do not fully
capture the real-world costs or trade-offs a business might face. For
example, in industries like healthcare, manufacturing, or finance, custom
loss functions are often crafted to prioritize the most critical factors.
19. Cross-Validation Results Visualization in
Machine Learning
Importance★★★★☆
Difficulty★★★☆☆
A company wants to ensure that the machine learning model they are
developing is reliable for predicting customer churn. You are tasked with
evaluating the model's performance using k-fold cross-validation and
visualizing the results. The model to be evaluated is a decision tree
classifier. The company is interested in seeing how consistent the model's
performance is across different cross-validation folds. Generate synthetic
data for this task. Write a Python program to:
‣ Generate sample data where the features represent customer activity and
the target is whether the customer churns.
‣ Perform k-fold cross-validation (with k=5).
‣ Visualize the cross-validation scores using a line plot to show the
variation in performance across folds.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=5,
n_informative=3, n_classes=2, random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

from sklearn.datasets import make_classification

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import cross_val_score

import matplotlib.pyplot as plt

# Create synthetic data

X, y = make_classification(n_samples=100, n_features=5,
n_informative=3, n_classes=2, random_state=42)
# Initialize the decision tree classifier

clf = DecisionTreeClassifier(random_state=42)

# Perform k-fold cross-validation with 5 folds

cv_scores = cross_val_score(clf, X, y, cv=5)

# Visualize the cross-validation results

plt.plot(range(1, 6), cv_scores, marker='o', linestyle='-', color='b')

plt.title("Cross-Validation Scores per Fold")

plt.xlabel("Fold Number")

plt.ylabel("Accuracy Score")

plt.ylim(0.0, 1.0)

plt.grid(True)

plt.show()

In machine learning, cross-validation is a critical technique to assess the
generalization performance of a model. It is particularly useful when the
dataset is small, as it gives a more reliable estimate of how the model
will perform on unseen data than a single train-test split.
In this exercise, we generated synthetic data using the make_classification
function, which simulates a binary classification problem. The data consists
of features representing customer activity and a target variable indicating
whether a customer churned. Next, we used a decision tree classifier to
model this data.
To evaluate the classifier, we applied 5-fold cross-validation. This
technique divides the dataset into five roughly equal parts, trains the
model on four of these parts, and tests it on the remaining part. This
process is repeated five times, ensuring that each part of the data gets
tested once. For classifiers, cross_val_score with an integer cv uses
stratified folds, so each fold keeps approximately the same class balance
as the full dataset.
The results of each fold (accuracy scores) are stored and then plotted. The
line plot shows the accuracy across the folds, which gives us insight into
how consistently the model performs. If the scores vary significantly
between folds, it may indicate that the model's performance is unstable,
and further tuning or a different model might be necessary.
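The fold-by-fold procedure described above can be spelled out as an explicit loop. The following is a minimal sketch, assuming a fixed random_state on the tree (so that repeated fits are identical) and unshuffled stratified folds, which is what cross_val_score uses for a classifier when cv is an integer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=5,
                           n_informative=3, n_classes=2, random_state=42)
clf = DecisionTreeClassifier(random_state=0)  # fixed seed so runs match

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Train a fresh copy on four folds, then score it on the held-out fold
    fold_model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(clf, X, y, cv=5)
print(manual_scores, manual_scores.mean(), manual_scores.std())
```

Reporting the mean together with the standard deviation of the fold scores is a compact way to summarize both the typical accuracy and how much it varies between folds.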
【Trivia】
Cross-validation is often used in model selection, where it helps to compare
the performance of multiple models and pick the one that generalizes best.
20. Visualizing Decision Boundaries of Machine
Learning Classifiers
Importance★★★★☆
Difficulty★★★☆☆
A company is developing a machine learning model to classify two
different types of customer behavior based on their website usage data. Your
task is to train a classifier that can differentiate between two types of
customer behaviors (Class A and Class B) based on two features:
time_spent and clicks. Once the classifier is trained, visualize the decision
boundary of the model to understand how it separates the two
behaviors. You should generate some sample data and visualize the decision
boundary of two different classifiers:
‣ A Support Vector Machine (SVM)
‣ A K-Nearest Neighbors (KNN) classifier
【Data Generation Code Example】
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from matplotlib.colors import ListedColormap

# Generate synthetic classification data
X, y = make_classification(n_samples=100, n_features=2,
                           n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=42)

# Define the classifiers
classifiers = [SVC(kernel='linear'),
               KNeighborsClassifier(n_neighbors=5)]
classifier_names = ['SVM', 'KNN']

# Create meshgrid for plotting
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

# Plot decision boundaries
plt.figure(figsize=(10, 5))
cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#0000FF'])

for idx, clf in enumerate(classifiers):
    plt.subplot(1, 2, idx + 1)
    clf.fit(X, y)
    # Predict a class for every grid point, then restore the grid shape
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=cmap_light)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor='k',
                s=20)
    plt.title(f"Decision Boundary ({classifier_names[idx]})")
    plt.xlabel('Time Spent')
    plt.ylabel('Clicks')

plt.tight_layout()
plt.show()

The code starts by importing necessary libraries: numpy for numerical
operations, matplotlib for plotting, and sklearn modules for data generation
and classifiers.
We use make_classification to generate a synthetic dataset with two
informative features, which stand in for time_spent and clicks. These
features represent customer behavior data. The y variable contains labels,
which are used to differentiate two types of customer behaviors: Class A
and Class B.
For visualization, we define two classifiers: a Support Vector Machine
(SVM) and a K-Nearest Neighbors (KNN) classifier. The SVM, configured with
a linear kernel, aims to find a linear decision boundary, while the KNN
uses a non-linear approach based on neighborhood votes.
Next, we create a meshgrid to plot the decision boundary over the feature
space. The meshgrid is essential for visualizing the boundaries as it
provides a fine-grained grid of points where the classifiers will predict
labels.
For each classifier, we fit it on the data and predict the labels for the
meshgrid points. These predicted labels are then used to plot the decision
regions using contourf. The scatter plot overlays the original data points
to show where the true classes lie relative to the boundary.
In this setup, SVM is expected to draw a straight line to separate the two
classes, while KNN will likely produce a more jagged decision boundary due
to its neighborhood-based nature.
The objective of this exercise is to help you understand how different
classifiers divide the feature space into regions associated with different
classes. Decision boundaries are important in understanding how well a
model generalizes to unseen data.
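The meshgrid step is easier to follow on a grid small enough to print. The sketch below is a minimal illustration: toy_predict is a hypothetical stand-in for a fitted classifier's predict method, and the 3-by-2 grid replaces the fine 0.02-step grid used in the exercise.

```python
import numpy as np

# A tiny grid: x takes values 0, 1, 2 and y takes values 0, 1
xx, yy = np.meshgrid(np.arange(0, 3), np.arange(0, 2))

# Flatten both grids and pair them up: one (x, y) row per grid node
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Hypothetical stand-in for clf.predict: class 1 where x >= 1, else class 0
def toy_predict(points):
    return (points[:, 0] >= 1).astype(int)

# Reshape the flat predictions back into the grid layout for contourf
Z = toy_predict(grid_points).reshape(xx.shape)

print(grid_points.shape)
print(Z)
```

Here `Z` has the same shape as `xx` and `yy`, so each entry is the predicted class at the matching grid location, which is the form contourf needs to color the regions.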
【Trivia】
SVMs are known for their ability to find the maximum margin hyperplane
between classes, making them effective even with a small number of
samples. KNN, on the other hand, is simple but can struggle with large
datasets due to its computational cost.
21. Gradient Descent Optimization Path
Visualization in Machine Learning
Importance★★★★☆
Difficulty★★★☆☆
Imagine you are working for a company that is developing a machine
learning model to predict housing prices based on various features such as
the number of bedrooms, square footage, and location. To train this model,
you want to ensure that the gradient descent optimization algorithm is
working properly by visualizing its path during training. You are required to
create synthetic data for house prices based on a simple linear
equation. Next, implement gradient descent optimization to minimize the
loss function and visualize how the algorithm converges to the minimum
point by plotting the path taken by gradient descent. The goal is to
understand the optimization process, so ensure that the model’s weights
(parameters) are updated correctly and visualize the path of
optimization. Create a plot that shows the cost function value (MSE) and the
optimization path over iterations.
【Data Generation Code Example】
import numpy as np

X = np.linspace(1, 10, 100)

y = 2 * X + 1 + np.random.randn(100) * 2
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)  # Fix the seed so the noise is reproducible
X = np.linspace(1, 10, 100)
y = 2 * X + 1 + np.random.randn(100) * 2

# Normalize the features for faster gradient descent convergence
X_norm = (X - np.mean(X)) / np.std(X)
y_norm = (y - np.mean(y)) / np.std(y)

# Initialize parameters (weights)
theta0 = 0
theta1 = 0
learning_rate = 0.1
iterations = 50

# Store cost and theta values for each iteration
cost_history = []
theta_history = []

# Define the cost function (half the mean squared error)
def compute_cost(X, y, theta0, theta1):
    return np.mean((theta0 + theta1 * X - y) ** 2) / 2

# Perform gradient descent
for i in range(iterations):
    predictions = theta0 + theta1 * X_norm
    error = predictions - y_norm
    theta0 -= learning_rate * np.mean(error)
    theta1 -= learning_rate * np.mean(error * X_norm)
    cost = compute_cost(X_norm, y_norm, theta0, theta1)
    cost_history.append(cost)
    theta_history.append((theta0, theta1))

# Convert theta history to arrays for plotting
theta_history = np.array(theta_history)

# Plot the cost function and optimization path
plt.figure(figsize=(10, 5))

# Plot 1: Cost function over iterations
plt.subplot(1, 2, 1)
plt.plot(cost_history, label='Cost (MSE)')
plt.title('Cost Function vs. Iterations')
plt.xlabel('Iterations')
plt.ylabel('Cost (MSE)')
plt.legend()

# Plot 2: Path of theta1 over iterations
plt.subplot(1, 2, 2)
plt.plot(theta_history[:, 1], label='theta1 path')
plt.title('Gradient Descent Optimization Path')
plt.xlabel('Iterations')
plt.ylabel('theta1')
plt.legend()

plt.tight_layout()
plt.show()

This problem focuses on understanding how gradient descent optimizes the
cost function to minimize the error between predicted and actual values.
In machine learning, gradient descent is a common optimization algorithm
used to minimize a loss function, typically the mean squared error (MSE) in
regression tasks.
The data here is synthetic, generated based on a linear relationship between
a feature X (such as the number of bedrooms or house size) and the target y
(price). Noise is added to y to simulate real-world variations.
To make gradient descent more effective, we normalize X and y, ensuring
that the optimization process converges faster.
The model's parameters (theta0 and theta1) represent the intercept and slope
of the line we're trying to fit.
The cost function quantifies how far off the model's predictions are from
the actual data points, and gradient descent iteratively updates the
parameters to reduce this cost.
In each iteration, we calculate the error and adjust the parameters using the
learning rate.
The first plot shows the cost (error) over time, while the second plot
visualizes the changes in the parameter theta1 as it moves toward the
optimal value, demonstrating the path taken by gradient descent.
This exercise is critical for understanding optimization in machine learning
models and how the parameters adjust during training.
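One way to check that the updates are working is to compare the result with a value known in advance. When both X and y are standardized as in this exercise, the least-squares slope equals the Pearson correlation between X and y and the intercept is zero. The sketch below is a minimal check under that setup; the random generator seed and the 300 iterations (more than the exercise's 50) are choices made here so that the parameters converge tightly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 10, 100)
y = 2 * X + 1 + rng.normal(0, 2, 100)

# Standardize both variables, as in the exercise
X_norm = (X - X.mean()) / X.std()
y_norm = (y - y.mean()) / y.std()

theta0, theta1 = 0.0, 0.0
learning_rate = 0.1
for _ in range(300):  # more iterations than the exercise, to converge tightly
    error = theta0 + theta1 * X_norm - y_norm
    grad0 = np.mean(error)            # derivative of the cost w.r.t. theta0
    grad1 = np.mean(error * X_norm)   # derivative of the cost w.r.t. theta1
    theta0 -= learning_rate * grad0
    theta1 -= learning_rate * grad1

r = np.corrcoef(X, y)[0, 1]  # Pearson correlation of the raw data
print(theta0, theta1, r)
```

If theta1 settles at the correlation coefficient and theta0 stays at zero, the gradient and update steps are implemented consistently with the cost function.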
【Trivia】
Gradient descent was first introduced by French mathematician Augustin-
Louis Cauchy in 1847! Though initially not related to machine learning, its
use in optimization became central in training neural networks and
regression models in modern data science.
22. Clustering Evaluation Using Silhouette Score
and Visualization
Importance★★★★☆
Difficulty★★★☆☆
A retail company has collected customer purchasing behavior data, and they
want to segment the customers into distinct groups for targeted marketing
campaigns. You are asked to apply clustering to group the customers based
on their purchasing patterns and evaluate the quality of the clusters using
the Silhouette Score. Visualize the clustering result and the Silhouette Score
to assess the performance of the clustering model. Use KMeans to perform
the clustering and calculate the Silhouette Score. Please generate synthetic
customer purchasing data within your solution.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_blobs

#Generate synthetic data with 3 centers representing 3 customer groups


data, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0,
random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.cm as cm

# Generate synthetic data
data, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0,
                     random_state=42)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(data)

# Calculate the Silhouette Score and samples
silhouette_avg = silhouette_score(data, labels)
sample_silhouette_values = silhouette_samples(data, labels)

# Set up figure for Silhouette Score visualization
fig, (ax1, ax) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)

# First plot: Silhouette Score for each sample
y_lower = 10
for i in range(3):
    ith_cluster_silhouette_values = sample_silhouette_values[labels == i]
    ith_cluster_silhouette_values.sort()
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
    color = cm.nipy_spectral(float(i) / 3)
    ax1.fill_betweenx(np.arange(y_lower, y_upper), 0,
                      ith_cluster_silhouette_values, facecolor=color,
                      edgecolor=color, alpha=0.7)
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    y_lower = y_upper + 10

# Plot silhouette score line
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("Silhouette coefficient values")
ax1.set_ylabel("Cluster label")
ax1.set_yticks([])
ax1.set_xticks(np.arange(-0.1, 1.1, 0.2))

# Second plot: Cluster visualization
colors = cm.nipy_spectral(labels.astype(float) / 3)
ax.scatter(data[:, 0], data[:, 1], marker=".", s=30, lw=0, alpha=0.7,
           c=colors, edgecolor="k")

# Plot the centroids
centers = kmeans.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], marker="o", c="white", alpha=1,
           s=200, edgecolor="k")
for i, c in enumerate(centers):
    ax.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")

ax.set_title("Clustered data visualization with KMeans.")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
plt.show()

This task involves clustering customer data using KMeans and evaluating
the clustering quality through the Silhouette Score.
We first generated synthetic data representing customer purchasing patterns
using the make_blobs function, which creates clusters of data points with
specific centers and variations.
KMeans is a popular clustering algorithm that groups data points by
minimizing the squared distance between each point and its cluster
centroid. In this case, we specified that the data should be grouped into 3
clusters. The fit_predict function was used to assign each data point to one
of the clusters.
To evaluate the quality of the clustering, we calculated the Silhouette
Score. The silhouette coefficient measures how similar a data point is to
its own cluster compared to the nearest other cluster. A value near 1
indicates that the point is well matched to its cluster, a value close to 0
suggests that it lies near the boundary between two clusters, and a negative
value suggests that it may have been assigned to the wrong cluster.
The visualization consisted of two parts:
‣ The first plot visualizes the silhouette coefficients for each data point,
allowing us to assess the consistency of each cluster.
‣ The second plot shows the clustered data in two dimensions, with each
point color-coded according to its assigned cluster and the centroids marked.
This visualization helps us understand how well the clusters are formed and
whether they overlap. The red dashed line in the silhouette plot represents
the average silhouette score, providing a global measure of clustering
quality.
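The relationship between the per-point coefficients and the average score can be checked numerically. The following is a minimal sketch, assuming a made-up three-blob dataset as a stand-in for the customer data (the sample count, cluster_std, and seeds are illustrative choices, not values from the exercise):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical stand-in for the customer data: three blobs in two dimensions
data, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(data)

# One coefficient per point; the overall score is the mean of these values
sample_values = silhouette_samples(data, labels)
silhouette_avg = silhouette_score(data, labels)

print(f"Average silhouette score: {silhouette_avg:.3f}")
for k in range(3):
    cluster_mean = sample_values[labels == k].mean()
    print(f"Cluster {k}: mean coefficient {cluster_mean:.3f}")
```

Comparing the per-cluster means with the average is a quick way to spot a cluster that is noticeably less cohesive than the others.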
【Trivia】
The Silhouette Score ranges from -1 to 1, where a score close to 1 indicates
that clusters are well-separated, and a score close to -1 suggests that data
points may be wrongly clustered.
23. Comparison of Min-Max Scaling and
Standardization in Machine Learning
Importance★★★★☆
Difficulty★★★☆☆
A retail company has a dataset containing the weekly sales and the number
of customers for each of their stores.
They want to analyze this data using machine learning models to predict
sales, but they are unsure about the best scaling method to use for
preprocessing.
Your task is to:
‣ Generate a synthetic dataset representing weekly sales (in thousands of
dollars) and the number of customers.
‣ Apply both Min-Max Scaling and Standardization to the dataset.
‣ Plot the scaled data for each feature under both scaling methods.
By visualizing this, the company hopes to understand how these scaling
techniques affect the distribution of the data and which might be better for
predictive modeling.

【Data Generation Code Example】


import numpy as np

import pandas as pd

## Generate synthetic data for weekly sales and customer numbers

data = {'sales': np.random.randint(100, 1000, 50),
        'customers': np.random.randint(500, 5000, 50)}
df = pd.DataFrame(data)
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

## Generate synthetic data for weekly sales and customer numbers
data = {'sales': np.random.randint(100, 1000, 50),
        'customers': np.random.randint(500, 5000, 50)}
df = pd.DataFrame(data)

## Apply Min-Max Scaling
min_max_scaler = MinMaxScaler()
min_max_scaled = min_max_scaler.fit_transform(df)

## Apply Standardization
standard_scaler = StandardScaler()
standard_scaled = standard_scaler.fit_transform(df)

## Plot the scaled data
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

ax[0].scatter(min_max_scaled[:, 0], min_max_scaled[:, 1], c='blue',
              label='Min-Max Scaled Data')
ax[0].set_title('Min-Max Scaling')
ax[0].set_xlabel('Sales')
ax[0].set_ylabel('Customers')
ax[0].legend()

ax[1].scatter(standard_scaled[:, 0], standard_scaled[:, 1], c='red',
              label='Standardized Data')
ax[1].set_title('Standardization')
ax[1].set_xlabel('Sales')
ax[1].set_ylabel('Customers')
ax[1].legend()

plt.tight_layout()
plt.show()

Min-Max Scaling and Standardization matter in machine learning because many
models are sensitive to the scale of their input features.
Min-Max Scaling linearly maps each feature to a fixed range, often [0, 1].
It preserves the shape of the distribution, but because the range is set by
the minimum and maximum, a single outlier can squeeze the remaining values
into a narrow band. This technique is suitable when features have a known,
bounded range and the model benefits from bounded inputs, as neural
networks often do.
In contrast, Standardization shifts each feature's mean to zero and rescales
its standard deviation to one. It does not make the data Gaussian; it only
centers and rescales. It is a common default for scale-sensitive methods
such as k-nearest neighbors, support vector machines, and regularized linear
models.
To implement these, the code uses the MinMaxScaler and StandardScaler
from sklearn.preprocessing.
After generating a synthetic dataset, we create scaler objects and call
fit_transform on the DataFrame, which scales each column independently.
The plot compares the data under Min-Max Scaling and Standardization.
Because both methods are linear transformations applied per feature, the
point cloud has the same shape in each panel; what differs is the axis
range: [0, 1] for Min-Max Scaling, and values centered on zero (roughly -2
to 2 here) for Standardization.
This visual guide assists in selecting a method based on the data’s
characteristics and model requirements.
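Both transformations are simple per-column formulas: Min-Max Scaling computes (x - min) / (max - min), and Standardization computes (x - mean) / std. As a minimal sketch, assuming a made-up array with the same value ranges as the exercise, the manual formulas can be compared against the scikit-learn scalers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(100, 1000, size=50),   # sales (illustrative values)
    rng.integers(500, 5000, size=50),   # customers (illustrative values)
]).astype(float)

# Min-Max Scaling: (x - min) / (max - min), computed per column
manual_min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, per column, using the population std
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)

sk_min_max = MinMaxScaler().fit_transform(X)
sk_standard = StandardScaler().fit_transform(X)

print("Min-Max matches sklearn:     ", np.allclose(manual_min_max, sk_min_max))
print("Standardized matches sklearn:", np.allclose(manual_standard, sk_standard))
```

Note that StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default for std.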
【Trivia】
‣ The choice of scaling method often depends on the model used; for
example, decision trees are unaffected by feature scaling as they split data
based on feature thresholds rather than distance.
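‣ When working with sparse data, centering destroys sparsity, so a scaler
that leaves zeros at zero is preferred. Min-Max Scaling does this only when
a feature's minimum is already zero; scikit-learn's MaxAbsScaler, which
divides each feature by its maximum absolute value, always preserves zeros.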
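Whether a scaler keeps zeros at zero can be checked directly. The sketch below uses a small made-up matrix (one column contains a negative value, so its minimum is not zero) and compares MinMaxScaler with scikit-learn's MaxAbsScaler, which divides each column by its maximum absolute value:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

# Made-up data: half of the entries are zero
X = np.array([[-2.0, 0.0],
              [ 0.0, 3.0],
              [ 4.0, 0.0],
              [ 0.0, 6.0]])

min_max_scaled = MinMaxScaler().fit_transform(X)
max_abs_scaled = MaxAbsScaler().fit_transform(X)

print("Zero pattern kept by MinMaxScaler:",
      np.array_equal(min_max_scaled == 0, X == 0))
print("Zero pattern kept by MaxAbsScaler:",
      np.array_equal(max_abs_scaled == 0, X == 0))

# MaxAbsScaler also accepts a SciPy sparse matrix and returns a sparse result
X_sparse = sparse.csr_matrix(X)
scaled_sparse = MaxAbsScaler().fit_transform(X_sparse)
print("Stored non-zeros before/after:", X_sparse.nnz, scaled_sparse.nnz)
```

In the first column the minimum is -2, so Min-Max Scaling maps the zeros to a non-zero value, while the second column (minimum 0) keeps its zeros.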
24. Visualizing Model Learning Curves in Python
Importance★★★★☆
Difficulty★★★☆☆
A real estate company wants to develop a model to predict housing prices
based on various features like area, number of bedrooms, and age of the
house.
To ensure the model is performing well and avoiding overfitting or
underfitting, they need to visualize the learning curves for both training
and validation sets.
Your task is to create sample data and train a machine learning model, then
plot the learning curves to visualize how well the model performs as it is
trained on increasing amounts of data.
Please include both training and validation learning curves in your plot.
【Data Generation Code Example】
import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split


from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

np.random.seed(0)
n_samples = 200

X = np.random.rand(n_samples, 3) * 100

y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 2] + np.random.randn(n_samples) * 10


【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt

np.random.seed(0)
X = np.random.rand(200, 3) * 100

y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 2] + np.random.randn(200) * 10

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,


random_state=42)

train_errors = [mean_squared_error(y_train[:m],
LinearRegression().fit(X_train[:m], y_train[:m]).predict(X_train[:m])) for
m in range(1, len(y_train)+1)]

val_errors = [mean_squared_error(y_val,
LinearRegression().fit(X_train[:m], y_train[:m]).predict(X_val)) for m in
range(1, len(y_train)+1)]

plt.plot(range(1, len(y_train)+1), train_errors, label='Training Error')

plt.plot(range(1, len(y_train)+1), val_errors, label='Validation Error')

plt.xlabel('Training Set Size')

plt.ylabel('Mean Squared Error')


plt.title('Learning Curves')

plt.legend()

plt.show()

To start, we create synthetic data representing house prices based on three
features (e.g., area, bedrooms, age). These features are generated randomly,
and a linear relationship with some noise is applied to simulate housing
prices.
The train_test_split function is then used to split the data into training
and validation sets, which lets us evaluate the model on data it was not
fitted to.
For the learning curve, we train a fresh linear regression model on
increasing portions of the training set. For each subset size, we calculate
the mean squared error (MSE) on both the training subset and the validation
set. The training error typically starts near zero, because a handful of
points can be fitted almost exactly, and rises as more data is added, while
the validation error typically falls; both tend to level off near the noise
level of the data. The first few points of the curve, where there are fewer
samples than model parameters, can show very large validation errors.
Finally, both training and validation errors are plotted against the
training set size to visualize the model's performance. The plot helps
identify overfitting or underfitting based on the gap and trend between the
two curves:
‣ A small gap with both errors converging to a low value indicates good
performance.
‣ Both errors converging to a high value, with only a small gap between
them, indicates underfitting.
‣ A large, persistent gap with low training error but high validation error
suggests overfitting.
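The same kind of curve can also be computed with scikit-learn's learning_curve helper, which handles the subsetting and cross-validation. This is a minimal sketch reusing the exercise's synthetic data; the number of sizes (8) and folds (5) are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

np.random.seed(0)
X = np.random.rand(200, 3) * 100
y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 2] + np.random.randn(200) * 10

train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE, so flip the sign and average over the folds
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mse, val_mse):
    print(f"n={n:3d}  train MSE={tr:8.2f}  validation MSE={va:8.2f}")
```

Averaging over folds gives a smoother curve than a single train/validation split, at the cost of fitting the model several times per training size.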
【Trivia】
‣ Learning curves can be useful to compare different models and
algorithms, as they visually show which model performs better with
different data sizes.
‣ Overfitting can be mitigated by collecting more data or using
regularization techniques, while underfitting can often be addressed by
increasing model complexity.
‣ The choice of performance metric (e.g., MSE, accuracy) in the learning
curve can vary based on the specific problem and model being used.
25. Class Distribution Visualization Before and
After Applying SMOTE
Importance★★★★★
Difficulty★★★☆☆
A retail company is analyzing customer churn, and they notice a class
imbalance in the dataset where far fewer customers churn compared to
those who stay.
You have been tasked with solving this issue using Synthetic Minority
Over-sampling Technique (SMOTE) and visualizing the class distribution
before and after balancing the dataset.
Create a classification problem dataset with an imbalance in the target
variable, then apply SMOTE to balance it. Visualize both the class
distributions before and after SMOTE using bar charts.
Ensure that the data generation is part of the solution and no external
dataset is used.
【Data Generation Code Example】
from sklearn.datasets import make_classification

from collections import Counter

from imblearn.over_sampling import SMOTE

import matplotlib.pyplot as plt

# Generate imbalanced data

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],


n_informative=3, n_redundant=1, flip_y=0,

n_features=5, n_clusters_per_class=1, n_samples=1000,


random_state=42)
【Diagram Answer】

【Code Answer】
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification

from collections import Counter

from imblearn.over_sampling import SMOTE

# Generate imbalanced data

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],


n_informative=3, n_redundant=1, flip_y=0,
n_features=5, n_clusters_per_class=1,
n_samples=1000, random_state=42)

# Plot initial class distribution

counter_before = Counter(y)

plt.bar(counter_before.keys(), counter_before.values())

plt.title("Class Distribution Before SMOTE")

plt.xlabel("Class")

plt.ylabel("Frequency")

plt.show()

# Apply SMOTE to balance the dataset


smote = SMOTE(random_state=42)

X_resampled, y_resampled = smote.fit_resample(X, y)

# Plot class distribution after SMOTE


counter_after = Counter(y_resampled)

plt.bar(counter_after.keys(), counter_after.values())

plt.title("Class Distribution After SMOTE")

plt.xlabel("Class")

plt.ylabel("Frequency")

plt.show()

In this task, we aim to visualize class imbalance before and after applying
SMOTE (Synthetic Minority Over-sampling Technique).
This technique is commonly used in classification problems where one class
is underrepresented compared to another, which can skew model training.
We first generate synthetic data using the make_classification function from
sklearn.datasets, which lets us specify an imbalanced class distribution
through the weights parameter.
Here, we set 90% of the samples to belong to class 0 (non-churn) and only
10% to class 1 (churn), creating a highly imbalanced dataset.
To see the effect of this imbalance, we use the Counter class from Python’s
collections module to count the instances of each class and visualize the
distribution with a simple bar chart using matplotlib.
This initial plot clearly shows the class imbalance in the target variable.
Next, we apply SMOTE, which generates synthetic examples of the minority
class (in this case, class 1) to balance the dataset.
SMOTE creates each new example by interpolating between an existing
minority-class example and one of its nearest minority-class neighbors, so
the synthetic points stay within the region of feature space that the
minority class already occupies.
After applying SMOTE, we again use Counter to count the new class
distribution and plot the result.
The second bar chart shows that the class distribution has been equalized:
with the default settings, the minority class is oversampled until both
classes have the same number of samples.
SMOTE is particularly useful where class imbalance leads to biased
predictions and poor performance on the minority class.
Balancing the training data can help the model learn to predict both classes
more accurately. In practice, resampling should be applied only to the
training split, so that evaluation still reflects the real class
distribution.
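The idea of creating synthetic minority examples between existing ones can be sketched in a few lines of NumPy. This is an illustrative simplification, not the imblearn implementation; the function name smote_like_samples, the neighbor count k, and the made-up minority data are all assumptions for the example:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Illustrative SMOTE-style interpolation (not the imblearn implementation)."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_new)   # random minority points
    pick = rng.integers(1, k + 1, size=n_new)              # skip column 0 (the point itself)
    neighbor = neighbor_idx[base, pick]
    lam = rng.random((n_new, 1))                           # interpolation weight in [0, 1)

    # New point lies on the segment between a minority point and its neighbour
    return X_minority[base] + lam * (X_minority[neighbor] - X_minority[base])

rng = np.random.default_rng(42)
X_min = rng.normal(loc=2.0, scale=1.0, size=(20, 2))   # hypothetical minority class
X_new = smote_like_samples(X_min, n_new=80)
print("Synthetic samples:", X_new.shape)
```

Because every synthetic point is a convex combination of two real minority points, none of them can fall outside the bounding box of the original minority samples.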
【Trivia】
Did you know that SMOTE isn't the only technique to handle imbalanced
datasets?
Other methods include undersampling the majority class or using ensemble
techniques like Random Forest with class weights.
Each method has its pros and cons, depending on the dataset size and the
nature of the problem you're solving!
26. Visualizing Feature Importance in Machine
Learning Models
Importance★★★★☆
Difficulty★★★☆☆
Imagine that you are working as a data analyst for a retail company, and
your task is to help improve customer retention by identifying key factors
that influence customer spending. You have data on various customer
attributes, such as age, income, number of visits, and previous purchases.
Your goal is to determine which factors are the most important in predicting
customer spending to better target marketing efforts.
Your task is to:
‣ Generate a sample dataset that includes customer spending and associated
attributes.
‣ Train a decision tree regressor model on this dataset to predict customer
spending.
‣ Visualize the feature importance using a bar chart to showcase which
attributes have the most impact on customer spending.
You need to write code to generate the data, train the model, and create a
visualization of the feature importance. The visualization should clearly
show the ranking of each feature based on its importance in predicting
spending.
【Data Generation Code Example】
import numpy as np

import pandas as pd

## Generate a dataset with 100 samples and relevant features

np.random.seed(42)

data = pd.DataFrame({

'Age': np.random.randint(18, 65, 100),

'Income': np.random.randint(30000, 120000, 100),

'Visits': np.random.randint(1, 30, 100),


'Previous_Purchases': np.random.randint(0, 50, 100),

'Spending': np.random.randint(100, 2000, 100)

})
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

from sklearn.tree import DecisionTreeRegressor

import matplotlib.pyplot as plt


## Generate a dataset with 100 samples and relevant features

np.random.seed(42)

data = pd.DataFrame({

'Age': np.random.randint(18, 65, 100),


'Income': np.random.randint(30000, 120000, 100),

'Visits': np.random.randint(1, 30, 100),

'Previous_Purchases': np.random.randint(0, 50, 100),

'Spending': np.random.randint(100, 2000, 100)

})

## Split data into features and target

X = data.drop('Spending', axis=1)

y = data['Spending']

## Train Decision Tree Regressor model

model = DecisionTreeRegressor()

model.fit(X, y)

## Get feature importance and visualize


feature_importance = model.feature_importances_

features = X.columns

plt.barh(features, feature_importance)

plt.xlabel('Importance')

plt.title('Feature Importance for Predicting Customer Spending')

plt.show()

In this exercise, you visualize the importance of different features in a
dataset by training a machine learning model and displaying the results in a
bar chart. Here’s a breakdown of the main steps involved:
‣ Data Generation: The sample dataset consists of 100 customers and several
attributes: age, income, number of visits, and previous purchases. These
attributes serve as features, and customer spending is our target variable.
Using the numpy and pandas libraries, we generated random data to stand in
for a retail dataset. Note that Spending is drawn independently of the other
columns here, so the importances the tree reports reflect chance patterns in
the sample rather than real effects.
‣ Data Preparation: We separate our features (Age, Income, Visits,
Previous_Purchases) from the target variable (Spending). This is essential
for training the model, as the target variable represents the value we want
to predict.
‣ Model Training: The DecisionTreeRegressor is used to fit our dataset. This
model is well-suited for understanding feature importance because it splits
the data on the feature that most reduces error at each step.
‣ Feature Importance Extraction: Once the model is trained, the
feature_importances_ attribute provides each feature's relative importance:
its share of the total error reduction achieved by the tree's splits. The
values sum to 1, and higher values indicate features the tree relied on
more.
‣ Visualization: The final step is to plot the feature importances using a
horizontal bar chart, created with plt.barh() from matplotlib. This chart
provides a visual ranking of features, and the labels and title make it easy
to interpret.
This type of analysis is valuable in many industries as it allows data
professionals to focus on the most impactful factors, thus helping optimize
decision-making, marketing strategies, and overall customer retention.
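Feature importances are easier to interpret when the target actually depends on the features. The sketch below is a variation on the exercise in which Spending is built from Income and Visits through an assumed formula (the coefficients, noise level, sample size, and max_depth are made up for illustration), so the expected ranking is known in advance:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "Age": rng.integers(18, 65, n),
    "Income": rng.integers(30000, 120000, n),
    "Visits": rng.integers(1, 30, n),
    "Previous_Purchases": rng.integers(0, 50, n),
})

# Assumed relationship, chosen only for illustration:
# spending is driven mostly by income, a little by visits, plus noise
y = 0.01 * X["Income"] + 5 * X["Visits"] + rng.normal(0, 20, n)

model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)

print(importance.sort_values(ascending=False))
```

With a real signal in the data, Income should dominate the ranking, while Age and Previous_Purchases, which play no role in the formula, should receive little or no importance.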
【Trivia】
Did you know that the DecisionTreeRegressor can sometimes be prone to
overfitting on small datasets? This occurs because it tends to create very
specific decision boundaries. One way to mitigate this is by adjusting the
max_depth parameter, which controls how deep the tree can grow, or by
using ensemble methods like RandomForestRegressor for more generalized
predictions.
27. Understanding Overfitting and Underfitting
with Learning Curves in Machine Learning
Importance★★★★★
Difficulty★★★☆☆
You are a data scientist working for a startup that wants to launch a product
recommendation system.
Your goal is to build a predictive model using a simulated dataset and
evaluate how well your model is learning from the data by examining learning
curves.
The dataset will include user information, product ratings, and other
features.
To help the startup avoid common pitfalls in machine learning, you need to
visualize the model's learning curves and explain the effects of overfitting
and underfitting to the team.
Generate a learning curve that compares training and validation accuracy as
you increase the amount of training data.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split, learning_curve

from sklearn.ensemble import RandomForestClassifier

## Create synthetic dataset

X, y = make_classification(n_samples=1000, n_features=20,
n_informative=15, n_redundant=5, random_state=42)

## Split dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,


random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split, learning_curve

from sklearn.ensemble import RandomForestClassifier

## Generate synthetic data

X, y = make_classification(n_samples=1000, n_features=20,
n_informative=15, n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

## Set up RandomForestClassifier and calculate learning curves


train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(), X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5, scoring='accuracy',
    n_jobs=-1)

## Calculate mean and std for training and validation scores

train_mean = np.mean(train_scores, axis=1)

train_std = np.std(train_scores, axis=1)

test_mean = np.mean(test_scores, axis=1)

test_std = np.std(test_scores, axis=1)

## Plot learning curves

plt.plot(train_sizes, train_mean, label='Training Accuracy')

plt.fill_between(train_sizes, train_mean - train_std, train_mean +


train_std, alpha=0.1)

plt.plot(train_sizes, test_mean, label='Validation Accuracy')

plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std,


alpha=0.1)

plt.xlabel('Training Size')

plt.ylabel('Accuracy')

plt.title('Learning Curves: Training vs Validation Accuracy')

plt.legend()
plt.grid()

plt.show()

Learning curves are essential in machine learning for assessing model
performance as a function of training data size.
In this case, a synthetic dataset is created with informative and redundant
features using the make_classification function.
Then, the data is split into training and testing sets using
train_test_split; the learning curve is computed on the training portion
only, leaving the test set untouched.
The learning_curve function from sklearn.model_selection calculates model
performance for different training set sizes.
This function uses cross-validation (cv=5) to give a more robust estimate of
the model's accuracy and returns scores for both training and validation
folds.
The RandomForestClassifier, chosen for its robustness, is refitted on larger
and larger subsets of the data as train_sizes increases.
The mean and standard deviation for both training and validation scores are
calculated using NumPy's mean and std functions.
Plotting the learning curve reveals overfitting or underfitting:
‣ Overfitting occurs if the training accuracy is high but validation
accuracy is low, suggesting the model memorizes training data but
generalizes poorly.
‣ Underfitting is indicated by low accuracy on both training and validation
sets, showing the model lacks capacity to learn from the data.
The plot shows both curves, with training and validation accuracies against
training size, providing insights into how data size affects model learning.
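The two failure modes can be made concrete by comparing training and validation accuracy for a deliberately weak and a deliberately flexible model. The sketch below swaps in single decision trees instead of the exercise's random forest, purely for illustration, because their capacity is easy to control through max_depth:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

results = {}
for name, depth in [("stump (max_depth=1)", 1), ("unrestricted tree", None)]:
    cv = cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X, y, cv=5, scoring="accuracy", return_train_score=True,
    )
    # Average the per-fold scores for training and validation
    results[name] = (cv["train_score"].mean(), cv["test_score"].mean())
    print(f"{name:22s} train={results[name][0]:.3f}  "
          f"validation={results[name][1]:.3f}")
```

The stump scores similarly (and modestly) on both sets, the signature of underfitting, while the unrestricted tree fits the training folds perfectly but does worse on the validation folds, the signature of overfitting.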
【Trivia】
Learning curves can also be useful for model selection.
By observing how validation scores change, you can detect early signs of
overfitting, allowing you to make adjustments such as reducing model
complexity or increasing training data.
Chapter 3 For advanced
1. DBSCAN Clustering and Visualization for
Customer Segmentation
Importance★★★★☆
Difficulty★★★☆☆
You are working for a retail company that wants to analyze customer
behavior based on purchasing data. Your task is to use DBSCAN clustering
to identify distinct groups of customers. The data includes two features: the
amount of money spent by each customer (in dollars) and the number of
purchases they made.
Generate some synthetic data representing these customers, apply DBSCAN
clustering to detect patterns, and visualize the resulting clusters.
Using Python and the DBSCAN algorithm, create a plot to visualize customer
groups based on their spending and purchasing behavior.
Make sure to plot the resulting clusters with different colors and mark the
outliers separately.
【Data Generation Code Example】
import numpy as np

## Create synthetic data representing customer behavior

# Generate random data for 200 customers, scaled
data = np.random.rand(200, 2) * [100, 50]
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN

from sklearn.preprocessing import StandardScaler

## Create synthetic data for customer behavior

data = np.random.rand(200, 2) * [100, 50]  # Random data for 200 customers

## Scale the data


data_scaled = StandardScaler().fit_transform(data)

## Apply DBSCAN clustering

dbscan = DBSCAN(eps=0.3, min_samples=5)

labels = dbscan.fit_predict(data_scaled)

## Plot the clusters

unique_labels = set(labels)

colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1,


len(unique_labels))]

for k, col in zip(unique_labels, colors):
    if k == -1:
        col = [0, 0, 0, 1]  # Black color for outliers
    class_member_mask = labels == k
    xy = data[class_member_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('DBSCAN Clustering of Customers')


plt.xlabel('Spending (in dollars)')

plt.ylabel('Number of Purchases')

plt.show()

This exercise demonstrates clustering using DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), which is a popular algorithm for
unsupervised machine learning. It groups points that are closely packed
together and identifies outliers, which are points that lie alone in
low-density regions.
The problem involves analyzing customer behavior data based on two features:
spending (in dollars) and the number of purchases.
First, synthetic data is generated to simulate the purchasing behavior of
200 customers. Each customer’s data is randomly generated and then
standardized using StandardScaler(), ensuring that both features are on a
similar scale. Because the data is uniformly random, it has no built-in
group structure; the clusters DBSCAN reports are regions where points happen
to be denser. DBSCAN is then applied with parameters eps=0.3 (maximum
distance between two samples for them to be considered part of the same
neighborhood) and min_samples=5 (minimum number of points, including the
point itself, that must lie within eps of a point for it to count as a core
point).
The result of DBSCAN is stored in the labels array, where each entry
indicates the cluster to which a point belongs, and -1 signifies outliers.
For visualization, clusters are plotted using different colors. The points
marked -1 are treated as outliers and are colored black. The plt.cm.Spectral
colormap is used to assign colors to different clusters. Clustering is
performed on the scaled data, but the points are plotted at their original
coordinates so the axes stay in dollars and purchase counts.
This approach helps businesses understand their customers' behavior, detect
anomalies, and tailor marketing strategies to specific customer groups.
【Trivia】
DBSCAN does not require specifying the number of clusters beforehand,
unlike k-means, and it can identify non-spherical clusters and handle noisy
datasets effectively. However, its results depend heavily on the eps and
min_samples parameters, and it struggles when clusters have very different
densities.
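The point about non-spherical clusters can be illustrated on the classic two-moons dataset, where the groups are curved and interleaved. This is a minimal sketch; the dataset, eps=0.2, and the comparison against KMeans are illustrative choices rather than part of the exercise:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a shape centroid-based methods handle poorly
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

n_clusters = len(set(db_labels) - {-1})       # label -1 marks noise points
n_noise = int((db_labels == -1).sum())

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
db_ari = adjusted_rand_score(true_labels, db_labels)
km_ari = adjusted_rand_score(true_labels, km_labels)

print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points, ARI={db_ari:.2f}")
print(f"KMeans: ARI={km_ari:.2f}")
```

DBSCAN follows the density along each arc, whereas KMeans splits the plane with a straight boundary and so mixes the two moons. Trying a much smaller or larger eps shows how sensitive the result is to that parameter.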
2. Visualizing the Effectiveness of Hyperparameter
Tuning in Machine Learning Models
Importance★★★★☆
Difficulty★★★☆☆
You are working for a company that wants to optimize their machine
learning models for predicting sales growth based on historical data.
Your task is to demonstrate how tuning a key hyperparameter can impact model
performance. Specifically, you need to visualize the prediction error (mean
squared error) of a Random Forest model as the number of estimators (trees)
varies.
Use the generated data to simulate sales data, tune the hyperparameter
(number of estimators), and plot the resulting error for different values.
【Data Generation Code Example】
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=20, noise=0.2,


random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,


random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor

from sklearn.datasets import make_regression

from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=20, noise=0.2,


random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

n_estimators_range = range(10, 210, 10)


errors = [mean_squared_error(y_test,
RandomForestRegressor(n_estimators=n, random_state=42).fit(X_train,
y_train).predict(X_test)) for n in n_estimators_range]

plt.plot(n_estimators_range, errors, marker='o')

plt.title("Random Forest Regressor: Effect of n_estimators on MSE")


plt.xlabel("Number of Estimators")

plt.ylabel("Mean Squared Error (MSE)")

plt.grid(True)

plt.show()

The exercise explores hyperparameter tuning for a Random Forest model,
with the primary focus on visualizing the impact of different values of the
n_estimators hyperparameter on model performance.
Random Forests are ensemble models that average the predictions of many
decision trees to improve accuracy and stability. The n_estimators parameter
controls the number of decision trees within the model. Generally, more
trees increase model stability, but the trade-off is computation time.
In this scenario:
‣ We first generate synthetic data to simulate sales data.
‣ We split this data into training and testing sets, a common practice to
assess model generalization.
‣ The model is then trained and tested across a range of values for
n_estimators, from 10 to 200 (in increments of 10). The test performance
(Mean Squared Error) is calculated for each model.
‣ Finally, we visualize the results by plotting the number of estimators
against the Mean Squared Error. This chart helps us identify the range of
n_estimators beyond which adding more trees no longer reduces the error
noticeably.
【Trivia】
‣ Random Forests, initially developed by Leo Breiman, were inspired by
the concept of bagging (bootstrap aggregating) and the random subspace
method.
‣ Increasing the number of estimators typically reduces the variance in
predictions, but does not necessarily lower the bias.
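The variance-reduction effect can be observed by refitting forests of different sizes with several random seeds and measuring how much their predictions disagree. This is a minimal sketch on the exercise's synthetic data; the helper name prediction_spread, the forest sizes (2 and 50), and the number of seeds are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

def prediction_spread(n_estimators, seeds=range(8)):
    """Mean (over test points) of the std of predictions across random seeds."""
    preds = np.array([
        RandomForestRegressor(n_estimators=n_estimators, random_state=s)
        .fit(X_train, y_train)
        .predict(X_test)
        for s in seeds
    ])
    return preds.std(axis=0).mean()

spread_small = prediction_spread(2)
spread_large = prediction_spread(50)
print(f"Spread across seeds with  2 trees: {spread_small:.2f}")
print(f"Spread across seeds with 50 trees: {spread_large:.2f}")
```

A larger forest averages out more of the randomness from bootstrapping and feature sampling, so its predictions change less from one seed to the next.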
3. Bias-Variance Tradeoff in Predictive Modeling
Importance★★★★★
Difficulty★★★☆☆
You are working as a data scientist for a retail company. The company
wants to predict future sales based on historical data, but they are facing
issues with overfitting and underfitting. Your task is to demonstrate the
bias-variance tradeoff using a simple polynomial regression model.
Generate synthetic sales data for a specific product, and fit the data using
polynomial regression models of different degrees (1, 3, and 9). Plot the
training error and the testing error to visualize how bias and variance
affect the model performance.
Explain which model shows underfitting and which one shows overfitting based
on the bias-variance tradeoff.
【Data Generation Code Example】
import numpy as np

from sklearn.model_selection import train_test_split

# Generate synthetic sales data

X = np.linspace(0, 10, 100)

y = 3 * X ** 2 + 5 * X + np.random.normal(0, 5, size=100)

# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,


random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import PolynomialFeatures

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

# Generate synthetic sales data


X = np.linspace(0, 10, 100)

y = 3 * X ** 2 + 5 * X + np.random.normal(0, 5, size=100)

# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,


random_state=42)

# Reshape X to 2D array for fitting into model

X_train = X_train.reshape(-1, 1)

X_test = X_test.reshape(-1, 1)

# Create lists to store errors for plotting

degrees = [1, 3, 9]
train_errors = []

test_errors = []

# Loop through different polynomial degrees


for degree in degrees:
    poly = PolynomialFeatures(degree)
    X_train_poly = poly.fit_transform(X_train)
    # Reuse the transformer fitted on the training data for the test data
    X_test_poly = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    y_train_pred = model.predict(X_train_poly)
    y_test_pred = model.predict(X_test_poly)

    train_errors.append(mean_squared_error(y_train, y_train_pred))
    test_errors.append(mean_squared_error(y_test, y_test_pred))

# Plot the bias-variance tradeoff

plt.figure()

plt.plot(degrees, train_errors, label='Training Error')

plt.plot(degrees, test_errors, label='Testing Error')

plt.xlabel('Model Complexity (Polynomial Degree)')

plt.ylabel('Mean Squared Error')

plt.title('Bias-Variance Tradeoff')
plt.legend()

plt.show()

In this exercise, the goal is to understand how model complexity impacts
performance through the bias-variance tradeoff.
To begin with, synthetic data is generated based on a quadratic function with
added noise to simulate real-world sales data. This noisy data helps reflect
the randomness and variability seen in actual datasets.
Once the data is generated, it is split into training and testing sets to
allow for model evaluation. The training data will be used to fit the models,
and the testing data will evaluate how well the models generalize to unseen
data.
A polynomial regression model is used because it can represent increasing
levels of model complexity by adding polynomial terms to the linear
regression. In this example, models with degrees of 1, 3, and 9 are tested. A
degree of 1 is a simple linear model, while a degree of 9 fits a more complex
curve that could overfit the data.
For each polynomial degree, the training data is transformed into a
higher-dimensional feature space, and the model is trained to predict the
output. Afterward, the model makes predictions on both the training and
testing data.
The mean squared error (MSE) is calculated for both the training and testing
data to measure how far the predictions are from the actual values. The
training error shows how well the model fits the data it has seen, while the
testing error measures how well the model generalizes to new data.
The plot reveals the bias-variance tradeoff:
‣ For the degree-1 model (low complexity), the training and testing errors
are both high. This indicates underfitting, where the model cannot capture
the data patterns, leading to high bias.
‣ For the degree-9 model (high complexity), the training error is very low,
but the testing error is high. This is a sign of overfitting, where the model
learns the noise in the training data, resulting in high variance.
‣ The degree-3 model shows a balance, with lower training and testing errors,
meaning it better captures the underlying data patterns without overfitting.
This demonstrates that increasing model complexity reduces bias but increases
variance, and finding the right balance is crucial for optimal model
performance.
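Three degrees give only three points on the error curve. The sketch below
sweeps every degree from 1 to 10 and reports the one with the lowest test
error. It is a minimal illustration under stated assumptions: the data
(a noisy quadratic on the interval -3 to 3) and the helper name
errors_by_degree are hypothetical and are not the data or code used for the
plot above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: a noisy quadratic (an assumption for illustration)
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = 2 * X.ravel() ** 2 + X.ravel() + rng.normal(0, 2, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def errors_by_degree(degrees):
    """Return (degree, train MSE, test MSE) for each polynomial degree."""
    rows = []
    for degree in degrees:
        poly = PolynomialFeatures(degree)
        # Fit the expansion on the training data only
        model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
        train_mse = mean_squared_error(y_train, model.predict(poly.transform(X_train)))
        test_mse = mean_squared_error(y_test, model.predict(poly.transform(X_test)))
        rows.append((degree, train_mse, test_mse))
    return rows


results = errors_by_degree(range(1, 11))
best_degree = min(results, key=lambda row: row[2])[0]
for degree, train_mse, test_mse in results:
    print(f"degree={degree:2d}  train MSE={train_mse:7.3f}  test MSE={test_mse:7.3f}")
print("Degree with the lowest test MSE:", best_degree)
```

The training error can only go down as terms are added, so the degree should
be chosen from the test (or cross-validation) error, never from the training
error.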
【Trivia】
The bias-variance tradeoff is a key concept in machine learning, explaining
the tension between making a model too simple (high bias) and too
complex (high variance). It is a fundamental challenge in predictive
modeling and helps guide the choice of model complexity.
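For squared-error loss, this tension is commonly written as a decomposition
of the expected prediction error at a point x. The notation is an assumption
introduced here for illustration: y = f(x) + ε with noise variance σ², f is
the true function, and f̂ is the model fitted on a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

A simple model tends to have a large first term, a very flexible model tends
to have a large second term, and no model can reduce the third.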
4. Visualizing the Impact of Data Augmentation on
Image Classification
Importance★★★★★
Difficulty★★★☆☆
A client in the retail fashion industry wants to improve their image
classification model for detecting clothing types in photos. However, they
only have a small dataset and are worried about overfitting. To combat this,
they plan to use data augmentation techniques.
Your task is to implement data augmentation on a small dataset of grayscale
images and then visualize the augmented images. Generate a dataset of 10
sample grayscale images and apply augmentations such as rotation, flipping,
and zooming.
Visualize the original and augmented images in a single plot to analyze the
effects of augmentation. This will help the client understand how augmented
images differ from the original data.
【Data Generation Code Example】
import numpy as np

from tensorflow.keras.preprocessing.image import ImageDataGenerator

import matplotlib.pyplot as plt

# Create 10 random grayscale images of 64x64 pixels

np.random.seed(42)

images = np.random.rand(10, 64, 64, 1)


【Diagram Answer】

【Code Answer】
import numpy as np

from tensorflow.keras.preprocessing.image import ImageDataGenerator

import matplotlib.pyplot as plt

## Generating random grayscale images

np.random.seed(42)

images = np.random.rand(10, 64, 64, 1)

## Set up data augmentation

datagen = ImageDataGenerator(rotation_range=30,
width_shift_range=0.2, height_shift_range=0.2, zoom_range=0.2,
horizontal_flip=True)

## Apply the augmentation to the images
augmented_images = [datagen.random_transform(images[i]) for i in range(10)]

## Plot the original and augmented images
plt.figure(figsize=(10, 4))
for i in range(10):
    ## Original images (top row)
    plt.subplot(2, 10, i + 1)
    plt.imshow(images[i].reshape(64, 64), cmap='gray')
    plt.axis('off')
    ## Augmented images (bottom row)
    plt.subplot(2, 10, i + 11)
    plt.imshow(augmented_images[i].reshape(64, 64), cmap='gray')
    plt.axis('off')
plt.suptitle('Original vs Augmented Images')
plt.show()

In this problem, we tackle the common issue of overfitting in image
classification by using data augmentation.
Data augmentation generates variations of the original data, helping the
model generalize better to unseen data. Here, we use the ImageDataGenerator
class from TensorFlow's Keras API to apply several transformations, including
rotation, shifting, flipping, and zooming. These operations simulate
different perspectives of an image, improving the robustness of the model.
We start by generating a small dataset of 10 random grayscale images (64x64
pixels), which serves as the base for applying augmentation. Then, we define
a datagen object to specify the augmentation parameters, including:
‣ rotation_range: This randomly rotates the images by up to 30 degrees in
either direction.
‣ width_shift_range and height_shift_range: These shift the images
horizontally or vertically by up to 20% of the image's width or height.
‣ zoom_range: This randomly zooms in or out by up to 20% (a zoom factor
between 0.8 and 1.2).
‣ horizontal_flip: This randomly flips images horizontally.
Next, we apply these transformations to each of the images using the
random_transform method.
To visualize the effect of the augmentation, we plot both the original and
augmented versions of the images in a single figure. The top row shows the
original 10 images, while the bottom row displays the augmented ones.
This process helps illustrate how data augmentation artificially increases
the size of the dataset and introduces variability, which is essential for
training robust machine learning models.
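To see what two of these transformations do to the pixel array itself, the
sketch below applies a horizontal flip and a pixel shift with plain NumPy. It
is not how ImageDataGenerator is implemented: the helper names
horizontal_flip and shift are hypothetical, and the shift wraps pixels around
the border, which is a simplifying assumption rather than the Keras behavior.

```python
import numpy as np


def horizontal_flip(image):
    """Mirror an image of shape (height, width, channels) left to right."""
    return image[:, ::-1, :]


def shift(image, dx, dy):
    """Shift an image by dx columns and dy rows, wrapping around the border."""
    return np.roll(image, shift=(dy, dx), axis=(0, 1))


rng = np.random.default_rng(42)
image = rng.random((64, 64, 1))  # one random grayscale image

flipped = horizontal_flip(image)
shifted = shift(image, dx=5, dy=-3)

print(flipped.shape, shifted.shape)
```

Both operations rearrange existing pixels without changing their values,
which is why the label of an augmented image can be kept the same as the
label of the original.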
【Trivia】
Data augmentation is a crucial technique in deep learning for improving
model generalization, especially when working with small datasets. It can
be used in various domains beyond images, including text and speech,
where transformations like word swapping or pitch alteration are applied
for similar benefits.
5. Visualizing Multicollinearity Using VIF in a
Marketing Dataset
Importance★★★★☆
Difficulty★★★☆☆
A marketing agency wants to understand how different advertising channels
contribute to product sales. They have collected data on TV, Radio, and
Newspaper ad spending, but they suspect that some of the channels might
be highly correlated, leading to multicollinearity in their regression
model.
You are tasked with identifying and visualizing multicollinearity in the
dataset by calculating the Variance Inflation Factor (VIF) for each
advertising channel. Create a scatter matrix to visualize the relationships
between the variables, then compute and display the VIF for each feature to
help the marketing team decide which variable may need to be adjusted.
Using Python, generate the dataset and follow the steps outlined. Your goal
is to create a plot that visualizes the pairwise relationships and to report
the VIF of each feature.
【Data Generation Code Example】
import numpy as np

import pandas as pd

# Generate random data for TV, Radio, and Newspaper ad spend

np.random.seed(42)

TV = np.random.normal(300, 50, 100)

Radio = TV * 0.6 + np.random.normal(30, 10, 100)


Newspaper = np.random.normal(50, 15, 100)

# Create a DataFrame with these features

data = pd.DataFrame({'TV': TV, 'Radio': Radio, 'Newspaper':


Newspaper})
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt


import seaborn as sns

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Generate random data for TV, Radio, and Newspaper ad spend
np.random.seed(42)
TV = np.random.normal(300, 50, 100)
Radio = TV * 0.6 + np.random.normal(30, 10, 100)
Newspaper = np.random.normal(50, 15, 100)

# Create a DataFrame with these features
data = pd.DataFrame({'TV': TV, 'Radio': Radio, 'Newspaper': Newspaper})

# Visualize pairwise relationships between the variables
sns.pairplot(data)
plt.suptitle("Scatter Matrix of Advertising Channels", y=1.02)
plt.show()

# Calculate VIF to check multicollinearity.
# variance_inflation_factor regresses each column on the others without
# adding an intercept itself, so a constant column is added first. Without
# it, the nonzero feature means inflate every reported value.
X = add_constant(data)
vif_data = pd.DataFrame()
vif_data["Feature"] = data.columns
# Column 0 of X is the constant, so the features start at index 1
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                   for i in range(1, X.shape[1])]
print(vif_data)
The objective of this task is to help users understand multicollinearity,
which occurs when two or more predictors in a regression model are highly
correlated, and to detect it using the Variance Inflation Factor (VIF). High
multicollinearity can cause problems in model interpretation and prediction
accuracy. In this exercise, the dataset consists of three advertising
channels (TV, Radio, and Newspaper), and the goal is to visualize their
relationships using a scatter matrix and compute the VIF to detect any
multicollinearity.
The pairplot from the Seaborn library helps visualize the relationships
between the advertising channels. If the scatter plots between any two
features show a clear linear pattern, it suggests that those variables are
correlated, which could lead to multicollinearity.
The Variance Inflation Factor (VIF) quantifies how much the variance of a
regression coefficient is inflated due to multicollinearity. A VIF above a
threshold of 5 or 10 (conventions vary) is commonly considered problematic,
indicating that the feature is highly correlated with others in the dataset
and might need to be removed or adjusted.
In the provided Python code, the variance_inflation_factor function is used
to calculate VIF for each feature (TV, Radio, Newspaper). The output shows
which feature contributes most to multicollinearity. If a feature's VIF is
high, the marketing team may need to reconsider how that channel is measured
or its importance in the regression model.
This exercise is a useful way to illustrate one of the core challenges in
regression analysis: multicollinearity and its detection. It is especially
valuable in marketing, where spending on different ad channels often moves
together.
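The VIF of a feature can also be computed directly from its definition,
VIF = 1 / (1 - R²), where R² comes from regressing that feature on all the
other features. The sketch below does this with scikit-learn on the same
synthetic data. The helper name vif_from_r2 is hypothetical, and the code is
an illustration of the definition rather than the statsmodels
implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def vif_from_r2(frame):
    """Compute VIF = 1 / (1 - R^2) for every column of a DataFrame."""
    vifs = {}
    for column in frame.columns:
        others = frame.drop(columns=column)
        # LinearRegression fits an intercept by default
        model = LinearRegression().fit(others, frame[column])
        r2 = model.score(others, frame[column])
        vifs[column] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")


# Same synthetic data as in the exercise
np.random.seed(42)
TV = np.random.normal(300, 50, 100)
Radio = TV * 0.6 + np.random.normal(30, 10, 100)
Newspaper = np.random.normal(50, 15, 100)
data = pd.DataFrame({'TV': TV, 'Radio': Radio, 'Newspaper': Newspaper})

vif = vif_from_r2(data)
print(vif.round(2))
```

Because Radio is built from TV, both should show a clearly elevated VIF,
while Newspaper, which is generated independently, should stay close to 1.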
【Trivia】
The concept of multicollinearity in regression was first introduced in 1934
by Ragnar Frisch, a Nobel Prize-winning economist.
6. Feature Selection Techniques and Their Impact
on Model Performance
Importance★★★★★
Difficulty★★★★☆
You are working for a healthcare startup that wants to predict whether a
patient will develop diabetes based on various features such as age, BMI,
and blood pressure.
Your task is to use feature selection techniques to identify which features
are most important for predicting diabetes. Then, visualize the effect of
these selected features on model performance using a classifier (e.g.,
Logistic Regression).
Please generate synthetic data using the following features:
‣ Age
‣ BMI (Body Mass Index)
‣ Blood Pressure
‣ Insulin Level
‣ Glucose Level
After generating the data, perform feature selection using any technique of
your choice (e.g., Recursive Feature Elimination (RFE), SelectKBest, etc.),
and compare the model performance with and without feature selection by
plotting the results.
【Data Generation Code Example】
import numpy as np

import pandas as pd

from sklearn.datasets import make_classification

#Generate synthetic data for diabetes prediction

X, y = make_classification(n_samples=500, n_features=5,
n_informative=3, n_classes=2, random_state=42)

features = ['Age', 'BMI', 'BloodPressure', 'InsulinLevel', 'GlucoseLevel']


df = pd.DataFrame(X, columns=features)

df['Diabetes'] = y
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split


from sklearn.linear_model import LogisticRegression

from sklearn.feature_selection import RFE

from sklearn.metrics import accuracy_score


#Generate synthetic data for diabetes prediction

X, y = make_classification(n_samples=500, n_features=5,
n_informative=3, n_classes=2, random_state=42)

features = ['Age', 'BMI', 'BloodPressure', 'InsulinLevel', 'GlucoseLevel']

df = pd.DataFrame(X, columns=features)

df['Diabetes'] = y

#Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,


random_state=42)

#Initialize Logistic Regression model

model = LogisticRegression()

#Use Recursive Feature Elimination (RFE) for feature selection

selector = RFE(model, n_features_to_select=3)

selector = selector.fit(X_train, y_train)

#Get the selected features

selected_features = np.array(features)[selector.support_]

#Train the model with all features and selected features

model.fit(X_train, y_train)

y_pred_all = model.predict(X_test)

accuracy_all = accuracy_score(y_test, y_pred_all)

X_train_selected = selector.transform(X_train)

X_test_selected = selector.transform(X_test)
model.fit(X_train_selected, y_train)

y_pred_selected = model.predict(X_test_selected)

accuracy_selected = accuracy_score(y_test, y_pred_selected)

#Plot the model performance before and after feature selection

plt.bar(['All Features', 'Selected Features'], [accuracy_all,


accuracy_selected])

plt.title('Model Accuracy: All Features vs. Selected Features')

plt.ylabel('Accuracy')

plt.show()

This exercise involves performing feature selection and comparing the impact
of selected features on model performance.
We first generated synthetic data using the make_classification function,
with five features: Age, BMI, Blood Pressure, Insulin Level, and Glucose
Level. This data represents hypothetical patient information to predict
diabetes presence (binary classification: 0 or 1).
We split the data into training and testing sets using the train_test_split
function to ensure we have a portion of the data for training the model and
another for evaluating its performance.
The Logistic Regression model is selected for this task because it's commonly
used for binary classification problems like diabetes prediction.
The Recursive Feature Elimination (RFE) technique is used to select the three
most important features. RFE recursively fits the model, eliminates the least
important feature, and repeats the process until the specified number of
features (in this case, 3) remains.
We then trained the Logistic Regression model twice, once with all features
and once with the selected features, and calculated the accuracy for both
scenarios using accuracy_score.
Finally, we created a bar chart to visualize the comparison between the
model's performance when all features are used versus when only the selected
features are used. This helps illustrate how feature selection can improve or
maintain model performance while reducing the complexity of the model.
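The exercise allows any selection technique, so the sketch below shows
SelectKBest as an alternative to RFE. It is not the book's answer: the choice
of f_classif as the scoring function, k=3, and 5-fold cross-validation are
assumptions made for illustration. Placing the selector inside a pipeline
means it is refit on each training fold, so the held-out fold never
influences which features are kept.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Same synthetic data as in the exercise
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_classes=2, random_state=42)

# Baseline: logistic regression on all five features
all_features = LogisticRegression()

# Alternative: keep the three features with the highest ANOVA F-score
top_three = make_pipeline(SelectKBest(score_func=f_classif, k=3),
                          LogisticRegression())

scores_all = cross_val_score(all_features, X, y, cv=5)
scores_selected = cross_val_score(top_three, X, y, cv=5)

print(f"All features:      {scores_all.mean():.3f}")
print(f"Top three (KBest): {scores_selected.mean():.3f}")
```

Averaging over several folds gives a steadier comparison than a single
train/test split, where a small accuracy gap may be due to chance.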
【Trivia】
Feature selection techniques are particularly useful in high-dimensional
datasets where too many features may lead to overfitting, slow down
training, and introduce noise. By selecting only the most relevant features,
you can simplify the model, make it more interpretable, and potentially
improve generalization to unseen data.
7. Time Series Decomposition for Sales Forecasting
Importance★★★★★
Difficulty★★★★☆
A retail company is facing a challenge of understanding its sales patterns
for better demand forecasting. They have recorded sales data over the past
few years, and they suspect that the data consists of seasonal trends, cyclic
patterns, and some irregularities.
The goal is to decompose this time series data to extract its underlying
components and visualize them.
Your task is to simulate this sales data and apply time series decomposition.
Create a function that breaks the data into three parts: trend, seasonal, and
residual components, and then plot the results. Use Python libraries and
ensure the function plots all the components properly.
【Data Generation Code Example】
import numpy as np

import pandas as pd

np.random.seed(0)

# Create a time series with trend, seasonality, and noise

date_range = pd.date_range(start='2020-01-01', periods=365, freq='D')

sales_data = 50 + 0.2 * np.arange(365) + 10 * np.sin(np.linspace(0, 3 *


np.pi, 365)) + np.random.normal(0, 1, 365)

sales_df = pd.DataFrame({'Date': date_range, 'Sales': sales_data})


【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt


from statsmodels.tsa.seasonal import seasonal_decompose

np.random.seed(0)

date_range = pd.date_range(start='2020-01-01', periods=365, freq='D')

sales_data = 50 + 0.2 * np.arange(365) + 10 * np.sin(np.linspace(0, 3 *


np.pi, 365)) + np.random.normal(0, 1, 365)
sales_df = pd.DataFrame({'Date': date_range, 'Sales': sales_data})

# Convert date to index for decomposition

sales_df.set_index('Date', inplace=True)

# Decompose the time series

result = seasonal_decompose(sales_df['Sales'], model='additive',


period=30)
# Plot the decomposition

result.plot()

plt.show()

Time series decomposition is a useful technique in machine learning for
understanding the structure of data over time. In this case, we decompose
the sales data into three primary components:
‣ Trend: This shows the overall direction of the data over time, helping us
observe whether sales are increasing or decreasing.
‣ Seasonal: This reflects repeating patterns over regular intervals, such as
monthly or weekly sales fluctuations.
‣ Residual: These are the random variations that cannot be explained by the
trend or seasonality.
First, we import necessary libraries such as numpy, pandas, and matplotlib.
The statsmodels.tsa.seasonal module contains the seasonal_decompose function,
which allows us to decompose the time series. The decomposition is performed
using an additive model, meaning that the components (trend, seasonal,
residual) are summed to form the original time series.
The data we use here is simulated to have a trend, a periodic pattern, and
some noise. We simulate a sales trend increasing over time, add a sinusoidal
function to introduce seasonality, and finally introduce random noise to
represent the irregular component. After that, we convert the Date column
into the index for proper time series analysis and perform the decomposition
using the seasonal_decompose function. Finally, we plot the decomposition
results to visualize each component.
Note that period=30 tells seasonal_decompose to look for a pattern that
repeats every 30 observations. The simulated sinusoid completes only 1.5
cycles over the 365 days, so most of it shows up in the trend component
rather than in the seasonal one. In practice, period should match the length
of the seasonal cycle in the data.
This method is crucial for machine learning tasks like forecasting, anomaly
detection, or understanding the nature of business data in a time series
format.
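To make the additive model concrete, the sketch below carries out a
simplified classical decomposition by hand with pandas. It is an illustration
under stated assumptions, not the statsmodels implementation: the helper name
additive_decompose is hypothetical, it handles odd periods only, and the
series is a made-up example with a 7-day cycle so that the period passed in
matches the cycle in the data.

```python
import numpy as np
import pandas as pd


def additive_decompose(series, period):
    """Simplified classical additive decomposition (odd period only)."""
    # Trend: centered moving average over one full cycle
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # Seasonal: average detrended value at each position within the cycle
    position = np.arange(len(series)) % period
    cycle_means = detrended.groupby(position).mean()
    cycle_means = cycle_means - cycle_means.mean()  # make the cycle sum to zero
    seasonal = pd.Series(cycle_means.to_numpy()[position], index=series.index)
    # Residual: whatever the trend and the seasonal pattern do not explain
    residual = series - trend - seasonal
    return trend, seasonal, residual


# Hypothetical daily sales with a linear trend and a weekly cycle
rng = np.random.default_rng(0)
days = np.arange(140)
index = pd.date_range("2020-01-01", periods=140, freq="D")
true_weekly = 5 * np.sin(2 * np.pi * days / 7)
sales = pd.Series(50 + 0.2 * days + true_weekly + rng.normal(0, 0.5, 140),
                  index=index)

trend, seasonal, residual = additive_decompose(sales, period=7)
print(seasonal.head(7).round(2))
```

The first and last few trend values are missing because the centered window
does not fit at the edges of the series; the same gaps appear in the
residual.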
【Trivia】
The concept of time series decomposition dates back to early 20th-century
econometrics, where researchers began separating time series into long-
term trends and periodic behaviors.
8. Building and Visualizing a Movie
Recommendation System using Collaborative
Filtering
Importance★★★★☆
Difficulty★★★☆☆
Imagine you are tasked with creating a recommendation system for a movie
streaming platform. The platform has user ratings for various movies, and
your goal is to predict user preferences based on historical data.
You are asked to create a collaborative filtering-based recommendation system
using matrix factorization (SVD). Additionally, you need to visualize the
prediction accuracy by plotting the error (RMSE) after training the model.
Use a simulated dataset of user ratings.
You need to:
‣ Create a synthetic dataset of user ratings for at least 6 users and 6
movies.
‣ Implement matrix factorization (SVD) to predict the ratings.
‣ Calculate and display the RMSE.
‣ Plot the predicted vs actual ratings for visualization purposes.
【Data Generation Code Example】
import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

## Generating a random user-item rating matrix

np.random.seed(42)

ratings = pd.DataFrame(np.random.randint(1, 6, size=(6, 6)),

columns=[f'Movie_{i}' for i in range(1, 7)],

index=[f'User_{i}' for i in range(1, 7)])

## Flattening the matrix to make it suitable for training

ratings = ratings.stack().reset_index()
ratings.columns = ['User', 'Movie', 'Rating']

## Splitting data into train and test sets

train_data, test_data = train_test_split(ratings, test_size=0.2,


random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

from sklearn.decomposition import TruncatedSVD

from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

## Generating synthetic data


np.random.seed(42)

ratings = pd.DataFrame(np.random.randint(1, 6, size=(6, 6)),

columns=[f'Movie_{i}' for i in range(1, 7)],

index=[f'User_{i}' for i in range(1, 7)])

ratings = ratings.stack().reset_index()

ratings.columns = ['User', 'Movie', 'Rating']

train_data, test_data = train_test_split(ratings, test_size=0.2,


random_state=42)

## Building a utility matrix for collaborative filtering

utility_matrix = train_data.pivot(index='User', columns='Movie',


values='Rating').fillna(0)

## Performing Singular Value Decomposition (SVD)

svd = TruncatedSVD(n_components=2, random_state=42)

user_factors = svd.fit_transform(utility_matrix)

movie_factors = svd.components_

## Predicting ratings for the test set

predictions = [np.dot(user_factors[utility_matrix.index.get_loc(u)],
movie_factors[:, utility_matrix.columns.get_loc(m)])

for u, m in zip(test_data['User'], test_data['Movie'])]

## Calculating RMSE

rmse = np.sqrt(mean_squared_error(test_data['Rating'], predictions))

print(f'RMSE: {rmse:.2f}')
## Visualizing predicted vs actual ratings

plt.scatter(test_data['Rating'], predictions)

plt.plot([1, 5], [1, 5], color='red', linestyle='--')

plt.xlabel('Actual Ratings')

plt.ylabel('Predicted Ratings')

plt.title('Actual vs Predicted Ratings')

plt.show()

The task here is to build a simple recommendation system using


collaborative filtering based on matrix factorization, specifically Singular
Value Decomposition (SVD).
Collaborative filtering is a popular method used to recommend items to
users by leveraging past interactions or preferences of similar users.
To begin, a synthetic dataset of user ratings is created for simplicity. Each
user is assumed to have rated several movies on a scale of 1 to 5.
This dataset is split into training and testing subsets to simulate the process
of training the model on past data and predicting future preferences.
The utility matrix is created where rows represent users, and columns
represent movies. Each cell in the matrix contains a user's rating for a
particular movie.
Matrix factorization via SVD is then applied to decompose this matrix into
two smaller matrices—one representing the latent features of users and the
other representing the latent features of movies.
The dot product of these two matrices helps predict ratings for unseen
movies.
To evaluate the accuracy of the system, we calculate the Root Mean Square
Error (RMSE), a standard metric to measure the difference between
predicted and actual values.
A lower RMSE indicates that the system predicts ratings more accurately.
Finally, a plot of the actual versus predicted ratings helps visually
understand the model's performance.
In this example, the red dashed line represents a perfect prediction. If the
points lie close to this line, it indicates that the predicted ratings closely
match the actual ratings.
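The factorization step can be sketched with NumPy alone. The example below
decomposes a small, fully observed rating matrix and rebuilds it from the
top k singular values. The matrix and the helper names low_rank and rmse are
hypothetical, and the error reported is a reconstruction error on known
ratings, not a prediction error on held-out ratings as in the exercise.

```python
import numpy as np

# Hypothetical 6x6 rating matrix (users x movies), ratings from 1 to 5
rng = np.random.default_rng(42)
ratings = rng.integers(1, 6, size=(6, 6)).astype(float)

# Full SVD: ratings = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)


def low_rank(k):
    """Rebuild the matrix from the k largest singular values."""
    user_factors = U[:, :k] * s[:k]   # one row of k latent features per user
    movie_factors = Vt[:k, :]         # one column of k latent features per movie
    return user_factors @ movie_factors


def rmse(actual, predicted):
    """Root mean square error between two arrays of equal shape."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


for k in (1, 2, 6):
    print(f"k={k}: reconstruction RMSE = {rmse(ratings, low_rank(k)):.3f}")
```

Keeping more components always lowers the reconstruction error, and keeping
all six reproduces the matrix exactly. A recommender deliberately keeps only
a few, so that the latent features capture shared taste rather than memorize
individual ratings.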
【Trivia】
Collaborative filtering is used by major platforms such as Netflix and
Amazon to recommend products and movies. The Netflix Prize competition,
launched in 2006 and awarded in 2009, significantly advanced research in this
field, with the winning team improving the recommendation accuracy by over
10%.
9. Exploratory Data Analysis Using Facet Grids for
Customer Segmentation
Importance★★★★☆
Difficulty★★★☆☆
You are working with a retail company that wants to understand its
customers' purchasing patterns based on different customer segments. The
company has demographic information (age and gender) and purchasing data
(monthly spend) for each customer.
Your task is to analyze the relationship between customer demographics and
their monthly spending. Use Python to create a visualization using facet
grids, where you display the distribution of monthly spending by age, broken
down by gender. This will help the company to segment their customers and
target specific groups effectively.
Generate the data yourself using Python, where:
‣ Age is a random integer between 18 and 65.
‣ Gender is either 'Male' or 'Female'.
‣ Monthly spend is a random value between 100 and 1000.
Use the visualization to provide insights on spending behavior across age
groups and genders.
【Data Generation Code Example】
import numpy as np

import pandas as pd

np.random.seed(42)
ages = np.random.randint(18, 66, 200)

genders = np.random.choice(['Male', 'Female'], 200)

spend = np.random.uniform(100, 1000, 200)

data = pd.DataFrame({'Age': ages, 'Gender': genders, 'Monthly Spend':


spend})
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt


np.random.seed(42)

ages = np.random.randint(18, 66, 200)

genders = np.random.choice(['Male', 'Female'], 200)

spend = np.random.uniform(100, 1000, 200)

data = pd.DataFrame({'Age': ages, 'Gender': genders, 'Monthly Spend':


spend})

sns.set(style="whitegrid")

#Create a facet grid based on Gender


g = sns.FacetGrid(data, col="Gender", height=5, aspect=1)

g.map_dataframe(sns.scatterplot, x="Age", y="Monthly Spend")

g.set_axis_labels("Age", "Monthly Spend")

g.add_legend()

plt.subplots_adjust(top=0.85)

g.fig.suptitle('Customer Monthly Spend by Age and Gender')

plt.show()

This exercise focuses on analyzing customer data using Python and a popular
visualization library, Seaborn. You are tasked with creating a scatter plot
using facet grids to analyze how age and gender impact customer spending.
Facet grids are a great tool for visualizing multiple subsets of a dataset,
breaking it down by categories (in this case, gender).
To begin, you generate the data. Each customer is assigned a random age, a
gender, and a random monthly spending amount. The use of NumPy helps with
efficient generation of random data, while Pandas helps organize this data
into a manageable format (a DataFrame).
In the solution code, Seaborn's FacetGrid is used to create separate scatter
plots for each gender. The FacetGrid is configured with col="Gender", meaning
a separate plot is created for each gender. The map_dataframe function
specifies that a scatter plot should be drawn for each facet. The scatter
plot maps "Age" to the x-axis and "Monthly Spend" to the y-axis, showing the
distribution of spending within different age groups. The axis labels and
titles are added for clarity, and the plt.show() command displays the final
plot.
This type of visualization makes it easy to compare spending behaviors
between males and females, helping the company identify key customer
segments. The scatter plots provide insights into spending patterns by age
and gender, which could guide marketing strategies for specific
demographics. For example, if younger females tend to spend more, the company
might focus on promotions or products targeting that group.
This is a valuable technique for exploring how different categories within
your data behave, especially when analyzing customer demographics and
segmentation.
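A facet grid shows the pattern visually; a small summary table can put
numbers next to it. The sketch below groups the same synthetic customers
into age bands and reports the mean spend per band and gender. The band
edges and labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the exercise
np.random.seed(42)
ages = np.random.randint(18, 66, 200)
genders = np.random.choice(['Male', 'Female'], 200)
spend = np.random.uniform(100, 1000, 200)
data = pd.DataFrame({'Age': ages, 'Gender': genders, 'Monthly Spend': spend})

# Hypothetical age bands: (17, 29], (29, 44], (44, 65]
bins = [17, 29, 44, 65]
labels = ['18-29', '30-44', '45-65']
data['Age Group'] = pd.cut(data['Age'], bins=bins, labels=labels)

# Mean monthly spend for every combination of age band and gender
summary = data.pivot_table(index='Age Group', columns='Gender',
                           values='Monthly Spend', aggfunc='mean',
                           observed=False)
print(summary.round(1))
```

Because the spend values here are drawn uniformly at random, no real
difference between the groups is expected, and any gaps in the table are
sampling noise. With real customer data, the same table would show where
spending actually differs.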
【Trivia】
The FacetGrid feature in Seaborn is a powerful visualization tool, especially
useful for breaking down data by categories such as gender, region, or
product type. It automatically creates a grid of plots, making it easy to
compare different subsets of your data in a visually intuitive way.
10. Creating Synthetic Data for Predicting House
Prices
Importance★★★★★
Difficulty★★★☆☆
A real estate company wants to create a model to predict house prices based
on various features like house size, number of bedrooms, and distance from
the city center.
You are required to generate synthetic data for this problem, create a
regression model, and visualize the relationship between the predicted and
actual house prices.
Please write code to:
‣ Generate synthetic data with features such as size (square footage),
bedrooms (number of bedrooms), and distance_from_city (miles).
‣ Train a linear regression model using this data.
‣ Plot the predicted house prices against the actual house prices to evaluate
the model's performance.
【Data Generation Code Example】
import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(42)
size = np.random.randint(800, 4000, 500)  # house size in sq. ft
bedrooms = np.random.randint(1, 6, 500)  # number of bedrooms
distance_from_city = np.random.uniform(1, 50, 500)  # distance from city center in miles

# Generate target variable (house prices) with some noise
price = ((size * 200) + (bedrooms * 10000) - (distance_from_city * 500)
         + np.random.normal(0, 25000, 500))

# Combine features into a dataframe
df = pd.DataFrame({'size': size, 'bedrooms': bedrooms,
                   'distance_from_city': distance_from_city, 'price': price})
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
size = np.random.randint(800, 4000, 500)  # house size in sq. ft
bedrooms = np.random.randint(1, 6, 500)  # number of bedrooms
distance_from_city = np.random.uniform(1, 50, 500)  # distance from city center in miles

# Generate target variable (house prices) with some noise
price = ((size * 200) + (bedrooms * 10000) - (distance_from_city * 500)
         + np.random.normal(0, 25000, 500))

df = pd.DataFrame({'size': size, 'bedrooms': bedrooms,
                   'distance_from_city': distance_from_city, 'price': price})

# Split data into training and testing sets


X = df[['size', 'bedrooms', 'distance_from_city']]

y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,


random_state=42)

# Train linear regression model

model = LinearRegression()

model.fit(X_train, y_train)

# Predict prices on test set

y_pred = model.predict(X_test)

# Plot predicted vs actual house prices

plt.scatter(y_test, y_pred)

plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)],


color='red') # 45-degree line

plt.title('Predicted vs Actual House Prices')


plt.xlabel('Actual Prices')

plt.ylabel('Predicted Prices')

plt.show()

This exercise demonstrates how to use synthetic data for a machine learning
task. Synthetic data is useful for testing models when real data isn't
available or is limited.
First, we generate features like size, bedrooms, and distance_from_city.
These features are created using NumPy's random functions to simulate
real-world values. The target variable, price, is calculated using a simple
linear formula with some added noise to simulate realistic data. The
relationship between the features and price includes positive effects (more
size and bedrooms increase price) and a negative effect (greater distance
from the city reduces price).
After generating the data, we split it into training and testing sets using
train_test_split. This ensures the model is trained on one portion of the
data and tested on another, so that overfitting shows up as a gap between
training and testing performance.
Next, we use a LinearRegression model, which fits a linear equation to the
data. The model learns the coefficients of the equation from the training
data.
Once trained, the model predicts house prices based on the test data. We then
plot the predicted prices against the actual prices. The 45-degree red line
on the plot represents perfect predictions, and the closer the points are to
this line, the better the model's performance.
Finally, this plot provides a visual assessment of how well the model
generalizes to unseen data, helping to understand the quality of the model's
predictions. This exercise highlights key steps in machine learning: data
generation, model training, and result visualization.
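One advantage of synthetic data is that the true relationship is known, so
the learned coefficients can be checked against it. The sketch below refits
the same model and prints each learned coefficient next to the value used in
the price formula, together with the R² score on the test set. The
dictionary true_coefficients simply restates the formula from the data
generation step; the comparison itself is an addition for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same synthetic data as in the exercise
np.random.seed(42)
size = np.random.randint(800, 4000, 500)
bedrooms = np.random.randint(1, 6, 500)
distance_from_city = np.random.uniform(1, 50, 500)
price = (size * 200 + bedrooms * 10000 - distance_from_city * 500
         + np.random.normal(0, 25000, 500))
df = pd.DataFrame({'size': size, 'bedrooms': bedrooms,
                   'distance_from_city': distance_from_city, 'price': price})

X = df[['size', 'bedrooms', 'distance_from_city']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Coefficients used in the price formula above
true_coefficients = {'size': 200, 'bedrooms': 10000, 'distance_from_city': -500}
learned = dict(zip(X.columns, model.coef_))
for name, true_value in true_coefficients.items():
    print(f"{name:20s} true={true_value:8.1f}  learned={learned[name]:10.1f}")

test_r2 = r2_score(y_test, model.predict(X_test))
print(f"Test R^2: {test_r2:.3f}")
```

The learned values will not match the formula exactly because of the added
noise, and features with a small effect relative to that noise (here,
distance_from_city) are estimated less precisely than dominant ones such as
size.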
【Trivia】
Did you know that synthetic data generation is often used in fields like
autonomous driving? Self-driving car companies frequently generate
synthetic data to simulate various driving conditions that are too rare or
dangerous to collect in the real world.
11. Understanding the Impact of Evaluation
Metrics on Model Selection
Importance★★★★☆
Difficulty★★★☆☆
Imagine you are working for an e-commerce company that wants to improve its
product recommendation system. You have been tasked with evaluating two
machine learning models to predict user purchases based on user data.
However, you are unsure which evaluation metric (accuracy, precision, or
recall) should guide your decision.
Your goal is to visualize how different evaluation metrics impact model
selection. For simplicity, generate synthetic classification data and compare
two models: a logistic regression model and a decision tree classifier.
Create a plot that shows how these models perform using the three different
metrics.
【Data Generation Code Example】
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=42)

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
【Diagram Answer】

【Code Answer】
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Creating synthetic data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Training logistic regression and decision tree models
model1 = LogisticRegression()
model2 = DecisionTreeClassifier(random_state=42)  # fixed seed for reproducible results
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)

# Making predictions
y_pred1 = model1.predict(X_test)
y_pred2 = model2.predict(X_test)

# Calculating metrics for the logistic regression model
accuracy1 = accuracy_score(y_test, y_pred1)
precision1 = precision_score(y_test, y_pred1)
recall1 = recall_score(y_test, y_pred1)

# Calculating metrics for the decision tree model
accuracy2 = accuracy_score(y_test, y_pred2)
precision2 = precision_score(y_test, y_pred2)
recall2 = recall_score(y_test, y_pred2)

# Preparing data for plotting
metrics = ['Accuracy', 'Precision', 'Recall']
logistic_scores = [accuracy1, precision1, recall1]
tree_scores = [accuracy2, precision2, recall2]

# Plotting the comparison
plt.figure(figsize=(8, 6))
plt.plot(metrics, logistic_scores, label='Logistic Regression', marker='o')
plt.plot(metrics, tree_scores, label='Decision Tree', marker='s')
plt.title('Model Performance Comparison')
plt.ylabel('Score')
plt.legend()
plt.grid(True)
plt.show()

In this exercise, the task is to evaluate two machine learning models,
Logistic Regression and Decision Tree, based on three different metrics:
accuracy, precision, and recall. First, we generate synthetic binary
classification data using make_classification. This function creates a dataset
suitable for binary classification tasks. We then split the data into training
and test sets using train_test_split so that we evaluate model performance
on unseen data rather than on the data used for fitting. Next, we train two
different models: a logistic regression model and a decision tree
classifier. Both models are trained on the same dataset, and their predictions
are evaluated using three metrics: accuracy, precision, and recall. Accuracy
measures the overall correctness of the model by calculating the ratio of
correctly predicted instances to the total instances. Precision is the share
of instances predicted as positive that are actually positive. Recall is the
share of actual positive instances that the model identifies. We
then visualize the results by plotting the performance of both models across
these metrics. The plot shows how the ranking of the two models can depend
on the metric and provides insights into which metric might
be more appropriate depending on the business goal. For instance, in some
cases, recall might be more important if missing positive instances is costly.
In others, precision might matter more to avoid false positives. This
visualization aids in making an informed decision on which model to select
based on the chosen evaluation metric.
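The three definitions above can be checked by hand on a tiny example. The sketch below uses ten made-up labels (they are an assumption for illustration, not data from the exercise), counts true and false positives and negatives, and compares the hand-computed values with scikit-learn.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for ten customers (1 = purchased, 0 = did not purchase)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the four possible outcomes by hand
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(y_true)   # correct predictions / all predictions
precision = tp / (tp + fp)           # correct positives / predicted positives
recall = tp / (tp + fn)              # correct positives / actual positives

print("By hand:     ", accuracy, precision, recall)
print("scikit-learn:", accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

Here the model is right on 8 of 10 customers, 3 of its 4 positive predictions are correct, and it finds 3 of the 4 actual buyers.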
【Trivia】
Did you know that precision and recall are often combined into a single
metric called the F1-score? This is a harmonic mean of precision and recall
and is especially useful when the class distribution is imbalanced!
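As a small illustration of the harmonic mean, the following sketch computes the F1-score by hand and compares it with f1_score from scikit-learn. The labels are made up for this example.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: the model finds only half of the positives,
# but every positive it predicts is correct
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # 2 / 2
recall = recall_score(y_true, y_pred)        # 2 / 4

# Harmonic mean of precision and recall
f1_by_hand = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print("F1 by hand:     ", f1_by_hand)
print("F1 scikit-learn:", f1_score(y_true, y_pred))
print("Arithmetic mean:", arithmetic_mean)
```

The harmonic mean is pulled toward the lower of the two values, so a model cannot reach a high F1-score by doing well on only one of precision or recall.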
12. Outlier Detection Techniques in Customer Data
Importance★★★★☆
Difficulty★★★☆☆
You are working for a retail company that collects customer purchase
data. The management has noticed some irregular patterns in the data, and
they suspect that certain records could be outliers that might distort their
analysis. Your task is to help them detect these outliers using machine
learning techniques. You will use synthetic data representing customer
spending and the number of transactions. After generating this dataset, apply
the Isolation Forest algorithm to identify potential outliers and visualize the
results.
【Data Generation Code Example】
import numpy as np
import pandas as pd

# Generate random customer data
np.random.seed(42)
data = np.random.normal(loc=[1000, 50], scale=[200, 10], size=(200, 2))
outliers = np.random.uniform(low=[3000, 5], high=[5000, 100], size=(10, 2))
data = np.vstack([data, outliers])
columns = ['Spending', 'Transactions']
df = pd.DataFrame(data, columns=columns)
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Generate customer data with outliers
np.random.seed(42)
data = np.random.normal(loc=[1000, 50], scale=[200, 10], size=(200, 2))
outliers = np.random.uniform(low=[3000, 5], high=[5000, 100], size=(10, 2))
data = np.vstack([data, outliers])
df = pd.DataFrame(data, columns=['Spending', 'Transactions'])

# Apply Isolation Forest to detect outliers
# (random_state makes the detected set reproducible between runs)
clf = IsolationForest(contamination=0.05, random_state=42)
df['outlier'] = clf.fit_predict(df[['Spending', 'Transactions']])

# Separate inliers and outliers for plotting
inliers = df[df['outlier'] == 1]
outliers = df[df['outlier'] == -1]

# Plot the results
plt.scatter(inliers['Spending'], inliers['Transactions'], label='Inliers',
            color='blue', alpha=0.6)
plt.scatter(outliers['Spending'], outliers['Transactions'], label='Outliers',
            color='red', alpha=0.6)
plt.title('Outlier Detection using Isolation Forest')
plt.xlabel('Customer Spending')
plt.ylabel('Number of Transactions')
plt.legend()
plt.show()

This exercise demonstrates how to detect outliers in a dataset using the
Isolation Forest algorithm. Outliers are data points that deviate significantly
from other points in the dataset. In real-world scenarios, outliers can
indicate fraudulent activities, unusual customer behavior, or data errors. In
the code, we first generate synthetic customer data, where most customers
spend around $1000 and make around 50 transactions, but we also add a
few outliers with much higher spending and unusual transaction
numbers. The IsolationForest class is a popular method for detecting
outliers. It works by isolating anomalies with random splits instead of
modeling the normal data distribution: points that can be separated from the
rest in only a few splits are likely to be anomalies. We set the contamination
parameter to 0.05, meaning we assume that 5% of the data are outliers. After
training the model, the fit_predict method assigns a label to each data point:
‣ 1: Inlier
‣ -1: Outlier
We then use matplotlib to visualize the inliers and outliers, plotting
spending on the x-axis and the number of transactions on the y-axis. In the
plot, inliers are shown in blue, while outliers are shown in red. This
visualization helps to clearly understand which customers might have
abnormal behavior that requires further investigation.
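Besides the 1 / -1 labels, a fitted IsolationForest also exposes a continuous anomaly score through decision_function. The sketch below is a minimal illustration under assumed data: 200 typical customers plus one made-up extreme customer. Lower scores mean "easier to isolate", and negative scores correspond to the label -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# 200 typical customers plus one hypothetical, clearly abnormal customer
normal_points = rng.normal(loc=[1000, 50], scale=[200, 10], size=(200, 2))
extreme_point = np.array([[5000.0, 5.0]])
data = np.vstack([normal_points, extreme_point])

clf = IsolationForest(contamination=0.05, random_state=42).fit(data)

scores = clf.decision_function(data)  # lower = more anomalous, negative = outlier
labels = clf.predict(data)            # 1 = inlier, -1 = outlier

print("Score of the extreme customer:", scores[-1])
print("Median score of all customers:", np.median(scores))
print("Label of the extreme customer:", labels[-1])
```

Sorting records by this score is one way to hand the most suspicious customers to an analyst first, instead of relying only on the fixed contamination threshold.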
【Trivia】
The Isolation Forest algorithm is particularly efficient on large datasets
because its time complexity grows linearly with the number of samples.
Unlike other techniques, it does not require prior knowledge of the data
distribution, making it versatile across different domains.
13. Visualizing Cross-Validation Folds in Machine
Learning
Importance★★★★★
Difficulty★★★☆☆
A company is trying to optimize its machine learning model to predict
customer churn. They want to visualize how their data is split across
different folds during cross-validation to better understand the impact of
training/testing data selection. As a data scientist, your task is to create a
Python script that visualizes the folds of K-fold cross-validation in a
classification problem. Use the Iris dataset provided by scikit-learn as a
stand-in for the company's data, perform K-fold cross-validation (with 5
splits), and generate a 2D plot for each fold that shows how the dataset is
being divided. Each plot should distinguish the data points belonging to the
training set from those belonging to the test set for that fold.
【Data Generation Code Example】
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
import numpy as np

# Load Iris dataset and generate feature/input data and labels
iris = load_iris()
X = iris.data
y = iris.target

# Create a KFold object with 5 splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)


【Diagram Answer】

【Code Answer】
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
import numpy as np

# Load Iris dataset and generate feature/input data and labels
iris = load_iris()
X = iris.data
y = iris.target

# Create a KFold object with 5 splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Plot each fold's training and test data points
fold_number = 1
for train_index, test_index in kf.split(X):
    plt.figure()
    plt.title(f'Fold {fold_number} - Train/Test Split')
    train_data, test_data = X[train_index], X[test_index]

    # Visualize the training set points in blue
    plt.scatter(train_data[:, 0], train_data[:, 1], color='blue',
                label='Train set')

    # Visualize the test set points in red
    plt.scatter(test_data[:, 0], test_data[:, 1], color='red', label='Test set')

    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.show()
    fold_number += 1

In this exercise, we are performing K-fold cross-validation to split the
dataset into multiple training and testing sets. The KFold class from
scikit-learn helps in dividing the dataset into K different subsets, or
"folds." In this case, we use 5 folds, meaning the dataset is split into 5
equal parts, and a model would be trained 5 times, each time using a different
fold as the test set and the rest as the training set. The purpose of K-fold
cross-validation is to estimate how well the model generalizes to unseen data
by testing it on multiple test sets, so that the estimate does not depend on
one particular portion of the data. By
visualizing the train and test sets across different folds, we can better
understand how the data is split and distributed during cross-validation. The
plot is generated for each fold, where blue points represent the training set
and red points represent the test set for that fold. The x and y axes represent
the first two features of the dataset. K-fold cross-validation is widely used in
machine learning, especially when the dataset is small, because it provides a
more reliable estimate of model performance than a simple train-test split.
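The exercise only visualizes the splits. As a complementary sketch, the code below passes the same kind of KFold object to cross_val_score, so that a model is actually trained and scored once per fold. The choice of LogisticRegression with max_iter=1000 is an assumption made for illustration; any classifier could be used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Each of the 150 samples lands in exactly one test fold of 30 samples
fold_sizes = [len(test_index) for _, test_index in kf.split(X)]
all_test_indices = np.sort(np.concatenate([t for _, t in kf.split(X)]))

# Train and score the model once per fold (accuracy by default for classifiers)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=kf)

print("Test fold sizes:", fold_sizes)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

Reporting the mean together with the spread of the per-fold scores gives a better picture than a single train-test split.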
【Trivia】
K-fold cross-validation can be extended to different types of data and
models. For instance, in time-series data, we use "TimeSeriesSplit" instead
of KFold to account for the temporal order of the data, preventing data
leakage from future to past.
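A minimal sketch of TimeSeriesSplit on twelve time-ordered samples (a made-up toy series) shows the difference from KFold: every training set contains only observations that come before the corresponding test set.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve hypothetical observations, already sorted by time
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for number, (train, test) in enumerate(splits, start=1):
    print(f"Split {number}: train={train} test={test}")
```

The training window grows with each split, while the test window always lies in the "future" relative to it.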
14. Visualizing Clustering Results with Centroids in
Python
Importance★★★★☆
Difficulty★★★☆☆
A marketing company wants to group its customers into clusters based on
their purchasing patterns. They believe this will help them target each
cluster with tailored marketing strategies. You have been hired to analyze
customer purchasing data and provide visual insights by clustering the
customers and identifying the centroids of these clusters. Generate random
customer data (features) with 2 dimensions (representing two product
categories). Then, perform K-means clustering to group customers into 3
clusters. Finally, visualize the results with a scatter plot and display the
centroids of each cluster.
【Data Generation Code Example】
import numpy as np
from sklearn.datasets import make_blobs

# Create random customer data (two features for product categories)
X, _ = make_blobs(n_samples=200, centers=3, n_features=2,
                  random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Create random customer data (two features for product categories)
X, _ = make_blobs(n_samples=200, centers=3, n_features=2,
                  random_state=42)

# Perform K-means clustering with 3 clusters
# (n_init and random_state are set explicitly so results are reproducible)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Extract the centroids
centroids = kmeans.cluster_centers_

# Create a scatter plot of the clustered data
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

# Mark the centroids on the plot
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, marker='x')
plt.title('Customer Clustering with Centroids')
plt.xlabel('Product Category 1')
plt.ylabel('Product Category 2')
plt.show()

K-means clustering is a commonly used unsupervised machine learning
algorithm that partitions data into clusters based on similarity.
In this exercise, we have generated random customer data with two
features, each representing the purchasing behavior in different product
categories.
The make_blobs function from sklearn.datasets is used to generate synthetic
data points grouped around three centers, representing clusters of
customers.
K-means clustering works by assigning each data point to one of the
specified number of clusters (in this case, 3), based on minimizing the
distance between points and the centroid of the cluster.
The KMeans class from sklearn.cluster is used to perform this clustering.
We fit the model to the data using kmeans.fit(X), which identifies the
cluster assignments and the centroids.
The kmeans.predict(X) method returns the cluster label for each data point.
To visualize the clustering result, we use matplotlib to create a scatter plot.
Data points are color-coded based on the cluster to which they belong, and
the centroids of the clusters are marked with red 'x' markers.
This visualization allows us to see how customers are grouped, and where
the central points (centroids) of these groups lie. The centroids are
particularly important because they can represent typical or "average"
customers in each cluster.
This type of clustering is useful in marketing, as it helps identify distinct
groups of customers with similar behavior, allowing targeted marketing
strategies.
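The assignment rule described above (each point belongs to its nearest centroid) can be verified directly with NumPy. The sketch below refits the same kind of model, computes the distance from every point to every centroid by hand, and compares the result with kmeans.predict. The n_init and random_state values are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
centroids = kmeans.cluster_centers_

# Euclidean distance from every point to every centroid, shape (200, 3)
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# Assign each point to the index of its closest centroid
nearest = distances.argmin(axis=1)

# Sum of squared distances to the closest centroid (the quantity K-means minimizes)
inertia_by_hand = (distances.min(axis=1) ** 2).sum()

print("Matches kmeans.predict:", np.array_equal(nearest, kmeans.predict(X)))
print("Inertia by hand:", inertia_by_hand)
print("kmeans.inertia_:", kmeans.inertia_)
```

The hand-computed sum of squared distances corresponds to the inertia_ attribute, which is the value often plotted against the number of clusters when choosing K.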
【Trivia】
K-means clustering assumes that clusters are roughly spherical and evenly
sized. If the data contains elongated or irregularly shaped clusters, K-means
might not perform well. Alternatives such as DBSCAN or Gaussian
Mixture Models (GMM) are often better suited for such scenarios.
15. Visualizing Data Drift in Machine Learning
Models with Feature Comparison
Importance★★★★☆
Difficulty★★★☆☆
You are working for a retail company, and your machine learning model
predicts the demand for a product based on various factors such as price,
stock levels, and recent sales trends. However, you suspect that the data
distribution may have changed over time, causing data drift, which could
affect the model's accuracy. Your task is to visualize and detect data drift
between two datasets, one representing the old data (used during model
training) and the other representing new incoming data. You need to:
‣ Generate two datasets representing product demand based on three features:
'price', 'stock', and 'sales_trend'. One dataset is for old data, and another is
for new data with slightly different distributions.
‣ Create visualizations that compare the distributions of the features in the
old and new data to check for potential drift.
【Data Generation Code Example】
import numpy as np
import pandas as pd

np.random.seed(42)  # fixed seed so the generated data is reproducible

# Create old data
old_data = pd.DataFrame({
    'price': np.random.normal(50, 10, 500),
    'stock': np.random.normal(200, 30, 500),
    'sales_trend': np.random.normal(100, 20, 500)
})

# Create new data with slight distribution shifts
new_data = pd.DataFrame({
    'price': np.random.normal(55, 12, 500),
    'stock': np.random.normal(210, 35, 500),
    'sales_trend': np.random.normal(90, 25, 500)
})
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)  # fixed seed so the figure is reproducible

# Generate old data
old_data = pd.DataFrame({
    'price': np.random.normal(50, 10, 500),
    'stock': np.random.normal(200, 30, 500),
    'sales_trend': np.random.normal(100, 20, 500)
})

# Generate new data
new_data = pd.DataFrame({
    'price': np.random.normal(55, 12, 500),
    'stock': np.random.normal(210, 35, 500),
    'sales_trend': np.random.normal(90, 25, 500)
})

# Create subplots to compare feature distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
titles = {'price': 'Price Comparison',
          'stock': 'Stock Comparison',
          'sales_trend': 'Sales Trend Comparison'}

# Overlay the old and new histograms for each feature
for ax, (feature, title) in zip(axes, titles.items()):
    ax.hist(old_data[feature], bins=20, alpha=0.5, label='Old Data',
            color='blue')
    ax.hist(new_data[feature], bins=20, alpha=0.5, label='New Data',
            color='red')
    ax.set_title(title)
    ax.legend()

# Display the plot
plt.tight_layout()
plt.show()

In this exercise, you are visualizing data drift between two datasets to detect
any changes in feature distributions. Data drift refers to changes in data
patterns over time, which can reduce the performance of machine learning
models. Visualizing these changes is an important step in maintaining
model accuracy.
The two datasets are created to represent old data used for model training
and new data with slight changes in distributions. The 'price', 'stock', and
'sales_trend' features are generated using random normal distributions. In
the new data, the mean and standard deviation of the features are slightly
shifted to simulate real-world drift.
Histograms are used to visualize the distributions of the features in the old
and new datasets. The goal is to detect whether there are significant shifts in
these distributions. For each feature ('price', 'stock', and 'sales_trend'), you
can compare the overlap between the old and new data histograms to
visually inspect any drift. Large shifts in distributions indicate potential data
drift, which might require model retraining or adjustments to maintain
prediction accuracy.
This approach helps you understand how data drift affects your model and
provides a visual tool to monitor the health of your machine learning
system. Monitoring for data drift regularly is crucial for maintaining the
performance and reliability of your models in a changing data environment.
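Visual inspection can be complemented with a statistical test. The sketch below applies the two-sample Kolmogorov-Smirnov test from SciPy to the 'price' feature. Using ks_2samp is an assumption of this example (the exercise itself relies only on histograms), and the "control" sample drawn from the old distribution is added purely for comparison.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

old_price = rng.normal(50, 10, 500)      # distribution used for training
new_price = rng.normal(55, 12, 500)      # shifted distribution (drift)
control_price = rng.normal(50, 10, 500)  # fresh sample without drift

# The KS statistic is the largest gap between the two empirical CDFs
drift_result = ks_2samp(old_price, new_price)
control_result = ks_2samp(old_price, control_price)

print("Old vs new:     statistic=%.3f, p-value=%.4g"
      % (drift_result.statistic, drift_result.pvalue))
print("Old vs control: statistic=%.3f, p-value=%.4g"
      % (control_result.statistic, control_result.pvalue))
```

A small p-value for the old-versus-new comparison suggests that the two samples do not come from the same distribution, which supports what the overlapping histograms show.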
【Trivia】
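Data drift is closely related to two other terms used in the machine
learning community. "Covariate shift" is a change in the distribution of the
input features, which is the kind of drift simulated in this exercise.
"Concept drift" is a change in the relationship between the features and the
target. Models trained on old data often lose accuracy when deployed in a
production environment where the underlying data distribution changes over
time. Data drift detection tools help mitigate this risk by alerting when the
model performance may degrade due to changing data.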
16. Visualizing the Impact of Feature Engineering
on Classification Model Performance
Importance★★★★☆
Difficulty★★★☆☆
A retail company is trying to improve its customer segmentation model,
which classifies customers based on their spending habits. Currently, they
have basic data on customer demographics and purchase frequency.
However, they believe that creating new features such as average
transaction value or total purchases per month may improve the model's
performance. Your task is to create a sample dataset, engineer relevant
features, and compare the performance of a classification model before and
after feature engineering. Output a graph that shows the accuracy of the
model with and without the newly engineered features.
【Data Generation Code Example】
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

np.random.seed(42)  # fixed seed so the purchase columns are reproducible

# Generate sample data with make_classification
X, y = make_classification(n_samples=500, n_features=5,
                           random_state=42)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(5)])
df['target'] = y

# Create artificial purchase data
df['monthly_purchases'] = np.random.randint(1, 20, df.shape[0])
df['avg_transaction_value'] = np.random.uniform(10, 100, df.shape[0])


【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

np.random.seed(42)  # fixed seed so the purchase columns are reproducible

# Generate sample data
X, y = make_classification(n_samples=500, n_features=5,
                           random_state=42)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(5)])
df['target'] = y
df['monthly_purchases'] = np.random.randint(1, 20, df.shape[0])
df['avg_transaction_value'] = np.random.uniform(10, 100, df.shape[0])

# Perform feature engineering
df['total_spent'] = df['monthly_purchases'] * df['avg_transaction_value']

# Split once so that both models are evaluated on the same test rows
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='target'), df['target'], test_size=0.2, random_state=42)

# Train and evaluate a model without feature engineering (five raw features)
base_cols = [f'feature_{i}' for i in range(5)]
clf_before = RandomForestClassifier(random_state=42)
clf_before.fit(X_train[base_cols], y_train)
accuracy_before = accuracy_score(y_test, clf_before.predict(X_test[base_cols]))

# Train and evaluate a model with the purchase columns and total_spent added
clf_after = RandomForestClassifier(random_state=42)
clf_after.fit(X_train, y_train)
accuracy_after = accuracy_score(y_test, clf_after.predict(X_test))

# Plot the results
plt.bar(['Without Feature Engineering', 'With Feature Engineering'],
        [accuracy_before, accuracy_after])
plt.ylabel('Accuracy')
plt.title('Impact of Feature Engineering on Model Performance')
plt.show()

In this problem, we are tasked with measuring how feature engineering
affects the performance of a machine learning model. The dataset is
artificially generated using make_classification, which simulates a binary
classification problem. The initial dataset has five features, which represent
basic characteristics of the customers. After creating the dataset, we split it
into training and testing sets to train and evaluate our models. First, we train
a RandomForestClassifier using only the five raw features.
We calculate the accuracy of this model to establish a baseline for
comparison. Next, we apply feature engineering. We introduce a new feature
called total_spent, which is the product of monthly_purchases and
avg_transaction_value. In real customer data, such a feature can capture an
aspect of customer behavior that is not represented by the original
columns. After creating this new feature, we retrain the model on all columns,
including monthly_purchases, avg_transaction_value, and total_spent, and
evaluate its performance again. Finally, we plot a bar chart to compare the
accuracy of the two models. Note that in this synthetic example the purchase
columns are random numbers generated independently of the target, so the
second bar is not guaranteed to be higher than the first; with real data, the
same comparison shows whether an engineered feature actually helps. Feature
engineering is a critical part of improving machine learning models, as it can
reveal hidden patterns in the data and provide the model with more
informative inputs. This exercise demonstrates how to measure its effect on
the performance of machine learning models.
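To see a case where an engineered feature clearly helps, consider a made-up dataset in which the label depends only on the product of two inputs. This is an illustrative assumption, not the data from the exercise. A linear model cannot separate the classes from the raw inputs, but it can once the product is supplied as a feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two hypothetical raw features; the label depends only on their product
a = rng.normal(size=1000)
b = rng.normal(size=1000)
y = (a * b > 0).astype(int)

X_raw = np.column_stack([a, b])          # raw features only
X_eng = np.column_stack([a, b, a * b])   # raw features plus engineered product

# Use the same train/test rows for both feature sets
idx_train, idx_test = train_test_split(np.arange(1000), test_size=0.2,
                                       random_state=0)

def holdout_accuracy(X):
    """Fit on the training rows and return accuracy on the test rows."""
    model = LogisticRegression().fit(X[idx_train], y[idx_train])
    return model.score(X[idx_test], y[idx_test])

acc_raw = holdout_accuracy(X_raw)
acc_eng = holdout_accuracy(X_eng)

print("Accuracy with raw features:       ", acc_raw)
print("Accuracy with engineered feature: ", acc_eng)
```

The gain comes from the engineered column carrying information about the target that the model could not construct on its own, which is the condition under which feature engineering pays off.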
【Trivia】
Feature engineering is often considered the most important aspect of
machine learning because models are only as good as the data they are
trained on. Even with a simple algorithm, properly engineered features can
significantly boost model performance.
17. Creating a Custom Metric and Visualizing
Model Performance
Importance★★★☆☆
Difficulty★★★☆☆
A retail company wants to predict future sales based on historical data
and measure the effectiveness of its new machine learning model. Your task
is to create a custom metric that represents the relative error of the model's
predictions, comparing them to the actual values. Afterward, visualize the
performance of the model using a plot that clearly shows both the predicted
and actual sales. For simplicity, the dataset will be generated within the code
itself, simulating actual sales data and the model's predictions. Focus on
calculating the custom metric and creating a graph that compares
predictions to actual data.
【Data Generation Code Example】
import numpy as np

# Simulate 30 days of actual sales and noisy model predictions
np.random.seed(0)
actual_sales = np.random.randint(50, 200, size=30)
predicted_sales = actual_sales + np.random.normal(0, 20, size=30)


【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt

# Simulate 30 days of actual sales and noisy model predictions
np.random.seed(0)
actual_sales = np.random.randint(50, 200, size=30)
predicted_sales = actual_sales + np.random.normal(0, 20, size=30)

# Calculate the custom metric: relative error
relative_error = np.abs((actual_sales - predicted_sales) / actual_sales)
mean_relative_error = np.mean(relative_error)
print("Mean Relative Error:", mean_relative_error)

# Plotting the actual vs predicted sales
plt.figure(figsize=(10, 6))
plt.plot(actual_sales, label='Actual Sales', marker='o')
plt.plot(predicted_sales, label='Predicted Sales', marker='x')
plt.xlabel('Day')
plt.ylabel('Sales')
plt.title('Actual vs Predicted Sales')
plt.legend()
plt.grid(True)
plt.show()

In this task, you measure the performance of a sales prediction model
using a custom metric. First, the data generation code creates
two sets of sales data: actual sales and predicted sales. These values are
simulated using random numbers to mimic real-world sales data. We then
create a custom metric called "relative error," which measures the
difference between the actual and predicted sales as a fraction of the
actual sales. This metric helps us understand how far off the predictions are
from the true values. The relative error for each day is calculated as the
absolute difference between the actual and predicted values, divided by the
actual value. We then compute the mean relative error to get an overall
picture of model accuracy. After computing the metric, we plot the sales
data using matplotlib. The plt.plot() function is used to create a line plot,
where the actual sales and predicted sales are represented with different
markers. The plt.legend() function is added to label the lines, and the plot is
enhanced with gridlines using plt.grid(True). This visualization allows you
to clearly see how well the model performs by comparing the predicted and
actual sales.
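The metric can also be wrapped in a reusable function. The sketch below is one possible packaging; the function name and the explicit check for zero actual values are assumptions added for illustration. It is followed by a worked example with three made-up days.

```python
import numpy as np

def mean_relative_error(actual, predicted):
    """Mean of |actual - predicted| / |actual| over all observations."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if np.any(actual == 0):
        # The relative error is undefined when the actual value is zero
        raise ValueError("actual values must be non-zero")
    return float(np.mean(np.abs((actual - predicted) / actual)))

# Worked example with three hypothetical days
actual = [100, 200, 50]
predicted = [110, 180, 50]

# Per-day relative errors: 10/100 = 0.10, 20/200 = 0.10, 0/50 = 0.00
mre = mean_relative_error(actual, predicted)
print("Mean Relative Error:", mre)
```

The first two days have the same relative error even though their absolute errors differ (10 versus 20 units), which is what makes the metric comparable across sales volumes.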
【Trivia】
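The relative error metric is widely used when evaluating model
performance because it is scale-independent, meaning it is useful whether
you are predicting small or large numbers. Averaged over all predictions and
expressed as a percentage, it is known as the mean absolute percentage
error (MAPE). It is undefined when an actual value is zero.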
18. Visualizing Class Imbalance in Binary
Classification with Python
Importance★★★★☆
Difficulty★★★☆☆
You are working with a credit card company that is dealing with fraudulent
transactions. The company has provided you with a dataset where
fraudulent transactions make up a very small portion of the total data,
leading to an imbalanced dataset. Your task is to visualize the imbalance of
the dataset using a bar chart and then fit a simple machine learning model to
classify fraudulent and non-fraudulent transactions. Create a synthetic
dataset representing transactions, where the class '1' represents fraud and
the class '0' represents non-fraud. Generate 1000 data points where only 5%
of the transactions are fraudulent. After visualizing the imbalance, fit a
logistic regression model to this dataset, and calculate the accuracy of the
model.
【Data Generation Code Example】
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate synthetic data
np.random.seed(42)
n_samples = 1000
fraud_ratio = 0.05
data = np.random.rand(n_samples, 2)

# The first 5% of the samples are labeled as fraud (1), the rest as non-fraud (0)
labels = np.array([1 if i < int(n_samples * fraud_ratio) else 0
                   for i in range(n_samples)])
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate synthetic data
np.random.seed(42)
n_samples = 1000
fraud_ratio = 0.05
data = np.random.rand(n_samples, 2)

# The first 5% of the samples are labeled as fraud (1), the rest as non-fraud (0)
labels = np.array([1 if i < int(n_samples * fraud_ratio) else 0
                   for i in range(n_samples)])

# Visualize class imbalance
classes, counts = np.unique(labels, return_counts=True)
plt.bar(classes, counts, color=['blue', 'red'])
plt.xticks([0, 1], ['Non-Fraud', 'Fraud'])
plt.title('Class Imbalance in the Dataset')
plt.ylabel('Number of Transactions')
plt.show()

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, labels,
                                                    test_size=0.2,
                                                    random_state=42)

# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In this exercise, the goal is to simulate and visualize an imbalanced dataset,
where one class (fraudulent transactions) is much less frequent than the
other (non-fraudulent transactions). To begin, the data is created
synthetically using numpy, where 1000 data points represent transaction
information, and only 5% of these transactions are marked as fraudulent.
We set the fraud ratio to 0.05 to simulate the imbalance. Next, we visualize
the imbalance using a simple bar chart. By counting the number of
instances for each class (fraud and non-fraud), we can display this
information with matplotlib. This helps in understanding the magnitude of
the imbalance. After visualizing the dataset, we split the data into training
and testing sets using the train_test_split function from sklearn. We then
train a LogisticRegression model, which is a common model for binary
classification tasks. The model is trained using the training data, and its
predictions are made on the test data. Finally, the accuracy of the model is
calculated using accuracy_score, which provides a basic measure of how
well the model has performed. However, it is important to note that
accuracy is not always the best metric for imbalanced datasets. Because the
two features here are random and carry no information about the label, the
model is likely to predict the majority class for almost every transaction,
which still yields an accuracy close to 95%. Accuracy may therefore
give misleading results when the majority class dominates the data.
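The warning about accuracy can be made concrete with a baseline that ignores its inputs. The sketch below uses scikit-learn's DummyClassifier with the most_frequent strategy, which is an assumption of this example and not part of the exercise, on labels with the same 5% fraud ratio.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 1000 transactions, of which 50 (5%) are fraudulent
labels = np.array([1] * 50 + [0] * 950)
features = np.zeros((1000, 1))  # placeholder inputs; the baseline ignores them

# Baseline that always predicts the most frequent class (non-fraud)
baseline = DummyClassifier(strategy='most_frequent').fit(features, labels)
predictions = baseline.predict(features)

accuracy = accuracy_score(labels, predictions)  # share of correct predictions
recall = recall_score(labels, predictions)      # share of fraud cases detected

print("Accuracy:", accuracy)
print("Recall on the fraud class:", recall)
```

The baseline reaches 95% accuracy while detecting none of the fraudulent transactions, which is why recall, precision, or the F1-score are usually reported alongside accuracy on imbalanced data.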
【Trivia】
Did you know? Despite its name, logistic regression is used as a
classification algorithm! It fits a linear model to the log-odds of class
membership and uses the logistic function to turn that value into a
probability.
19. Visualizing Predictions vs Actual Values Using
Linear Regression
Importance★★★★☆
Difficulty★★★☆☆
A real estate company wants to predict house prices based on historical
data. You are tasked with creating a model that predicts house prices using
features like square footage, number of bedrooms, and age of the
house. Visualize the predicted vs actual house prices using linear
regression. Generate the data within the code and display a plot that
compares predicted and actual values to see the model’s performance.
【Data Generation Code Example】
import numpy as np
import pandas as pd

np.random.seed(42)

# Three features: square footage, number of bedrooms, age
X = np.random.rand(100, 3) * 100
true_coef = [300, 500, -100]

# Prices with some noise
y = X @ true_coef + np.random.randn(100) * 500
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Generate synthetic data
np.random.seed(42)

# Three features: square footage, number of bedrooms, age
X = np.random.rand(100, 3) * 100
true_coef = [300, 500, -100]

# Prices with some noise
y = X @ true_coef + np.random.randn(100) * 500

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Visualize predicted vs actual values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Predicted vs Actual House Prices')

# Diagonal line marking perfect predictions
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)],
         color='red')
plt.show()

This exercise demonstrates how to build and visualize a linear regression
model to predict house prices.
We begin by generating random input data representing square footage,
number of bedrooms, and age of the houses.
The house prices are calculated by multiplying the features with the actual
coefficients, adding some random noise to simulate real-world data.
After creating the data, we split it into training and testing sets using
train_test_split to ensure the model can be evaluated on unseen data.
A linear regression model is then trained using LinearRegression(). This
model learns the relationship between the input features and house prices
based on the training set.
Once trained, the model predicts the house prices for the test data. The
predicted values are then compared to the actual values by plotting them on
a scatter plot.
In the plot, the x-axis represents the actual house prices, and the y-axis
represents the predicted prices. A red diagonal line is added to indicate the
ideal scenario where predicted prices exactly match the actual prices. Points
on the red line are perfect predictions, points above it are overestimates,
and points below it are underestimates.
By visualizing the data this way, it is easy to see how well the model
performs and whether it overestimates or underestimates house prices.
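What the scatter plot shows visually can also be summarized in numbers. The following is a minimal sketch that reuses the data and model from the exercise; the residual counts and the use of r2_score are additions for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same synthetic data as in the exercise
np.random.seed(42)
X = np.random.rand(100, 3) * 100
true_coef = [300, 500, -100]
y = X @ true_coef + np.random.randn(100) * 500

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Residual > 0 means the prediction is above the actual price (overestimate)
residuals = y_pred - y_test
n_over = int((residuals > 0).sum())
n_under = int((residuals < 0).sum())
r2 = r2_score(y_test, y_pred)

print("Overestimated:", n_over, "Underestimated:", n_under)
print("R^2 on the test set:", round(r2, 3))
print("Learned coefficients:", model.coef_)
```

Because the noise is small compared with the spread of the prices, the learned coefficients should land close to the true values of 300, 500 and -100.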
【Trivia】
Linear regression is one of the simplest and most widely used machine
learning models. However, it assumes a linear relationship between input
features and the target variable, which may not always be the case in real-
world scenarios.
20. Visualizing Feature Scaling Techniques for
Machine Learning
Importance★★★★☆
Difficulty★★★☆☆
You are working for a customer who wants to understand how scaling
affects machine learning algorithms. Your task is to visualize the impact of
feature scaling techniques, such as MinMaxScaler and StandardScaler, on a
dataset that contains two features: height and weight.
First, create a dataset with 50 random samples of height and weight.
Then, apply both scaling techniques and visualize the results in a scatter
plot for comparison. Ensure that you plot the original data along with the
scaled data using both techniques.
【Data Generation Code Example】
import numpy as np

import pandas as pd

heights = np.random.randint(150, 200, 50)


weights = np.random.randint(50, 100, 50)

data = pd.DataFrame({'height': heights, 'weight': weights})


【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, StandardScaler

heights = np.random.randint(150, 200, 50)


weights = np.random.randint(50, 100, 50)

data = pd.DataFrame({'height': heights, 'weight': weights})

scaler_minmax = MinMaxScaler()

scaler_standard = StandardScaler()

data_minmax = scaler_minmax.fit_transform(data)
data_standard = scaler_standard.fit_transform(data)

plt.figure(figsize=(10, 6))

plt.subplot(1, 3, 1)

plt.scatter(data['height'], data['weight'], color='blue', label='Original')

plt.title('Original Data')

plt.xlabel('Height')

plt.ylabel('Weight')
plt.legend()

plt.subplot(1, 3, 2)

plt.scatter(data_minmax[:, 0], data_minmax[:, 1], color='green',
            label='MinMax Scaled')

plt.title('MinMax Scaled Data')

plt.xlabel('Height')

plt.ylabel('Weight')

plt.legend()

plt.subplot(1, 3, 3)

plt.scatter(data_standard[:, 0], data_standard[:, 1], color='red',
            label='Standard Scaled')

plt.title('Standard Scaled Data')

plt.xlabel('Height')

plt.ylabel('Weight')

plt.legend()
plt.tight_layout()

plt.show()

In machine learning, feature scaling brings all features to a similar range,
which matters for algorithms that rely on distances or gradient-based
optimization.
There are two common techniques used for feature scaling: MinMaxScaler and
StandardScaler.
MinMaxScaler transforms features by scaling them to a given range, by default
0 to 1. This is done by subtracting the minimum value of the feature and
dividing by the range (max - min).
StandardScaler standardizes features by subtracting the mean and dividing by
the standard deviation, so that each feature has zero mean and unit variance.
This method is commonly used when features follow a roughly Gaussian
distribution.
In this example, you generate a dataset containing height and weight data and
apply both scaling techniques to visualize how the data changes.
After scaling, you plot the original, MinMax scaled, and Standard scaled data
side by side using scatter plots.
Both scalers are linear transformations, so the shape of the point cloud stays
the same and only the axis values change. Comparing the axes across the three
panels shows how each technique changes the scale of your features.
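The two formulas described above can be written out by hand and compared with the scikit-learn scalers. This is a minimal sketch; the random generator and its seed are assumptions added only to make the run reproducible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
heights = rng.integers(150, 200, 50)
weights = rng.integers(50, 100, 50)
data = np.column_stack([heights, weights]).astype(float)

# Min-max scaling: (x - min) / (max - min), column by column
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual_minmax = (data - col_min) / (col_max - col_min)

# Standardization: (x - mean) / std, using the population std (ddof=0)
manual_standard = (data - data.mean(axis=0)) / data.std(axis=0)

same_minmax = np.allclose(manual_minmax, MinMaxScaler().fit_transform(data))
same_standard = np.allclose(manual_standard, StandardScaler().fit_transform(data))

print("Manual min-max matches MinMaxScaler:", same_minmax)
print("Manual standardization matches StandardScaler:", same_standard)
```

Note that StandardScaler divides by the population standard deviation, which is also what NumPy's std computes by default.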
【Trivia】
Did you know that algorithms like KNN and SVM are particularly sensitive
to feature scaling? Without scaling, features with larger ranges dominate the
distance calculations, leading to biased predictions.
21. Creating a Simple Recommender System
Visualization Based on Customer Preferences
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to develop a basic recommendation system to
suggest products to customers based on their preferences. You are given a
list of 10 customers and their ratings for 5 products. The goal is to create a
simple recommender system using collaborative filtering techniques to
suggest products to a specific customer based on the preferences of other
similar customers. Visualize the similarity between customers using a
heatmap.
Your task is to:
‣ Create a dataset of customer ratings.
‣ Build a simple recommender system using cosine similarity.
‣ Visualize the customer similarity using a heatmap.
‣ Use this similarity data to suggest products for a given customer.
【Data Generation Code Example】
import pandas as pd

import numpy as np

#Create random customer ratings data

customers = ['Customer_' + str(i) for i in range(1, 11)]

products = ['Product_' + str(i) for i in range(1, 6)]

np.random.seed(0)

ratings = np.random.randint(1, 6, size=(10, 5))

#Create DataFrame

df = pd.DataFrame(ratings, index=customers, columns=products)

df
【Diagram Answer】

【Code Answer】
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.metrics.pairwise import cosine_similarity

#Create random customer ratings data again


customers = ['Customer_' + str(i) for i in range(1, 11)]

products = ['Product_' + str(i) for i in range(1, 6)]

np.random.seed(0)

ratings = np.random.randint(1, 6, size=(10, 5))

df = pd.DataFrame(ratings, index=customers, columns=products)

#Calculate cosine similarity between customers

similarity_matrix = cosine_similarity(df)

similarity_df = pd.DataFrame(similarity_matrix, index=customers,
                             columns=customers)

#Plot heatmap to visualize customer similarity


plt.figure(figsize=(8, 6))

sns.heatmap(similarity_df, annot=True, cmap='coolwarm')

plt.title('Customer Similarity Heatmap')


plt.show()

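#Rank the other customers by similarity to the target customer

target_customer = 'Customer_1'

#Drop the target itself, which always has similarity 1 with itself
similar_customers = (similarity_df[target_customer]
                     .drop(target_customer)
                     .sort_values(ascending=False))

similar_customers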

In this exercise, we are building a basic recommender system that utilizes
collaborative filtering based on cosine similarity between customer
preferences.
Collaborative filtering works by identifying users with similar preferences
and recommending items based on what those similar users liked. It is widely
used in recommendation systems like those seen in e-commerce and streaming
platforms.
Here’s a step-by-step explanation of the process:
‣ First, a dataset of customer ratings for 5 products is randomly
generated. This simulates the customer preferences for various products.
‣ Cosine similarity is used to calculate the similarity between customers.
It measures the cosine of the angle between two vectors (in this case, the
rating vectors). A value of 1 means the two rating vectors point in the same
direction, that is, the customers rate the products in the same proportions,
while a value closer to 0 suggests that the customers are dissimilar. Because
all ratings here are positive (1 to 5), every similarity is greater than 0.
‣ The cosine similarity matrix is then visualized using a heatmap, which
makes it easy to see which customers have similar product preferences.
‣ Finally, based on the similarity scores, we identify the most similar
customers to a given target customer (e.g., 'Customer_1'). This is useful in
real-world scenarios because it allows businesses to recommend products that
similar customers have liked.
By performing these steps, we can gain insights into customer behavior and
use them for product recommendations. For a dataset of this size the
computation is cheap: the similarity matrix is only 10 by 10.
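The last step, turning similarity scores into product suggestions, can be sketched as a similarity-weighted average of the neighbours' ratings. The neighbourhood size k and the weighting scheme are assumptions made for this illustration; they are one common choice, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Same ratings as in the exercise
customers = ['Customer_' + str(i) for i in range(1, 11)]
products = ['Product_' + str(i) for i in range(1, 6)]
np.random.seed(0)
ratings = np.random.randint(1, 6, size=(10, 5))
df = pd.DataFrame(ratings, index=customers, columns=products)

similarity_df = pd.DataFrame(cosine_similarity(df), index=customers,
                             columns=customers)

target_customer = 'Customer_1'
k = 3  # hypothetical neighbourhood size

# The k most similar customers, excluding the target itself
neighbours = (similarity_df[target_customer]
              .drop(target_customer)
              .nlargest(k))

# Similarity-weighted average rating of each product among the neighbours
weights = neighbours / neighbours.sum()
predicted = df.loc[neighbours.index].mul(weights, axis=0).sum(axis=0)

# Rank products by predicted rating, highest first
suggestions = predicted.sort_values(ascending=False)
print(suggestions)
```

In this synthetic dataset every customer has rated every product, so the ranking only illustrates the mechanics. In a real system the suggestions would normally be restricted to products the target customer has not rated yet.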
【Trivia】
Collaborative filtering is a cornerstone of recommendation systems. It
gained popularity in the mid-2000s when Netflix held a competition called
the "Netflix Prize" to improve their recommendation algorithm using this
technique.
22. Evaluating Clustering Performance Using
Davies-Bouldin Index
Importance★★★★☆
Difficulty★★★☆☆
A retail company is segmenting its customers to create personalized
marketing strategies. They want to evaluate the quality of their customer
segmentation model, which was built using a clustering algorithm.
Your task is to use K-means clustering to segment the customers into
different clusters, then evaluate the clustering performance using the
Davies-Bouldin Index.
Create a synthetic dataset that simulates customer data (e.g., annual
income and spending score). Use K-means to cluster the customers and
visualize the result. Finally, calculate the Davies-Bouldin Index to assess
the model's quality.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_blobs

#Generate synthetic customer data with two features (income and spending score)

data, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
                     random_state=0)
【Diagram Answer】

【Code Answer】
import numpy as np

from sklearn.datasets import make_blobs

from sklearn.cluster import KMeans

from sklearn.metrics import davies_bouldin_score

import matplotlib.pyplot as plt

#Generate synthetic customer data

data, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
                     random_state=0)

#Apply K-means clustering (random_state fixed so the result is reproducible)

kmeans = KMeans(n_clusters=4, random_state=0)

labels = kmeans.fit_predict(data)

#Calculate the Davies-Bouldin Index

db_index = davies_bouldin_score(data, labels)

print("Davies-Bouldin Index:", db_index)

#Visualize the clusters

plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red', label='Centroids')

plt.title('Customer Clustering with K-means')

plt.xlabel('Annual Income')

plt.ylabel('Spending Score')

plt.legend()

plt.show()

The Davies-Bouldin Index (DBI) is a metric used to evaluate the quality of
a clustering model. For each cluster, it takes the largest ratio of combined
within-cluster dispersion to between-centroid distance with respect to any
other cluster, and then averages these values over all clusters. A lower DBI
indicates better clustering performance, as it reflects more compact clusters
that are better separated.
The problem first generates a synthetic dataset representing customer data
using make_blobs. This function creates data points around specified centers,
simulating real-world scenarios like customer segmentation. After generating
the data, K-means clustering is applied using KMeans from the scikit-learn
library.
K-means is an unsupervised machine learning algorithm that divides the data
into a predefined number of clusters by minimizing the squared distance
between data points and their corresponding cluster centers. The algorithm
iteratively assigns each data point to the nearest cluster center and adjusts
the center positions.
To assess the model, the Davies-Bouldin Index is calculated using the
davies_bouldin_score function from scikit-learn. This score helps in
understanding how well-separated the clusters are, with a lower score
indicating better cluster separation.
Finally, the clustering result is visualized using a scatter plot, where each
customer (data point) is colored based on its assigned cluster. The cluster
centroids are also shown in red, highlighting the centers calculated by the
K-means algorithm.
This exercise teaches the importance of evaluating clustering models using
metrics like the Davies-Bouldin Index, which provides insights into how well
the clustering model performs.
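To make the definition of the index concrete, the following sketch computes it by hand and compares the result with davies_bouldin_score. It assumes Euclidean distances and cluster centroids taken as the mean of each cluster's points; n_init and random_state are set only so that the run is reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

data, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
                     random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(data)

cluster_ids = np.unique(labels)
# Centroid of each cluster (mean of its points)
centroids = np.array([data[labels == c].mean(axis=0) for c in cluster_ids])
# Dispersion: mean distance from each point to its own centroid
dispersion = np.array([
    np.linalg.norm(data[labels == c] - centroids[i], axis=1).mean()
    for i, c in enumerate(cluster_ids)
])

ratios = []
for i in range(len(cluster_ids)):
    # Largest ratio between cluster i and any other cluster j
    worst = max(
        (dispersion[i] + dispersion[j])
        / np.linalg.norm(centroids[i] - centroids[j])
        for j in range(len(cluster_ids)) if j != i
    )
    ratios.append(worst)

manual_dbi = float(np.mean(ratios))
sklearn_dbi = davies_bouldin_score(data, labels)

print("Manual DBI: ", manual_dbi)
print("sklearn DBI:", sklearn_dbi)
```

Both values should agree, which shows that the index rewards small within-cluster distances and large distances between centroids.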
【Trivia】
The Davies-Bouldin Index is named after David L. Davies and Donald W.
Bouldin, who introduced this metric in 1979. It is a popular evaluation
measure because it considers both the compactness and separation of
clusters, making it widely used in the evaluation of clustering algorithms.
23. Visualizing Data Pipeline Workflow with
Machine Learning Integration
Importance★★★★☆
Difficulty★★★☆☆
You are tasked with helping a retail company understand the purchasing
patterns of their customers. The company has a dataset containing the
number of purchases made by customers across different product categories
over the past year.
Your goal is to visualize these purchasing patterns, cluster the customers
based on their buying behaviors using a machine learning model (K-Means
clustering), and then plot the clusters to provide insights for personalized
marketing strategies.
Generate synthetic data with 100 customers and 3 product categories
('electronics', 'groceries', 'fashion'), train a K-Means model, and visualize
the clustered data using a scatter plot. Each point should represent a
customer, with the clusters clearly marked.
【Data Generation Code Example】
import numpy as np

np.random.seed(42)

## Generate random data for 100 customers and 3 product categories

data = np.random.randint(1, 100, size=(100, 3))

customers = ['Customer_'+str(i) for i in range(1, 101)]

categories = ['electronics', 'groceries', 'fashion']


【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans


np.random.seed(42)

## Generate random data for 100 customers and 3 product categories

data = np.random.randint(1, 100, size=(100, 3))


customers = ['Customer_'+str(i) for i in range(1, 101)]

categories = ['electronics', 'groceries', 'fashion']

## Apply K-Means clustering

kmeans = KMeans(n_clusters=3)

labels = kmeans.fit_predict(data)

## Plot the data with clusters

plt.figure(figsize=(8,6))

plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis', label='Clusters')

plt.xlabel('Electronics Purchases')

plt.ylabel('Groceries Purchases')

plt.title('Customer Clusters Based on Purchase Patterns')

plt.colorbar()
plt.show()

In this exercise, we used a K-Means clustering model to group customers
based on their purchasing patterns across three product categories.
The KMeans algorithm is one of the most popular unsupervised machine
learning methods used for clustering.
It works by partitioning data points into a predefined number of clusters (in
this case, 3). Each customer is assigned to one of these clusters based on
their similarity to other customers in terms of purchase data.
To begin, we generated synthetic data for 100 customers, each with values
representing purchases in three categories: electronics, groceries, and
fashion.
The clustering algorithm was applied to this dataset to form 3 groups, after
which the results were visualized using a scatter plot.
On the scatter plot, each point corresponds to a customer, and the different
colors indicate the cluster to which they belong. This visualization helps in
identifying patterns and differences in customer behavior.
The x-axis and y-axis represent purchases made in electronics and
groceries, respectively, while the clusters are differentiated by color.
The third category, fashion, is used for clustering but is not shown in this
two-dimensional plot.
The goal of this task is not only to visualize the clusters but to understand
how the K-Means algorithm can be used to discover meaningful groupings
within customer data, which can lead to targeted marketing strategies based
on the identified buying patterns.
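Since the scatter plot shows only two of the three categories, a per-cluster summary table can help when interpreting the groups. The sketch below reuses the data from the exercise and puts it into a pandas DataFrame; the DataFrame layout, n_init and random_state are assumptions added for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

np.random.seed(42)
data = np.random.randint(1, 100, size=(100, 3))
customers = ['Customer_' + str(i) for i in range(1, 101)]
categories = ['electronics', 'groceries', 'fashion']

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(data)

# One row per customer, one column per category, plus the assigned cluster
df = pd.DataFrame(data, index=customers, columns=categories)
df['cluster'] = labels

# Average purchases per category in each cluster
profile = df.groupby('cluster')[categories].mean()
# Number of customers in each cluster
sizes = df['cluster'].value_counts().sort_index()

print(profile.round(1))
print(sizes)
```

Each row of the profile table describes a typical customer of one cluster across all three categories. Keep in mind that the purchases here are uniformly random, so the groups reflect how K-Means partitions the space rather than real customer segments.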
【Trivia】
K-Means clustering works best when the clusters are well-separated, roughly
spherical, and of similar size.
If the clusters overlap significantly or if there are outliers, the algorithm
may struggle to assign data points to the correct cluster.
24. Visualizing Feature Interactions in a Predictive
Model
Importance★★★☆☆
Difficulty★★★☆☆
A retail company is trying to predict the sales of a new product based on
different features such as price, advertising budget, and store location.
They suspect that certain features interact in ways that impact the
prediction.
Your task is to create a model that visualizes the interaction between the
features "price" and "advertising budget." Use a decision tree regressor and
plot a 3D surface plot to visualize how these two features interact to
influence predicted sales.
Create a dataset that simulates different combinations of price and
advertising budget, then build the regressor model. Finally, produce a 3D
plot of the feature interaction to visualize how the prediction changes as
price and advertising budget vary.
【Data Generation Code Example】
import numpy as np

from sklearn.tree import DecisionTreeRegressor

## Simulate data for the problem

X_price = np.linspace(10, 100, 50)

X_advertising = np.linspace(1000, 5000, 50)

X_price, X_advertising = np.meshgrid(X_price, X_advertising)

X = np.c_[X_price.ravel(), X_advertising.ravel()]

y = (50 + 0.5 * X[:, 0] - 0.02 * X[:, 1] + 0.001 * X[:, 0] * X[:, 1]
     + np.random.randn(*X[:, 0].shape) * 5)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeRegressor

from mpl_toolkits.mplot3d import Axes3D


## Generate dataset

X_price = np.linspace(10, 100, 50)

X_advertising = np.linspace(1000, 5000, 50)

X_price, X_advertising = np.meshgrid(X_price, X_advertising)

X = np.c_[X_price.ravel(), X_advertising.ravel()]
y = (50 + 0.5 * X[:, 0] - 0.02 * X[:, 1] + 0.001 * X[:, 0] * X[:, 1]
     + np.random.randn(*X[:, 0].shape) * 5)

## Train the Decision Tree Regressor

model = DecisionTreeRegressor()

model.fit(X, y)

## Predict sales

y_pred = model.predict(X)

y_pred = y_pred.reshape(X_price.shape)

## Plot the 3D surface to visualize feature interactions

fig = plt.figure()

ax = fig.add_subplot(111, projection='3d')

ax.plot_surface(X_price, X_advertising, y_pred, cmap='viridis')

ax.set_xlabel('Price')
ax.set_ylabel('Advertising Budget')

ax.set_zlabel('Predicted Sales')

plt.show()

In this exercise, the goal is to understand how two features, "price" and
"advertising budget," interact to affect sales predictions.
A decision tree regressor is used to predict sales based on these features.
It learns from the training data and can capture non-linear relationships
and interactions between the features.
We start by generating a dataset that simulates possible combinations of
price and advertising budget values, built as a grid with numpy.meshgrid and
stored in the feature matrix X. The target variable y is generated from a
simple formula that includes a small interaction term (price multiplied by
advertising budget) plus random noise to reflect realistic data.
The model is trained using the fit() method, which learns the mapping from
the input features to the target variable (sales). Once the model is trained,
predictions are made over the same input grid using the predict() method.
Because DecisionTreeRegressor() is used with its default settings, the tree
can grow until it reproduces the training targets, so the plotted surface
also contains the noise that was added to y.
To visualize the interaction between price and advertising budget, we use a
3D surface plot. It shows how predicted sales change as both the price and
the advertising budget vary. The x-axis represents price, the y-axis
represents advertising budget, and the z-axis (height of the surface)
represents predicted sales.
This visualization helps in understanding the model's prediction behavior in
a more intuitive way, making it easier to analyze the combined effect of
price and advertising budget on predicted sales.
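An interaction means that the effect of one feature depends on the level of the other. As a rough numerical check, the sketch below compares how much predicted sales rise from the lowest to the highest price at a low and at a high advertising budget. The helper price_effect is a hypothetical name introduced here, and the seed is added only for reproducibility. From the generating formula, the expected difference is about 135 at a budget of 1000 and about 495 at a budget of 5000, before noise.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(0)  # seed added here only for reproducibility
price_values = np.linspace(10, 100, 50)
adv_values = np.linspace(1000, 5000, 50)
price, advertising = np.meshgrid(price_values, adv_values)
X = np.c_[price.ravel(), advertising.ravel()]
y = (50 + 0.5 * X[:, 0] - 0.02 * X[:, 1] + 0.001 * X[:, 0] * X[:, 1]
     + np.random.randn(len(X)) * 5)

model = DecisionTreeRegressor(random_state=0).fit(X, y)


def price_effect(adv):
    """Predicted sales at the highest price minus predicted sales at the lowest price."""
    low = model.predict([[price_values[0], adv]])[0]
    high = model.predict([[price_values[-1], adv]])[0]
    return high - low


effect_low_adv = price_effect(adv_values[0])    # advertising budget = 1000
effect_high_adv = price_effect(adv_values[-1])  # advertising budget = 5000

print("Price effect at low advertising budget: ", round(effect_low_adv, 1))
print("Price effect at high advertising budget:", round(effect_high_adv, 1))
```

If the two features did not interact, both differences would be roughly equal. The larger difference at the high budget is the interaction that the 3D surface shows as a twist in the plane.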
【Trivia】
Did you know that decision trees are not just used for regression tasks, but
also for classification? They work by splitting the data into smaller subsets
based on feature values, creating a tree structure where each internal node
represents a decision rule based on the features. This makes them flexible
and interpretable models, although they can overfit if their depth is not
limited or they are not pruned.
25. Visualizing the Impact of Regularization
Techniques on Overfitting in a Machine Learning
Model
Importance★★★★☆
Difficulty★★★☆☆
A company is developing a machine learning model to predict house prices
based on multiple features (e.g., size, number of rooms, location, etc.).
They are experiencing overfitting issues with their model, meaning it
performs well on the training data but poorly on unseen data.
Your task is to help the company understand how regularization techniques
(L1, L2) can improve generalization and reduce overfitting.
Create a regression model using synthetic data and visualize the impact of L1
(Lasso) and L2 (Ridge) regularization on the coefficients of the features.
Your output should include a graph showing the effect of regularization on
the learned coefficients as the strength of regularization increases.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_regression

## Generate synthetic regression data

X, y = make_regression(n_samples=100, n_features=5, noise=10,
                       random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import Lasso, Ridge

from sklearn.datasets import make_regression

## Create synthetic regression data

X, y = make_regression(n_samples=100, n_features=5, noise=10,
                       random_state=42)

## Define different regularization strengths


alphas = np.logspace(-4, 1, 50)

## Initialize arrays to store coefficients for both Lasso and Ridge

lasso_coefs = [Lasso(alpha=a).fit(X, y).coef_ for a in alphas]

ridge_coefs = [Ridge(alpha=a).fit(X, y).coef_ for a in alphas]


## Convert lists to numpy arrays for easy plotting

lasso_coefs = np.array(lasso_coefs)
ridge_coefs = np.array(ridge_coefs)

## Plot the coefficients as a function of regularization strength

plt.figure(figsize=(12, 6))

## Plot Lasso regularization

plt.subplot(1, 2, 1)

for i in range(X.shape[1]):
    plt.plot(alphas, lasso_coefs[:, i], label=f'Feature {i}')

plt.xscale('log')

plt.xlabel('Alpha (Regularization Strength)')

plt.ylabel('Coefficients')

plt.title('Lasso Regularization (L1)')

plt.legend()

## Plot Ridge regularization

plt.subplot(1, 2, 2)

for i in range(X.shape[1]):
    plt.plot(alphas, ridge_coefs[:, i], label=f'Feature {i}')

plt.xscale('log')

plt.xlabel('Alpha (Regularization Strength)')

plt.ylabel('Coefficients')
plt.title('Ridge Regularization (L2)')

plt.legend()

plt.tight_layout()

plt.show()

Regularization is a key technique used to prevent overfitting in machine
learning models.
Overfitting happens when a model learns not only the underlying pattern of
the data but also noise and random fluctuations. This causes the model to
perform poorly on new, unseen data.
In machine learning, two common regularization techniques are L1
regularization (Lasso) and L2 regularization (Ridge).
L1 regularization, or Lasso, adds a penalty proportional to the sum of the
absolute values of the coefficients to the cost function. This can reduce
some of the feature coefficients to exactly zero, effectively selecting a
simpler model by excluding some features entirely.
L2 regularization, or Ridge, penalizes the sum of the squared coefficients.
This generally does not drive coefficients to exactly zero but instead
shrinks them continuously, leading to smaller values.
In the provided code, synthetic regression data is generated with five
features. Lasso and Ridge models are then trained with different values of
the regularization parameter alpha, which controls the strength of
regularization. As alpha increases, the impact of regularization becomes
stronger, shrinking the coefficients more aggressively.
In the graph, you can see how the coefficients of the features change as the
regularization strength increases for both Lasso and Ridge.
Lasso can be used for feature selection, as it tends to push some
coefficients to zero. Ridge, on the other hand, is better suited for
scenarios where all features should be retained but with smaller
coefficients.
By visualizing the change in coefficients, we gain insight into how each
regularization technique limits the model’s complexity, which is how it
helps reduce overfitting.
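The difference between the two penalties is easiest to see when some features carry no signal. The sketch below uses a hypothetical variant of the exercise data with 10 features, of which only 3 are informative, and an assumed regularization strength of 5 for both models. The two libraries' penalties are scaled differently, so the same alpha is not equally strong for Lasso and Ridge; the comparison is only about which coefficients become exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Hypothetical variant of the exercise data: 10 features, only 3 informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10, random_state=42)

alpha = 5.0  # assumed regularization strength, applied to both models
lasso = Lasso(alpha=alpha).fit(X, y)
ridge = Ridge(alpha=alpha).fit(X, y)

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Exactly zero -> Lasso:", n_zero_lasso, " Ridge:", n_zero_ridge)
```

Lasso removes uninformative features by setting their coefficients to zero, while Ridge keeps all of them with small non-zero values.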
【Trivia】
Did you know that Lasso was first introduced by Robert Tibshirani in 1996
as a method for both regularization and variable selection? Meanwhile,
Ridge regression dates back even further to 1970, introduced by Arthur
Hoerl and Robert Kennard to deal with multicollinearity in regression
models.
26. Ensemble Learning: Visualizing Decision
Boundaries in Bagging and Boosting Methods
Importance★★★★☆
Difficulty★★★☆☆
You are a data scientist at an insurance company, and your task is to build a
reliable model for predicting whether a client will subscribe to a new
insurance product. Your goal is to test two ensemble methods, Bagging and
Boosting, on the same dataset and visualize how each method draws decision
boundaries for classification. The client has requested a visual
explanation, so you need to generate a figure that shows how these two
methods handle classification. Use synthetic data to simulate the problem,
where the two classes are separable but with some overlap.
Create a synthetic dataset using the make_classification function from
sklearn with 2 informative features and 2 classes. Then, apply Bagging with a
DecisionTreeClassifier and Boosting with AdaBoostClassifier on this data.
Finally, generate a plot showing the decision boundaries for both
classifiers.
Your task:
‣ Create the dataset using make_classification.
‣ Train both classifiers.
‣ Visualize the decision boundaries of both Bagging and Boosting side by
side in one figure for comparison.
【Data Generation Code Example】
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

from matplotlib.colors import ListedColormap

X, y = make_classification(n_samples=300, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)

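# Create Bagging and Boosting classifiers

clf_bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=42)

# Boosting combines weak learners, so the base tree is limited to depth 1;
# a fully grown tree already fits the training data and leaves nothing to boost
clf_boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                               n_estimators=50, random_state=42)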

# Train both models

clf_bag.fit(X, y)

clf_boost.fit(X, y)

# Define function to plot decision boundaries

def plot_decision_boundary(model, X, y, ax, title):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.4, cmap=ListedColormap(('red', 'blue')))
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o', s=30,
               cmap=ListedColormap(('red', 'blue')))
    ax.set_title(title)

# Create the plot

fig, axs = plt.subplots(1, 2, figsize=(14, 6))
plot_decision_boundary(clf_bag, X, y, axs[0], "Bagging Decision Boundary")
plot_decision_boundary(clf_boost, X, y, axs[1], "Boosting Decision Boundary")
plt.show()

Ensemble methods are techniques that combine multiple machine learning
models to create a more powerful predictive model. Two commonly used
ensemble techniques are Bagging and Boosting.
Bagging (Bootstrap Aggregating) involves training several independent models
(usually Decision Trees) on different bootstrap samples of the training data,
then combining their predictions by voting or averaging. This method reduces
variance and helps prevent overfitting.
Boosting, on the other hand, is a sequential method where each model focuses
on the errors made by previous models, gradually improving performance by
"boosting" the weak learners.
In this exercise, the BaggingClassifier creates an ensemble of Decision
Trees, each trained on a random sample of the data. Bagging works well for
reducing variance in high-variance models like Decision Trees, giving more
stable predictions.
The AdaBoostClassifier is used for Boosting. This method adds new models
iteratively, with each one giving more weight to the samples that the
previous ones misclassified. Boosting typically improves the model's
accuracy, though it may increase the risk of overfitting if not controlled
properly.
The decision boundaries generated by Bagging are usually smoother because
they come from the combined decisions of many models. Boosting decision
boundaries can be more complex since it focuses on correcting individual
misclassifications, leading to potentially tighter boundaries around
difficult-to-classify instances.
The plot visualizes these boundaries, showing how each method handles the
same classification problem differently: Bagging tends to produce more
stable decision regions, while Boosting adapts more closely to the data's
specifics, potentially capturing subtle details at the risk of overfitting.
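Decision boundaries show how each ensemble behaves on the training data, but not how well it generalizes. A minimal sketch of such a check uses 5-fold cross-validation on the same synthetic data. The single tree as a reference point and the depth-1 base learner for AdaBoost are assumptions of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=42)

models = {
    'Single tree': DecisionTreeClassifier(random_state=42),
    'Bagging': BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 random_state=42),
    # Depth-1 trees ("stumps") as weak learners are an assumption of this sketch
    'Boosting': AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=50, random_state=42),
}

# 5-fold cross-validated accuracy: each model is scored on data it was not fitted on
cv_scores = {name: cross_val_score(model, X, y, cv=5).mean()
             for name, model in models.items()}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Comparing the ensembles with the single tree gives a rough idea of how much each method helps on this particular dataset; the ranking can change with other data or settings.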
【Trivia】
The term "ensemble" is also used in music, where multiple instruments play
together to create a richer sound than a solo performance. Similarly,
ensemble methods combine multiple models to create a stronger prediction
than a single model.
27. Visualizing ROC AUC for Multi-Class
Classification Using Python
Importance★★★★☆
Difficulty★★★☆☆
A company wants to implement a multi-class classification model to predict
the likelihood of a customer belonging to one of three different customer
segments.
You are asked to evaluate the performance of the model by plotting the ROC
curve for each class and calculating the ROC AUC score for the multi-class
problem.
Use Python to generate synthetic data, build a multi-class classification
model using Logistic Regression, and visualize the ROC curves for each class.
Make sure the visualization includes separate ROC curves for each class and
the overall macro-average AUC score.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split


# # Generate synthetic multi-class data

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,


n_informative=3, random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=3, random_state=42)

# Binarize the output (one 0/1 indicator column per class)
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Train a One-vs-Rest Logistic Regression classifier
classifier = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

# Compute ROC curve and ROC AUC for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Plot ROC curve for each class
plt.figure()
for i in range(n_classes):
    plt.plot(fpr[i], tpr[i],
             label='Class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))

# Plot settings (the dashed diagonal is the random-classifier baseline)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Multi-Class Classification')
plt.legend(loc='lower right')
plt.show()

In this exercise, the main goal is to evaluate a multi-class classification
model using the ROC AUC metric, which is defined for binary
classification. For multi-class problems, we can extend ROC AUC with a
One-vs-Rest strategy, in which each class is scored separately against all
of the remaining classes. The OneVsRestClassifier class wraps Logistic
Regression so that one binary classifier is fitted per class.
We start by generating synthetic data using the make_classification
function, which creates a dataset suitable for multi-class classification.
We then binarize the labels using label_binarize, which turns each label
into a row of 0/1 indicators with one column per class (one-hot encoding).
This step is necessary for calculating the ROC curve per class.
The Logistic Regression model is fitted using a one-vs-rest scheme, and we
predict the probabilities for the test set using predict_proba. Next, the
ROC curve for each class is computed using roc_curve, and the AUC score is
calculated for each ROC curve using auc.
Finally, we plot the ROC curve of every class on one figure, together with
the dashed diagonal that a random classifier would produce. This
visualization helps in understanding the model’s performance across
different classes. The ROC curve for each class indicates the trade-off
between the true positive rate and the false positive rate for that specific
class.
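The task statement also asks for an overall macro-average AUC. A minimal sketch of one way to obtain it, assuming the same data and classifier as above: average the per-class AUC values without weighting, then compare the result with scikit-learn's roc_auc_score using average='macro'. The variable names per_class_auc, macro_auc and macro_auc_sklearn are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

# Same synthetic data and one-vs-rest setup as the code answer
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=3, random_state=42)
y_bin = label_binarize(y, classes=[0, 1, 2])
X_train, X_test, y_train, y_test = train_test_split(
    X, y_bin, test_size=0.3, random_state=42)

clf = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
y_score = clf.fit(X_train, y_train).predict_proba(X_test)

# Per-class AUC from each one-vs-rest ROC curve
per_class_auc = []
for i in range(y_bin.shape[1]):
    fpr, tpr, _ = roc_curve(y_test[:, i], y_score[:, i])
    per_class_auc.append(auc(fpr, tpr))

# Macro-average: unweighted mean of the per-class AUC values
macro_auc = float(np.mean(per_class_auc))

# Cross-check against scikit-learn's built-in macro averaging
macro_auc_sklearn = roc_auc_score(y_test, y_score, average='macro')

print('Per-class AUC:', [round(a, 3) for a in per_class_auc])
print(f'Macro-average AUC: {macro_auc:.3f}')
```

The macro value could be added to the plot title or legend. Because every class counts equally, a weak result on a small class lowers the macro-average as much as a weak result on a large one.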
【Trivia】
The ROC curve, originally developed for signal detection theory, is used in
various domains, including medical decision-making and machine learning.
28. Visualizing Word Embeddings Using PCA
Importance★★★★☆
Difficulty★★★☆☆
A startup company has developed an AI-powered text analysis tool, and
they need to visualize how different words in customer reviews are related
to one another in terms of meaning. Your task is to use word embeddings to
show the relationships between words by visualizing their distribution in a
2D space using Principal Component Analysis (PCA).
In this exercise, generate a set of word embeddings using a simple
pre-defined corpus, then apply PCA to reduce the dimensionality of the
embeddings, and finally create a 2D plot to visualize the results. This
exercise will help the startup team understand which words are
semantically close to each other.
【Data Generation Code Example】
import numpy as np

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler

#### Create sample word embeddings

words = ['cat', 'dog', 'apple', 'orange', 'table', 'chair', 'car', 'bus']

embeddings = np.random.rand(len(words), 50)


【Diagram Answer】

【Code Answer】
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Generate sample data
words = ['cat', 'dog', 'apple', 'orange', 'table', 'chair', 'car', 'bus']
embeddings = np.random.rand(len(words), 50)

# Standardize the embeddings before PCA
scaler = StandardScaler()
embeddings_scaled = scaler.fit_transform(embeddings)

# Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings_scaled)

# Create a 2D plot
plt.figure(figsize=(10, 7))
for i, word in enumerate(words):
    plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1])
    plt.text(reduced_embeddings[i, 0] + 0.02, reduced_embeddings[i, 1] + 0.02,
             word)
plt.title('Word Embeddings Visualization with PCA')
plt.xlabel('PCA Dimension 1')
plt.ylabel('PCA Dimension 2')
plt.grid(True)
plt.show()

In this exercise, we start by creating a list of words and randomly
generating corresponding word embeddings. In a real-world scenario, word
embeddings are typically generated using models like Word2Vec or GloVe,
which map words to high-dimensional vectors based on their semantic
relationships. Here, we simplify this by using random embeddings for
demonstration purposes, so the positions in the resulting plot carry no
semantic meaning.
Next, we apply a technique called Principal Component Analysis (PCA),
which is commonly used for dimensionality reduction. The word embeddings
are 50-dimensional vectors, which cannot be plotted directly. PCA projects
the data onto the directions of greatest variance, preserving as much of
the spread between the points as a low-dimensional view allows. In this
case, we reduce the embeddings to two dimensions to plot them on a 2D
graph.
Before applying PCA, we standardize the word embeddings. Standardization
ensures that each feature of the embeddings has a mean of 0 and a standard
deviation of 1. This step matters because PCA is sensitive to the scale of
the data.
Finally, after applying PCA, we create a 2D scatter plot to visualize the
reduced embeddings. Each word is represented as a point in the 2D space,
and the proximity between points approximates the similarity of the words'
original embeddings. With embeddings from a trained model, this
visualization helps in identifying words that are close in meaning by
looking at how they cluster together.
The plot includes labels for each word, making it easy to interpret which
words are similar or different from one another in the embedding space. The
process highlights a fundamental technique for handling high-dimensional
data in machine learning and showcases how PCA can be used to make sense
of complex relationships between data points.
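Random embeddings have no group structure for PCA to reveal. The sketch below assumes a hypothetical structure instead: each consecutive pair of words shares a group center (animals, fruit, furniture, vehicles) plus a little noise. These are not trained embeddings, and the names group_of_word and centers are invented for illustration. Standardization is skipped because all features already share one scale. The sketch prints how much variance the two components retain and compares distances within and between groups in the 2D projection.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
words = ['cat', 'dog', 'apple', 'orange', 'table', 'chair', 'car', 'bus']

# Hypothetical structure: consecutive word pairs share a group center.
# This stands in for trained embeddings; it is not real semantic data.
group_of_word = np.repeat(np.arange(4), 2)  # [0, 0, 1, 1, 2, 2, 3, 3]
centers = rng.normal(size=(4, 50))
embeddings = centers[group_of_word] + 0.1 * rng.normal(size=(8, 50))

pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

# Fraction of the total variance retained by each of the two components
print('explained variance ratio:', pca.explained_variance_ratio_)

# Pairwise distances in the 2D projection
dists = np.linalg.norm(reduced[:, None, :] - reduced[None, :, :], axis=-1)
same_group = group_of_word[:, None] == group_of_word[None, :]
off_diagonal = ~np.eye(len(words), dtype=bool)

within = dists[same_group & off_diagonal].mean()
between = dists[~same_group].mean()
print(f'mean within-group distance:  {within:.2f}')
print(f'mean between-group distance: {between:.2f}')
```

When explained_variance_ratio_ sums to a small value, the 2D picture discards most of the structure, so distances in the plot deserve extra caution.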
【Trivia】
PCA is widely used not only for visualizing word embeddings but also in
fields like image compression and genomic data analysis due to its ability to
simplify data while preserving variance.
29. Visualizing Sequential Data using Recurrent
Neural Networks (RNNs)
Importance★★★★★
Difficulty★★★★☆
You are working for a logistics company that needs to predict daily demand
for its delivery services based on sequential data. Your task is to visualize
how an RNN model can be applied to time series data to forecast future
demand. The data consists of sequential daily demand values over 30 days,
which you will generate within the code.
The goal is to train a simple RNN to forecast the next day's demand based
on this historical data and plot both the training data and predictions.
Generate synthetic sequential data to represent daily demand and train an
RNN model on it. Visualize both the true values and predicted values over
time using a line graph.
【Data Generation Code Example】
import numpy as np

import matplotlib.pyplot as plt

## Generate synthetic sequential data representing daily demand

days = np.arange(1, 31)

demand = np.sin(days / 2) + np.random.normal(0, 0.1, 30)


【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

# Generate synthetic sequential data representing daily demand
days = np.arange(1, 31)
demand = np.sin(days / 2) + np.random.normal(0, 0.1, 30)

# Prepare data for RNN: reshape it to be [samples, time steps, features]
X = np.array([demand[i:i+5] for i in range(25)])  # 25 windows of 5 days
y = demand[5:30]                                  # the day after each window
X = X.reshape((X.shape[0], X.shape[1], 1))

# Build a simple RNN model
model = Sequential([SimpleRNN(10, activation='relu', input_shape=(5, 1)),
                    Dense(1)])

# Compile and train the model
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=200, verbose=0)

# Predict the day after each 5-day window using the trained model
predicted = model.predict(X)

# Plot true demand and predicted demand
plt.figure()
plt.plot(days[5:], demand[5:], label='True Demand')
plt.plot(days[5:], predicted, label='Predicted Demand', linestyle='--')
plt.xlabel('Day')
plt.ylabel('Demand')
plt.legend()
plt.title('RNN Prediction of Daily Demand')
plt.show()

In this exercise, we focus on using a Recurrent Neural Network (RNN) to
predict sequential data. RNNs are particularly useful for time series data,
where each input is dependent on previous ones. In this case, we are
predicting the next day's demand from the demand of the preceding days.
First, we generate synthetic demand data using a sine wave with added
random noise. The data simulates daily demand over 30 days. We split the
data into overlapping sequences of 5 days as input, and the goal is to
predict the 6th day's demand based on the previous 5 days. This yields 25
input sequences, which are reshaped to fit the required input format for an
RNN: [samples, time steps, features].
Next, we build a simple RNN model with one recurrent layer (SimpleRNN) and
a dense layer for prediction. The RNN layer processes the input data over
time steps, capturing the sequential dependencies. The model is trained
using the Adam optimizer and mean squared error (MSE) loss function for
regression purposes.
Once trained, the model predicts demand based on the input sequences, and
we visualize both the true demand and predicted demand using a line plot.
Because the predictions are made on the same sequences used for training,
the plot shows how well the RNN has fitted the sequential patterns in the
data rather than how it performs on unseen days.
This task demonstrates the core concept of RNNs for sequential prediction
in Python, emphasizing the importance of proper data preparation and
sequence modeling in real-world time series applications.
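The windowing step is where shape errors most often creep in, so it helps to run it on its own. The sketch below uses only NumPy and mirrors the 5-day window described above. The chronological hold-out at the end is an assumption added for illustration; the code answer trains and predicts on all 25 windows.

```python
import numpy as np

rng = np.random.default_rng(42)
days = np.arange(1, 31)
demand = np.sin(days / 2) + rng.normal(0, 0.1, 30)

window = 5  # number of past days used as input

# Each row of X holds `window` consecutive days; y holds the day that follows
X = np.array([demand[i:i + window] for i in range(len(demand) - window)])
y = demand[window:]

# Add a trailing feature axis: [samples, time steps, features]
X = X.reshape((X.shape[0], window, 1))
print(X.shape, y.shape)  # (25, 5, 1) (25,)

# Assumed chronological split: hold out the last 5 windows so that
# predictions can be checked on days the model never saw in training
X_train, y_train = X[:-5], y[:-5]
X_test, y_test = X[-5:], y[-5:]
print(X_train.shape, X_test.shape)  # (20, 5, 1) (5, 5, 1)
```

Splitting by time rather than at random keeps future days out of the training set, which is the usual precaution for time series.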
【Trivia】
Recurrent Neural Networks are commonly used in fields like natural
language processing, speech recognition, and financial time series
forecasting. While they capture sequential dependencies, they often struggle
with long-term dependencies due to the "vanishing gradient problem,"
which led to the development of more advanced architectures like LSTM
(Long Short-Term Memory) networks.
30. Visualizing the Impact of Activation Functions
on Neural Network Output
Importance★★★★★
Difficulty★★★☆☆
You are working with a client who is developing a neural network model
for classifying customer sentiment based on textual data. Before fine-tuning
the model, the client wants to understand how different activation functions
(ReLU, Sigmoid, and Tanh) impact the output of their network during the
training process.
Your task is to create a simple neural network, apply different activation
functions to it, and visualize their impact on the output layer. Create
random input data for the model, simulate the training, and plot the
activation functions' effects on the output layer.
【Data Generation Code Example】
import numpy as np

X = np.random.randn(100, 3)  # Create a dataset with 100 samples and 3 features
y = np.random.randint(0, 2, size=(100, 1))  # Generate binary labels (0 or 1)
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.activations import relu, sigmoid, tanh

X = np.random.randn(100, 3)  # Create random input data
y = np.random.randint(0, 2, size=(100, 1))  # Binary target labels (0 or 1)

# Create a simple neural network model with 1 hidden layer
def create_model(activation):
    model = Sequential([
        # Hidden layer with specified activation function
        Dense(5, input_dim=3, activation=activation),
        # Output layer with sigmoid activation for binary classification
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

# Train models with different activation functions and plot the outputs
activations = {'ReLU': relu, 'Sigmoid': sigmoid, 'Tanh': tanh}
outputs = []
for name, act in activations.items():
    model = create_model(act)
    model.fit(X, y, epochs=10, verbose=0)  # Train the model for 10 epochs
    outputs.append((name, model.predict(X)))  # Store the output for visualization

# Plot the output of each model
plt.figure(figsize=(10, 6))
for name, output in outputs:
    plt.plot(output, label=name)  # Plot each output
plt.title('Comparison of Activation Functions')
plt.xlabel('Sample Index')
plt.ylabel('Output Value')
plt.legend()
plt.show()

In this exercise, we are tasked with demonstrating how different activation
functions affect the output of a neural network. Activation functions are
crucial in defining how the output of each neuron behaves based on the
input values it receives. The most common activation functions used in
neural networks are ReLU (Rectified Linear Unit), Sigmoid, and Tanh.
In this task, we define a neural network model with one hidden layer and
one output layer. The hidden layer uses different activation functions
(ReLU, Sigmoid, Tanh), while the output layer uses a sigmoid activation to
ensure binary classification output.
‣ ReLU (Rectified Linear Unit): This function outputs the input directly if
it’s positive; otherwise, it outputs zero. It helps avoid the vanishing
gradient problem but can lead to “dead neurons” that output zero for every
input and stop learning.
‣ Sigmoid: This function maps any real number into the range between 0 and
1, making it well suited to the output layer in binary classification.
However, it suffers from the vanishing gradient problem when inputs are
very negative or very positive.
‣ Tanh (Hyperbolic Tangent): Similar to Sigmoid but with a range between -1
and 1. It is typically used in tasks where zero-centered outputs are
beneficial.
We initialize three models, each with a different activation function for
the hidden layer. We use a simple dataset of random numbers and binary
labels, representing a classification task with two classes (0 or 1). After
training each model for 10 epochs, we collect the model outputs and
visualize them using Matplotlib. This allows us to see how each activation
function shapes the output values after training.
By comparing the plotted outputs, the client can see how the choice of
hidden-layer activation changes the predictions. Because the output layer
is a sigmoid in every model, all plotted values lie between 0 and 1. The
differences arise inside the hidden layer, where ReLU tends to produce
sparse activations with many zeros, Sigmoid activations are bounded between
0 and 1, and Tanh activations are centered around 0.
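The three functions described above can also be evaluated directly, which makes their ranges visible without training a network. This is a minimal NumPy sketch; the function definitions follow the standard formulas and are written out here rather than imported from Keras.

```python
import numpy as np

def relu(z):
    # Passes positive values through, replaces negative values with 0
    return np.maximum(0.0, z)

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real number into the open interval (-1, 1), centered at 0
    return np.tanh(z)

z = np.linspace(-5, 5, 11)  # sample pre-activation values
for name, fn in [('ReLU', relu), ('Sigmoid', sigmoid), ('Tanh', tanh)]:
    out = fn(z)
    print(f'{name:8s} min={out.min():.3f} max={out.max():.3f}')
```

Plotting each function over the same z values (for example with plt.plot(z, fn(z))) gives the familiar activation curves and complements the output plot in the code answer.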
【Trivia】
Did you know that ReLU is one of the most popular activation functions in
deep learning models today? Despite its simplicity, it often leads to better
performance than Sigmoid or Tanh in deep networks because its gradient
does not shrink for positive inputs during backpropagation!
31. Visualizing Forecasting Models Using Linear
Regression and Time Series Data
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to forecast future sales based on previous sales
data. They believe that there is a linear trend in their sales over time and
want to create a model to forecast future values.
Your task is to generate a synthetic sales dataset, create a linear
regression forecasting model, and visualize the forecasted values along
with the original data. The dataset should include sales figures for each
month over a 3-year period (36 months). After building the model, you will
also plot the actual sales data and the predicted sales data on the same
graph to evaluate the forecast.
【Data Generation Code Example】
import numpy as np

import pandas as pd

# # Create monthly sales data for 36 months


months = np.arange(1, 37)

np.random.seed(42)

sales = 50 + months * 1.5 + np.random.normal(scale=10, size=36)

# # Convert to a pandas DataFrame

data = pd.DataFrame({'Month': months, 'Sales': sales})


【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

# # Create monthly sales data for 36 months

months = np.arange(1, 37)

np.random.seed(42)

sales = 50 + months * 1.5 + np.random.normal(scale=10, size=36)

# # Convert to a pandas DataFrame


data = pd.DataFrame({'Month': months, 'Sales': sales})

# # Prepare data for linear regression

X = data['Month'].values.reshape(-1, 1)

y = data['Sales'].values

# # Build the linear regression model

model = LinearRegression()

model.fit(X, y)

# Predict sales for each month in the dataset (fitted values)

y_pred = model.predict(X)

# # Plot the actual and predicted sales

plt.plot(data['Month'], data['Sales'], label='Actual Sales', marker='o')

plt.plot(data['Month'], y_pred, label='Predicted Sales', linestyle='--')


# # Add labels and legend

plt.title('Sales Forecasting with Linear Regression')

plt.xlabel('Month')

plt.ylabel('Sales')

plt.legend()

# # Display the plot

plt.show()

In this task, we are using a linear regression model to forecast future sales
based on historical sales data.
We begin by generating synthetic monthly sales data over a period of 36
months.
The sales data is created with a small linear growth factor, plus some
random noise to make it more realistic.
Next, we prepare the dataset for the machine learning model.
The month numbers are used as the input (X), and the sales values are used
as the target (y).
To build the forecasting model, we use the LinearRegression class from the
sklearn library.
After fitting the model to the data, we use it to predict the sales for each
month in the dataset.
Finally, we visualize both the actual sales data and the predicted sales data
on the same graph.
This allows us to evaluate how well the model fits the data and how
accurate the forecasts are.
The graph helps in comparing the trend and any deviations between the
predicted and actual values.
This exercise demonstrates how linear regression can be used to forecast
time series data.
In real-world applications, more complex models might be required for
datasets with seasonality or non-linear trends.
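The code answer predicts only for the 36 months already in the dataset. To produce an actual forecast, the fitted line can be evaluated at month numbers beyond the data. The sketch below assumes a six-month horizon (months 37 to 42), which is an illustrative choice and not part of the original task.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as the code answer
months = np.arange(1, 37)
np.random.seed(42)
sales = 50 + months * 1.5 + np.random.normal(scale=10, size=36)

model = LinearRegression()
model.fit(months.reshape(-1, 1), sales)

# Months 37-42 lie beyond the observed data (the horizon is an assumption)
future_months = np.arange(37, 43)
future_sales = model.predict(future_months.reshape(-1, 1))

print(f'slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}')
for m, s in zip(future_months, future_sales):
    print(f'Month {m}: forecast {s:.1f}')
```

The estimated slope should land near the 1.5 used to generate the data. Extrapolating a straight line is only as reliable as the assumption that the trend continues unchanged.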
【Trivia】
Linear regression is one of the simplest forecasting models, but it is often
not sufficient for capturing complex patterns in time series data. Advanced
methods such as ARIMA, Prophet, or LSTM (neural networks) are
frequently used for more accurate forecasting in business applications.
32. Visualizing the Impact of Dropout in a Neural
Network for Customer Churn Prediction
Importance★★★★☆
Difficulty★★★☆☆
A telecommunications company is facing a high customer churn rate, where
customers are canceling their services at an increasing rate. You are tasked
with building a simple neural network model to predict customer churn
using synthetic data. The focus is to visualize how different dropout rates
affect the model's performance and its ability to generalize.
Please create a neural network using Keras with at least two hidden layers,
and use dropout as a regularization technique. Compare the performance of
the model with and without dropout and plot the training and validation
losses for both cases.
Generate synthetic data to simulate customer features, with 10 input
features and a binary target variable indicating churn (1 for churn, 0 for
no churn).
【Data Generation Code Example】
import numpy as np

np.random.seed(42)

X = np.random.rand(1000, 10) # 1000 samples, 10 features

y = np.random.randint(0, 2, 1000) # binary target for churn/no churn


【Diagram Answer】

【Code Answer】
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam
from sklearn.model_selection import train_test_split

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, 1000)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Build model with dropout
def create_model(dropout_rate):
    model = Sequential()
    model.add(Dense(64, input_dim=10, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer=Adam(), loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Train and evaluate the model with dropout
model_dropout = create_model(0.5)
history_dropout = model_dropout.fit(X_train, y_train,
                                    validation_data=(X_test, y_test),
                                    epochs=50, batch_size=32, verbose=0)

# Train and evaluate the model without dropout
model_no_dropout = create_model(0.0)
history_no_dropout = model_no_dropout.fit(X_train, y_train,
                                          validation_data=(X_test, y_test),
                                          epochs=50, batch_size=32, verbose=0)

# Plot the training and validation losses for both cases
plt.plot(history_dropout.history['loss'], label='Train Loss with Dropout')
plt.plot(history_dropout.history['val_loss'],
         label='Validation Loss with Dropout')
plt.plot(history_no_dropout.history['loss'], label='Train Loss without Dropout')
plt.plot(history_no_dropout.history['val_loss'],
         label='Validation Loss without Dropout')
plt.title('Impact of Dropout on Neural Network Performance')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

Dropout is a regularization technique used in neural networks to prevent
overfitting by randomly disabling a fraction of neurons during each training
iteration.
In this exercise, we built two models to observe how dropout affects neural
network performance: one with a dropout rate of 50% and the other without
dropout. We generated synthetic data representing customer features and
churn, with 1000 samples and 10 features each. The dataset was split into
training and testing sets, where 70% of the data was used for training and
30% for testing. The testing set also serves as the validation data during
training.
The neural network was built using Keras with two hidden layers, each
followed by a dropout layer in the dropout-enabled model. Both models were
trained for 50 epochs, and their losses were plotted to compare
performance. The loss represents the error between the predicted and actual
outcomes, and we visualized the training loss and validation loss for both
models.
When dropout is applied, the model tends to generalize better, reducing the
risk of overfitting. In the plot, this appears as a validation loss that is
lower and diverges less from the training loss than in the model without
dropout. Note that the labels here are random, so neither model can learn a
real pattern; the comparison mainly shows how strongly each model memorizes
the training set.
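To make "randomly disabling a fraction of neurons" concrete, here is a minimal NumPy sketch of inverted dropout, the variant in which surviving activations are scaled up by 1 / (1 - rate) during training so that their expected value is unchanged. The function dropout and its parameters are written for illustration and are not part of Keras.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return activations  # at inference time every unit stays active
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = unit is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
hidden = np.ones(10000)  # stand-in for one layer's activations

train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
eval_out = dropout(hidden, rate=0.5, training=False, rng=rng)

print('fraction zeroed in training:', np.mean(train_out == 0))
print('mean activation (training): ', train_out.mean())
print('mean activation (inference):', eval_out.mean())
```

With a rate of 0.5, roughly half of the units are zeroed and the survivors are doubled, so the mean activation stays close to 1 in both modes.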
【Trivia】
The concept of dropout was introduced by Geoffrey Hinton and colleagues in
2012 as a technique to reduce overfitting in neural networks. It was
inspired by the idea of preventing co-adaptation of neurons, making neural
networks more robust by forcing them to learn more distributed
representations.
33. Visualizing Error Distributions in Machine
Learning Models
Importance★★★★☆
Difficulty★★★☆☆
You are a data scientist working for a financial institution. Your team has
developed a machine learning regression model to predict house prices
based on various features. However, you need to check the distribution of
errors (residuals) to ensure that the model is not biased.
Using the model's predictions and actual data, your task is to visualize
the distribution of errors to evaluate the model's performance. Create a
histogram and a box plot to visualize the residuals. For simplicity,
generate the input data (features and target) within the code.
【Data Generation Code Example】
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression


# Generate sample data for house prices (features and target)
np.random.seed(42)
X = np.random.rand(500, 1) * 10  # Features: 500 samples with 1 feature (scaled between 0 and 10)
y = 3.5 * X.flatten() + np.random.randn(500) * 2  # Target: linear function with added noise

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

# Generate sample data for house prices (features and target)

np.random.seed(42)

X = np.random.rand(500, 1) * 10  # Features: 500 samples with 1 feature (scaled between 0 and 10)
y = 3.5 * X.flatten() + np.random.randn(500) * 2  # Target: linear function with added noise

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Train the regression model

model = LinearRegression()

model.fit(X_train, y_train)

# Predict on the test set

y_pred = model.predict(X_test)

# Calculate residuals (errors)

residuals = y_test - y_pred

# Create a figure to plot the distribution of residuals


plt.figure(figsize=(12, 6))

# Plot the histogram of residuals

plt.subplot(1, 2, 1)
plt.hist(residuals, bins=30, edgecolor='black', alpha=0.7)

plt.title('Histogram of Residuals')

plt.xlabel('Residual')

plt.ylabel('Frequency')

# Plot the box plot of residuals

plt.subplot(1, 2, 2)

plt.boxplot(residuals, vert=False, patch_artist=True)

plt.title('Box Plot of Residuals')


plt.xlabel('Residual')

plt.tight_layout()

plt.show()

In this problem, we aim to visualize the distribution of errors (residuals)
to check for potential bias or anomalies in a machine learning model's
predictions.
First, we generate synthetic data for house prices. The feature values (X)
are random numbers, scaled between 0 and 10, while the target values (y)
follow a linear relationship with some added noise to simulate real-world
conditions. We then split this data into training and testing sets using
train_test_split. The LinearRegression model is trained on the training
set, and predictions are made on the test set.
Next, the residuals are calculated by subtracting the predicted values from
the actual target values (y_test - y_pred). To visualize these residuals,
we create two plots: a histogram and a box plot.
The histogram helps us understand the frequency distribution of residuals
and detect any skewness or outliers. Residuals that are roughly symmetric
and centered on zero suggest that the model does not systematically
over-predict or under-predict.
The box plot provides a clear summary of the distribution, showing the
median, quartiles, and potential outliers. This plot is useful for
identifying extreme values in the residuals that may indicate problematic
predictions.
Both plots are generated using matplotlib and placed side by side as
subplots, so the errors can be evaluated in two different ways.
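The two plots can be backed up with a few numbers. The sketch below recomputes the residuals from the same synthetic data, then reports their mean, standard deviation, and the count of points beyond 1.5 times the interquartile range from the quartiles, which is the default whisker rule in matplotlib's box plot. The variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same synthetic data and split as the code answer
np.random.seed(42)
X = np.random.rand(500, 1) * 10
y = 3.5 * X.flatten() + np.random.randn(500) * 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
residuals = y_test - model.predict(X_test)

# Center and spread: a mean near 0 indicates no systematic over/under-prediction
print(f'mean: {residuals.mean():.3f}, std: {residuals.std():.3f}')

# Box plot whisker rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(residuals, [25, 75])
iqr = q3 - q1
is_outlier = (residuals < q1 - 1.5 * iqr) | (residuals > q3 + 1.5 * iqr)
print(f'IQR: {iqr:.3f}, flagged outliers: {is_outlier.sum()}')
```

Since the noise was generated with a standard deviation of 2, the residual standard deviation should come out close to that value.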
【Trivia】
Did you know that understanding the distribution of residuals is essential in
regression problems because it can reveal whether the model assumptions
are met? In linear regression, it is assumed that residuals should be
normally distributed and have constant variance (homoscedasticity). If these
assumptions are violated, the model might give misleading results!
34. Visualizing Feature Distributions After Scaling
Transformation
Importance★★★★☆
Difficulty★★★☆☆
A company is working on a machine learning model to predict customer
purchasing patterns based on several features, including customer age,
income, and the number of previous purchases. The features vary
significantly in scale, making it difficult to train an effective model. You
have been asked to visualize the distributions of these features both before
and after applying scaling techniques such as StandardScaler and
MinMaxScaler.
Write a Python program to:
‣ Generate a dataset containing three features: customer age (normally
distributed), income (log-normally distributed), and previous purchases
(uniformly distributed).
‣ Apply scaling transformations (StandardScaler and MinMaxScaler) to these
features.
‣ Plot and compare the distributions of the original and scaled features
using histograms.
【Data Generation Code Example】
import numpy as np
import pandas as pd

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'age': np.random.normal(loc=40, scale=12, size=100),
    'income': np.random.lognormal(mean=3, sigma=0.8, size=100),
    'purchases': np.random.uniform(1, 50, 100)
})
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'age': np.random.normal(loc=40, scale=12, size=100),  # Age feature (normally distributed)
    'income': np.random.lognormal(mean=3, sigma=0.8, size=100),  # Income feature (log-normal distribution)
    'purchases': np.random.uniform(1, 50, 100)  # Previous purchases (uniform distribution)
})

# Initialize scalers
standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()

# Apply StandardScaler and MinMaxScaler
data_standard_scaled = pd.DataFrame(standard_scaler.fit_transform(data),
                                    columns=data.columns)
data_minmax_scaled = pd.DataFrame(minmax_scaler.fit_transform(data),
                                  columns=data.columns)

# Plotting the original, StandardScaled, and MinMaxScaled distributions
fig, axs = plt.subplots(3, 3, figsize=(15, 12))
for i, column in enumerate(data.columns):
    # Original feature distribution
    axs[0, i].hist(data[column], bins=15, color='blue', alpha=0.7)
    axs[0, i].set_title(f'Original {column}')
    # StandardScaler feature distribution
    axs[1, i].hist(data_standard_scaled[column], bins=15, color='green',
                   alpha=0.7)
    axs[1, i].set_title(f'StandardScaled {column}')
    # MinMaxScaler feature distribution
    axs[2, i].hist(data_minmax_scaled[column], bins=15, color='red',
                   alpha=0.7)
    axs[2, i].set_title(f'MinMaxScaled {column}')
plt.tight_layout()
plt.show()

In this task, we simulate a realistic dataset representing customer
purchasing behavior using three features: customer age, income, and
previous purchases. Each feature has a different statistical distribution:
‣ Age follows a normal distribution, which is typical for demographic
variables.
‣ Income follows a log-normal distribution, representing the typical
distribution of income data where most values are clustered at the lower
end but a few outliers exist at the high end.
‣ Previous purchases are uniformly distributed, where each customer has an
equal likelihood of making any number of purchases within a defined range.
When training machine learning models, different features often have
varying scales. This can make certain models less effective, such as those
that rely on distance measures, like K-Nearest Neighbors (KNN), or on
gradient-based optimization, as in linear regression and neural networks.
Therefore, it is crucial to scale the features.
To solve this, we use two common scaling methods:
‣ StandardScaler: This method scales the features so that they have a mean
of 0 and a standard deviation of 1.
‣ MinMaxScaler: This method scales the features between a defined range,
typically [0,1].
In the code, after generating the data, we apply the StandardScaler and
MinMaxScaler using the fit_transform method. Then, we plot histograms to
visualize how these scaling methods transform the original feature
distributions.
The first row of histograms shows the original data distributions. The
second row visualizes the StandardScaler-transformed data, and the third
row shows the MinMaxScaler-transformed data. Both scalers apply a linear
transformation, so the shape of each histogram stays the same and only the
values on the horizontal axis change. This comparison shows that scaling
puts the features on a comparable footing for machine learning algorithms
without altering how the values are distributed.
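Both transformations are simple enough to write out by hand, which helps to see what they do and do not change. The sketch below applies the standardization and min-max formulas to a log-normal income feature with NumPy and compares the skewness before and after. The helper skewness is defined here for illustration, and np.std uses the population standard deviation, the same convention as StandardScaler.

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.lognormal(mean=3, sigma=0.8, size=100)

# Standardization: subtract the mean, divide by the standard deviation
standardized = (income - income.mean()) / income.std()

# Min-max scaling: map the smallest value to 0 and the largest to 1
min_max = (income - income.min()) / (income.max() - income.min())

def skewness(x):
    # Third standardized moment; unaffected by shifting or positive rescaling
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

print(f'standardized: mean={standardized.mean():.3f}, std={standardized.std():.3f}')
print(f'min-max:      min={min_max.min():.3f}, max={min_max.max():.3f}')
print(f'skewness: original={skewness(income):.3f}, '
      f'standardized={skewness(standardized):.3f}, min-max={skewness(min_max):.3f}')
```

The three skewness values agree, which confirms that neither scaler removes the long right tail of the income feature. A transformation such as taking the logarithm would be needed for that.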
【Trivia】
Did you know that different scaling methods can significantly impact the
performance of algorithms such as Support Vector Machines (SVMs) and
K-Means clustering? For algorithms sensitive to the distance between
points, having unscaled features can bias the model towards features with
larger scales, making scaling a critical preprocessing step.
35. Visualizing Customer Decision-Making with Decision Trees
Importance★★★★★
Difficulty★★☆☆☆
You are working for a telecom company, and your manager is concerned about the high churn rate of customers.
To address this, you are tasked with building a decision tree model to predict whether a customer will leave based on their contract information and monthly charges.
Create a sample dataset with three features: 'Contract Type' (1-year, 2-year, or month-to-month), 'Monthly Charges', and 'Customer Churn' (whether the customer left or not).
Build a decision tree classifier using this dataset and visualize the tree to understand how decisions are made regarding customer churn.
Your task is to implement the model and visualize the resulting decision tree.
【Data Generation Code Example】
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Create a simple dataset for telecom customers
data = {'Contract Type': np.random.choice(['1-year', '2-year', 'month-to-month'], 100),
        'Monthly Charges': np.random.uniform(50, 150, 100),
        'Customer Churn': np.random.choice([0, 1], 100)}
df = pd.DataFrame(data)
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Create a simple dataset for telecom customers
data = {'Contract Type': np.random.choice(['1-year', '2-year', 'month-to-month'], 100),
        'Monthly Charges': np.random.uniform(50, 150, 100),
        'Customer Churn': np.random.choice([0, 1], 100)}
df = pd.DataFrame(data)

# Convert categorical 'Contract Type' to numerical
df['Contract Type'] = df['Contract Type'].map({'1-year': 1, '2-year': 2, 'month-to-month': 0})

# Split the dataset
X = df[['Contract Type', 'Monthly Charges']]
y = df['Customer Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the decision tree classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Visualize the decision tree
plt.figure(figsize=(10, 8))
plot_tree(clf, feature_names=['Contract Type', 'Monthly Charges'],
          class_names=['Not Churn', 'Churn'], filled=True)
plt.show()


In this problem, the goal is to predict customer churn using a decision tree.
We start by creating a dataset that represents a telecom company's customer data. The dataset has three columns: the contract type, the monthly charges, and whether the customer churned (left) or not. The first two are the input features and the third is the target.
The 'Contract Type' feature is categorical (1-year, 2-year, or month-to-month), so we convert it into numerical values. This is necessary because scikit-learn estimators expect numeric input and cannot process categorical strings directly. The map() function is used to map each contract type to a corresponding number:
‣ '1-year' = 1
‣ '2-year' = 2
‣ 'month-to-month' = 0
Next, we split the data into training and testing sets using the train_test_split function. This is important for validating the model's performance. 80% of the data is used for training the model, and 20% is reserved for testing.
We use a decision tree classifier from the sklearn library. The DecisionTreeClassifier is a machine learning model based on a sequence of binary decisions. At each node, the tree chooses the feature and threshold that best separate the classes; by default, scikit-learn measures this with Gini impurity.
We limit the depth of the tree to 3 levels using the max_depth parameter to avoid overfitting the model to the training data. This keeps the model interpretable and allows us to clearly visualize the decision-making process.
Finally, the plot_tree function is used to visualize the decision tree. The plot shows how the model decides whether a customer will churn based on contract type and monthly charges. The filled=True argument adds color to the nodes, helping to differentiate the classification outcomes.
This visualization can be helpful for interpreting the model's decision-making process and identifying key factors influencing customer churn.
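The explanation above reserves 20% of the data for testing, so a natural follow-up is to score the fitted tree on that held-out portion. The following is a minimal sketch under stated assumptions: it rebuilds a comparable dataset with plain NumPy arrays instead of a DataFrame, and the variable names (`contract`, `charges`, `churn`) are illustrative. Because the churn labels are drawn at random, the test accuracy should not be expected to be much better than chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 100
contract = rng.integers(0, 3, n)   # 0 = month-to-month, 1 = 1-year, 2 = 2-year
charges = rng.uniform(50, 150, n)
churn = rng.integers(0, 2, n)      # random labels, as in the exercise

X = np.column_stack([contract, charges])
X_train, X_test, y_train, y_test = train_test_split(
    X, churn, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Compare accuracy on the data the tree has seen with accuracy on unseen data
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A large gap between the two numbers is a sign that the tree has fitted noise in the training set.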
【Trivia】
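Decision trees can use the concept of "information gain" to split nodes. Information gain measures how effectively a feature separates the data into classes (churn vs. not churn). Note that scikit-learn's DecisionTreeClassifier uses Gini impurity by default; passing criterion='entropy' selects the entropy-based (information gain) criterion instead.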
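To make the idea of information gain concrete, the sketch below computes it by hand for two candidate splits. This is an illustration, not scikit-learn's internal implementation: the helper functions `entropy` and `information_gain` and the eight-customer toy data are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction from splitting labels into labels[mask] and labels[~mask]."""
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical toy data: 8 customers, 1 = churn
churn = np.array([1, 1, 1, 0, 0, 0, 1, 0])
month_to_month = np.array([True, True, True, False, False, False, True, False])
high_charges = np.array([True, False, True, False, True, False, True, False])

gain_contract = information_gain(churn, month_to_month)
gain_charges = information_gain(churn, high_charges)
print("gain from splitting on contract type:", round(gain_contract, 4))
print("gain from splitting on high charges: ", round(gain_charges, 4))
```

In this toy example the contract split separates the two classes perfectly, so it has the larger gain and would be chosen first.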
36. Evaluating Regression Models with Residual Plots to Improve Predictions
Importance★★★★★
Difficulty★★★☆☆
Imagine you are a data scientist working for an e-commerce platform. The company wants to predict the sales of a newly launched product based on marketing expenditure, using linear regression. However, there are concerns that the linear model might not be accurate enough, and your task is to evaluate the model by creating residual plots to check the assumptions of linear regression.
Your task is to:
‣ Create a dataset where sales depend on marketing expenditure (you will generate this data yourself).
‣ Fit a linear regression model to predict sales based on marketing expenditure.
‣ Plot the residuals of the model against the predicted values to evaluate how well the model fits.
‣ Identify whether there is any pattern in the residuals that suggests the linear model is not appropriate.
You should also provide a brief interpretation based on the residual plot.
【Data Generation Code Example】
import numpy as np
import pandas as pd

# Generating random marketing expenditure data and sales data
np.random.seed(0)
marketing_expenditure = np.random.uniform(1000, 5000, 100)
noise = np.random.normal(0, 2000, 100)
sales = 2.5 * marketing_expenditure + noise

# Creating a DataFrame
data = pd.DataFrame({'MarketingExpenditure': marketing_expenditure,
                     'Sales': sales})
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Step 1: Generate random data
np.random.seed(0)
marketing_expenditure = np.random.uniform(1000, 5000, 100)
noise = np.random.normal(0, 2000, 100)
sales = 2.5 * marketing_expenditure + noise

# Step 2: Create DataFrame
data = pd.DataFrame({'MarketingExpenditure': marketing_expenditure,
                     'Sales': sales})

# Step 3: Fit linear regression model
X = data[['MarketingExpenditure']]
y = data['Sales']
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)

# Step 4: Calculate residuals (errors)
residuals = y - predictions

# Step 5: Create residual plot
plt.scatter(predictions, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

This exercise involves assessing the quality of a linear regression model by plotting residuals. Linear regression assumes that the residuals (errors between the observed and predicted values) are randomly distributed with a constant variance and are normally distributed around zero.
Generating Data: A dataset is first generated where the sales depend on marketing expenditure with some added noise (random variation), which simulates real-world scenarios where data often includes unpredictable variations.
Fitting the Model: We fit a linear regression model using the LinearRegression class from the sklearn library, where the independent variable is the marketing expenditure, and the dependent variable is the sales.
Residuals Calculation: Once the model has predicted the sales values, we calculate the residuals by subtracting the predicted sales values from the actual sales values. These residuals are essential in determining the model's performance.
Residual Plot: The residual plot is generated by plotting the predicted values on the x-axis and the residuals on the y-axis. The horizontal red line indicates zero residuals, and ideally, residuals should be scattered randomly around this line without forming any patterns.
If the residuals form a clear pattern, it suggests that the linear regression model may not be appropriate for this data, indicating that other types of models or transformations might be needed to improve the predictions.
【Trivia】
Residual plots are one of the most important diagnostic tools for evaluating regression models. A curved pattern in the residuals suggests that the model is missing a nonlinear relationship, key variables, or interactions, while a funnel shape suggests that the error variance is not constant (heteroscedasticity). In such cases, nonlinear regression models or transformations might be necessary to capture the true relationship.
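The sketch below illustrates what a "pattern in the residuals" looks like numerically. It assumes a hypothetical dataset whose true relationship is quadratic, and it uses np.polyfit as a stand-in for the LinearRegression model in the exercise. After a straight-line fit, the residuals are strongly correlated with the squared distance from the mean of x (a curve in the residual plot); after a quadratic fit, that correlation vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 * x**2 + rng.normal(0, 5, 200)  # assumed quadratic ground truth

# Straight-line fit and its residuals
slope, intercept = np.polyfit(x, y, 1)
resid_linear = y - (slope * x + intercept)

# Quadratic fit and its residuals
coeffs = np.polyfit(x, y, 2)
resid_quad = y - np.polyval(coeffs, x)

# A curvature term: large at both ends of the x range, small in the middle
curvature = (x - x.mean()) ** 2
corr_linear = np.corrcoef(curvature, resid_linear)[0, 1]
corr_quad = np.corrcoef(curvature, resid_quad)[0, 1]
print("correlation with curvature, linear fit:   ", round(corr_linear, 3))
print("correlation with curvature, quadratic fit:", round(corr_quad, 3))
```

The data in the exercise itself is generated from a linear formula, so its residual plot should show no such structure.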
37. PCA and Biplots for Customer Data Analysis
Importance★★★★☆
Difficulty★★★☆☆
Imagine you are a data scientist at a retail company. The company collects data on customer behavior, including their total spending in various product categories. Your task is to analyze customer spending patterns to identify underlying factors or trends that explain their behavior. You are asked to perform a Principal Component Analysis (PCA) on customer data and visualize the results using a biplot.
Create a dataset that simulates spending behavior across different product categories, such as 'Food', 'Electronics', 'Clothing', and 'Furniture'.
Perform PCA to reduce the dimensionality of the dataset and visualize the results with a biplot. The company wants to know if customer behaviors can be grouped or explained by fewer components.
【Data Generation Code Example】
import numpy as np

np.random.seed(42)
categories = ['Food', 'Electronics', 'Clothing', 'Furniture']
customers = 100
data = np.random.rand(customers, len(categories)) * 500  # Simulating customer spending
【Diagram Answer】

【Code Answer】
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Data generation (simulated customer spending data)
np.random.seed(42)  # Same seed as the data generation example, for reproducibility
categories = ['Food', 'Electronics', 'Clothing', 'Furniture']
customers = 100
data = np.random.rand(customers, len(categories)) * 500

# Standardizing the data before applying PCA
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Performing PCA, aiming to reduce the dataset to 2 components
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_scaled)

# Creating a biplot to visualize the PCA result
def biplot(score, coeff, labels=None):
    plt.figure(figsize=(8, 6))
    # Plot the principal component scores
    plt.scatter(score[:, 0], score[:, 1], color='blue', s=50)
    for i in range(len(coeff)):
        # Arrows representing the original features
        plt.arrow(0, 0, coeff[i, 0] * max(score[:, 0]), coeff[i, 1] * max(score[:, 1]),
                  color='r', head_width=0.05, head_length=0.1)
        if labels is None:
            plt.text(coeff[i, 0] * max(score[:, 0]) * 1.2, coeff[i, 1] * max(score[:, 1]) * 1.2,
                     "Var" + str(i + 1), color='green', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * max(score[:, 0]) * 1.2, coeff[i, 1] * max(score[:, 1]) * 1.2,
                     labels[i], color='green', ha='center', va='center')
    plt.xlabel("Principal Component 1")
    plt.ylabel("Principal Component 2")
    plt.grid(True)

# Plotting the biplot
biplot(pca_result, pca.components_.T, labels=categories)
plt.show()


In this task, you are applying Principal Component Analysis (PCA) to reduce the dimensionality of customer spending data. PCA is a useful technique when dealing with high-dimensional datasets, as it identifies the most significant components that capture the maximum variance in the data, allowing you to explain the data with fewer features.
The first step is to generate a random dataset, simulating customer spending in different product categories ('Food', 'Electronics', 'Clothing', 'Furniture'). This dataset consists of 100 customers, each with spending values in these categories.
Next, the data is standardized using StandardScaler. Standardization is necessary because PCA is sensitive to the scale of the data. Features with large variance can dominate the results if not scaled properly. Standardization ensures that all features contribute equally to the PCA.
The PCA is performed using PCA(n_components=2), where we aim to reduce the data to two principal components. These two components are linear combinations of the original features ('Food', 'Electronics', 'Clothing', 'Furniture'), and they capture the most variance in the dataset.
Finally, the result is visualized using a biplot. A biplot is a combination of a scatter plot of the PCA scores and arrows representing the original features. The arrows indicate the directions of the original variables in the PCA space. This visualization helps you understand how the original features contribute to the two principal components and how the customers are distributed across the new feature space.
This type of analysis is essential in many fields, such as customer segmentation, where the goal is to simplify complex datasets and identify patterns or trends that drive customer behavior.
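How much of the data the two components actually explain can be read from the fitted PCA object. The following is a minimal sketch that regenerates the same kind of random spending data and prints explained_variance_ratio_, the fraction of total variance captured by each component. Because the four simulated categories are independent random columns, one would expect each component to explain roughly a quarter of the variance, so two components are unlikely to summarize this particular dataset well.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
data = np.random.rand(100, 4) * 500  # simulated spending in four categories
data_scaled = StandardScaler().fit_transform(data)

pca = PCA(n_components=2).fit(data_scaled)

# Fraction of the total variance captured by each retained component
ratios = pca.explained_variance_ratio_
print("variance explained per component:", ratios.round(3))
print("total for two components:", round(float(ratios.sum()), 3))
```

With real customer data, where categories tend to be correlated, the first components usually account for a much larger share.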
【Trivia】
PCA was first introduced by Karl Pearson in 1901 as a way to reduce the
dimensionality of data while preserving as much variability as possible. It
has since become a fundamental technique in machine learning and data
analysis.
38. Visualizing Neural Network Training History Using Matplotlib
Importance★★★★☆
Difficulty★★★☆☆
You are a data scientist working for a retail company that wants to predict customer churn using a neural network model. The company has collected customer data, including features such as the number of purchases, account age, and engagement frequency.
Your task is to build a simple neural network model that trains on this customer data and visualize its training performance.
The goal is to track how the model's accuracy and loss change over time during the training process. Visualize these metrics in a line plot to provide insights into the model's performance.
Create sample data directly in the code and train the neural network, then output the accuracy and loss graphs. Ensure you provide code that can be used to plot these training history results.
【Data Generation Code Example】
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate sample customer data (inputs and outputs) for churn prediction
X = np.random.rand(1000, 10)  # 1000 customers with 10 features each
y = np.random.randint(2, size=1000)  # Binary labels for churn (0 or 1)

# Split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Generate sample customer data (inputs and outputs) for churn prediction
X = np.random.rand(1000, 10)  # 1000 customers with 10 features each
y = np.random.randint(2, size=1000)  # Binary labels for churn (0 or 1)

# Split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define a simple neural network model
model = Sequential([
    Dense(32, input_dim=10, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model and capture the history of the training process
history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
                    epochs=30, batch_size=32, verbose=0)

# Extract accuracy and loss from the training history
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

# Create epochs range for the x-axis
epochs_range = range(1, 31)

# Plot accuracy and loss over the epochs
plt.figure(figsize=(12, 6))

# Accuracy plot
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

# Loss plot
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()


In this task, you are training a neural network model to predict customer churn using artificial data generated within the code.
The neural network is created using TensorFlow's Sequential model, which allows layers to be stacked one after another.
The model takes 10 input features, followed by two hidden layers with 32 and 16 neurons, respectively.
The final output layer has a single neuron with a sigmoid activation function, which is suitable for binary classification (predicting whether a customer will churn or not).
The model is compiled with the Adam optimizer and binary cross-entropy loss function.
The training process runs for 30 epochs, and both training and validation data are monitored. The accuracy and loss metrics are extracted from the training history and then visualized using matplotlib.
Two line plots are created: one for accuracy and another for loss, each showing both the training and validation sets. These plots are essential for understanding whether the model is learning effectively or overfitting.
The epochs (1 through 30) are used on the x-axis, while accuracy and loss values are plotted on the y-axis.
By observing the graphs, you can assess how well the model is performing on both the training and validation datasets. If validation loss rises while training loss keeps falling, this indicates overfitting, where the model is performing well on the training data but poorly on unseen data.
Because the labels in this exercise are generated at random, independently of the features, validation accuracy should be expected to hover around 50%, and any rise in training accuracy reflects memorization of the training set rather than a learnable pattern.
【Trivia】
The early stopping technique can be applied during model training to
prevent overfitting by stopping the training once the model performance
stops improving on the validation data.
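The logic behind early stopping can be sketched in a few lines of plain Python. This is a hand-rolled illustration, not the Keras implementation: the function name `early_stopping_epoch`, its parameters, and the validation-loss values are hypothetical. It reports the epoch at which training would stop once the validation loss has failed to improve for `patience` consecutive epochs.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # New best validation loss: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation-loss curve: improves for four epochs, then degrades
val_losses = [0.70, 0.66, 0.64, 0.63, 0.65, 0.66, 0.68, 0.71]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print("training would stop after epoch", stop_epoch)
```

In Keras, the same behavior is available without custom code through the EarlyStopping callback in tensorflow.keras.callbacks, which can be passed to model.fit via the callbacks argument and configured with monitor='val_loss' and a patience value.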
39. Building and Visualizing a Multi-Output Regression Model for Predicting House Prices and Rental Rates
Importance★★★★☆
Difficulty★★★☆☆
You are working for a real estate company that wants to predict both house prices and rental rates based on several features.
Your task is to create a multi-output regression model to predict both target variables using machine learning.
The features include the size of the house (in square meters), the number of bedrooms, and the distance to the city center (in kilometers).
Please train the model, visualize the predicted vs. actual values for both outputs, and display the two graphs in one plot.
【Data Generation Code Example】
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 3) * [200, 5, 30]
y1 = X[:, 0] * 3000 + X[:, 1] * 5000 - X[:, 2] * 150 + np.random.randn(100) * 1000
y = X[:, 0] * 50 + X[:, 1] * 100 - X[:, 2] * 10 + np.random.randn(100) * 50
y = np.column_stack([y1, y])
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Generate data
np.random.seed(42)
X = np.random.rand(100, 3) * [200, 5, 30]
y1 = X[:, 0] * 3000 + X[:, 1] * 5000 - X[:, 2] * 150 + np.random.randn(100) * 1000
y = X[:, 0] * 50 + X[:, 1] * 100 - X[:, 2] * 10 + np.random.randn(100) * 50
y = np.column_stack([y1, y])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=100, random_state=42))
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate and print the errors
print("MSE for House Prices:", mean_squared_error(y_test[:, 0], y_pred[:, 0]))
print("MSE for Rental Rates:", mean_squared_error(y_test[:, 1], y_pred[:, 1]))

# Plot actual vs predicted for both outputs
plt.figure(figsize=(10, 5))

# Plot for house prices
plt.subplot(1, 2, 1)
plt.scatter(y_test[:, 0], y_pred[:, 0], color='blue')
plt.plot([min(y_test[:, 0]), max(y_test[:, 0])],
         [min(y_test[:, 0]), max(y_test[:, 0])], color='red')
plt.title('House Prices: Actual vs Predicted')
plt.xlabel('Actual House Prices')
plt.ylabel('Predicted House Prices')

# Plot for rental rates
plt.subplot(1, 2, 2)
plt.scatter(y_test[:, 1], y_pred[:, 1], color='green')
plt.plot([min(y_test[:, 1]), max(y_test[:, 1])],
         [min(y_test[:, 1]), max(y_test[:, 1])], color='red')
plt.title('Rental Rates: Actual vs Predicted')
plt.xlabel('Actual Rental Rates')
plt.ylabel('Predicted Rental Rates')

# Show the plots
plt.tight_layout()
plt.show()


This task involves creating a multi-output regression model to predict two different target variables (house prices and rental rates) using the same input features.
We start by generating the input data. The feature matrix X includes three columns: house size, number of bedrooms, and distance to the city center.
We generate two target variables: y1 (house prices) and y (rental rates), which are then stacked into a single two-column array y. These target variables are calculated based on the features, with added random noise to make the problem more realistic.
Next, the data is split into training and test sets using train_test_split, which allows the model to learn on one portion of the data and be evaluated on unseen data (the test set).
We then use a MultiOutputRegressor, which fits a separate copy of the base estimator for each target so that the combined model can predict multiple outputs. Here, we choose a RandomForestRegressor as the base model, which is well-suited for handling complex relationships in data.
After training the model, we make predictions on the test set and calculate the Mean Squared Error (MSE) for both house prices and rental rates, providing a measure of how well the model performed. Because MSE is expressed in squared units of the target, the two values are not directly comparable with each other.
Finally, we visualize the predictions by plotting the actual vs. predicted values for both outputs in a single figure. This helps to see how closely the predictions align with the actual values, providing a clear assessment of the model's performance.
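Two points from the explanation above can be checked directly: the wrapper holds one fitted estimator per target, and a scale-free metric makes the two outputs easier to compare than MSE. The following is a minimal sketch under stated assumptions: the names `y_price` and `y_rent`, the simple 80/20 slice in place of train_test_split, and the use of r2_score are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import r2_score

np.random.seed(42)
X = np.random.rand(100, 3) * [200, 5, 30]
y_price = X[:, 0] * 3000 + X[:, 1] * 5000 - X[:, 2] * 150 + np.random.randn(100) * 1000
y_rent = X[:, 0] * 50 + X[:, 1] * 100 - X[:, 2] * 10 + np.random.randn(100) * 50
y = np.column_stack([y_price, y_rent])

# Train on the first 80 rows, evaluate on the last 20
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=42))
model.fit(X[:80], y[:80])
y_pred = model.predict(X[80:])

# One independent random forest is stored per target column
print("fitted estimators:", len(model.estimators_))

# R^2 is unitless, so prices and rents can be compared on the same footing
for i, name in enumerate(["House price", "Rental rate"]):
    print(name, "R^2:", round(r2_score(y[80:, i], y_pred[:, i]), 3))
```

RandomForestRegressor also accepts a two-column target directly, so the wrapper is a design choice here rather than a requirement; it matters more for base estimators that only support a single output.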
【Trivia】
Multi-output regression is commonly used in real-world applications where
multiple related outcomes need to be predicted, such as in environmental
modeling, economics, and even healthcare (e.g., predicting multiple health
indicators at once).
40. Comparative Performance of Machine Learning Algorithms on a Sales Prediction Dataset
Importance★★★★★
Difficulty★★★☆☆
A company wants to predict future sales based on historical data, and they want to know which machine learning algorithm performs better for this task.
Your goal is to compare two algorithms: Linear Regression and Decision Tree Regression on a dataset that simulates sales based on variables such as advertising budget, social media engagement, and product rating.
You need to:
‣ Generate a synthetic dataset with three features: Advertising, SocialMedia, and ProductRating.
‣ Use Linear Regression and Decision Tree Regression to predict the Sales based on the features.
‣ Visualize and compare the performance of these two models using a scatter plot for predictions and the actual sales data.
【Data Generation Code Example】
import numpy as np
import pandas as pd

# Generate random input data for the dataset
Advertising = np.random.uniform(100, 500, 100)
SocialMedia = np.random.uniform(10, 200, 100)
ProductRating = np.random.uniform(1, 5, 100)

# Create the target variable "Sales" based on a formula (introducing some randomness)
Sales = (50 + 0.3 * Advertising + 0.5 * SocialMedia + 10 * ProductRating
         + np.random.normal(0, 25, 100))

# Combine into a DataFrame
data = pd.DataFrame({'Advertising': Advertising, 'SocialMedia': SocialMedia,
                     'ProductRating': ProductRating, 'Sales': Sales})
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Generate random input data for the dataset
Advertising = np.random.uniform(100, 500, 100)
SocialMedia = np.random.uniform(10, 200, 100)
ProductRating = np.random.uniform(1, 5, 100)
Sales = (50 + 0.3 * Advertising + 0.5 * SocialMedia + 10 * ProductRating
         + np.random.normal(0, 25, 100))
data = pd.DataFrame({'Advertising': Advertising, 'SocialMedia': SocialMedia,
                     'ProductRating': ProductRating, 'Sales': Sales})

# Split the dataset into training and testing sets
X = data[['Advertising', 'SocialMedia', 'ProductRating']]
y = data['Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

# Train the Decision Tree model
dt_model = DecisionTreeRegressor()
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

# Calculate and print the mean squared error for both models
mse_lr = mean_squared_error(y_test, y_pred_lr)
mse_dt = mean_squared_error(y_test, y_pred_dt)
print("MSE (Linear Regression):", mse_lr)
print("MSE (Decision Tree):", mse_dt)

# Create a plot comparing the predicted sales from both models to the actual sales
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lr, color='blue', label='Linear Regression', alpha=0.6)
plt.scatter(y_test, y_pred_dt, color='red', label='Decision Tree', alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--',
         lw=2, label='Perfect Prediction')
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.legend()
plt.title('Comparison of Linear Regression vs Decision Tree Regression')
plt.show()


This task demonstrates how to compare the performance of two machine learning algorithms on the same dataset.
We first generate a synthetic dataset with three features: Advertising, SocialMedia, and ProductRating, which represent variables likely to influence sales in a real-world scenario. The target variable Sales is calculated using a simple linear formula with some added randomness to simulate the variability in real-world sales data.
After generating the data, we split it into training and test sets so that we can evaluate the models on unseen data, which helps detect overfitting.
We then train two models:
‣ Linear Regression: This algorithm assumes that the relationship between the dependent variable (Sales) and independent variables is linear. It tries to find the line (with several features, the hyperplane) that best fits the data points.
‣ Decision Tree Regressor: This algorithm splits the data into smaller subsets based on the values of the independent variables, creating a tree-like structure where each leaf represents a prediction.
Once the models are trained, we calculate the Mean Squared Error (MSE) to evaluate the accuracy of the predictions. The MSE is the average squared difference between the predicted and actual values, with a lower MSE indicating better performance.
Finally, we visualize the predictions of both models against the actual sales data using a scatter plot. This allows us to visually assess how well each model performs in predicting sales. The diagonal line on the plot represents perfect predictions; any points close to this line are predicted accurately by the model.
By comparing the two models on the same dataset, we can determine which algorithm might be more suitable for sales prediction tasks, giving insight into the performance differences between linear and tree-based approaches. Since the data here is generated from a linear formula, Linear Regression has a built-in advantage in this particular comparison.
【Trivia】
Did you know that Decision Trees can easily overfit to noisy data if not
properly tuned? This is why pruning methods and hyperparameter tuning
are essential for ensuring that they generalize well to new data.
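The overfitting tendency mentioned above can be observed by comparing training and test error for an unrestricted tree and a depth-limited one. The following is a minimal sketch, assuming a hypothetical dataset built from the same sales formula with 200 samples; the `results` dictionary and the choice of max_depth=3 are illustrative. An unrestricted tree typically reproduces the training data almost exactly, so its training MSE is close to zero while its test MSE is not.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(100, 500, 200),   # Advertising
                     rng.uniform(10, 200, 200),    # SocialMedia
                     rng.uniform(1, 5, 200)])      # ProductRating
y = 50 + 0.3 * X[:, 0] + 0.5 * X[:, 1] + 10 * X[:, 2] + rng.normal(0, 25, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    results[depth] = (train_mse, test_mse)
    print(f"max_depth={depth}: train MSE={train_mse:.1f}, test MSE={test_mse:.1f}")
```

Other parameters such as min_samples_leaf, or cost-complexity pruning via ccp_alpha, offer further ways to control tree size.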
41. Evaluating Clustering Algorithm Performance on Customer Segmentation
Importance★★★★☆
Difficulty★★★☆☆
You are a data analyst at a retail company, tasked with segmenting customers based on their purchasing patterns.
The company wants to understand how different clustering algorithms perform on a dataset containing customer purchases. Your job is to test two clustering algorithms: K-Means and DBSCAN, and visually compare their results using a scatter plot. The dataset will contain two features: 'Total Purchases' and 'Number of Transactions.'
Generate synthetic data, apply both clustering methods, and plot the results in two separate scatter plots. Ensure that the plots clearly label the clusters and display different colors for different groups.
【Data Generation Code Example】
import numpy as np

# Generate random customer purchase data
np.random.seed(42)
X = np.vstack([np.random.normal(loc, scale, size=(100, 2))
               for loc, scale in [([50, 200], 15), ([100, 500], 30), ([200, 700], 20)]])
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN

# Generate random customer purchase data
np.random.seed(42)
X = np.vstack([np.random.normal(loc, scale, size=(100, 2))
               for loc, scale in [([50, 200], 15), ([100, 500], 30), ([200, 700], 20)]])

# Apply K-Means Clustering
kmeans = KMeans(n_clusters=3)
kmeans_labels = kmeans.fit_predict(X)

# Apply DBSCAN Clustering
dbscan = DBSCAN(eps=50, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# Create scatter plot for K-Means
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
plt.title('K-Means Clustering')
plt.xlabel('Total Purchases')
plt.ylabel('Number of Transactions')

# Create scatter plot for DBSCAN
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='plasma')
plt.title('DBSCAN Clustering')
plt.xlabel('Total Purchases')
plt.ylabel('Number of Transactions')

# Show both plots
plt.tight_layout()
plt.show()


This task involves evaluating two clustering algorithms, K-Means and DBSCAN, using Python and visualizing the results to compare the performance of these algorithms. Clustering is an unsupervised machine learning technique used to group data points based on their similarities. Here, you are clustering customer purchase data based on two features: 'Total Purchases' and 'Number of Transactions.'
The synthetic data is generated using numpy by creating three groups of customers with different average purchases and transaction counts. Each group has a normal distribution with specific mean values and standard deviations, simulating real-world variations.
Once the data is generated, we apply two clustering algorithms:
‣ K-Means: This algorithm aims to divide the data into k clusters by minimizing the within-cluster variance. We specify three clusters based on the synthetic dataset's structure. K-Means works well when clusters are spherical and evenly sized.
‣ DBSCAN: This density-based clustering algorithm identifies core samples surrounded by neighbors and forms clusters based on data density. It is effective for datasets with varying shapes or noise. The eps parameter defines the radius for searching neighboring points, and min_samples sets the minimum number of points within that radius for a point to count as a core sample. Points that belong to no cluster are labeled -1 (noise).
The code creates two scatter plots using matplotlib. One plot visualizes the results of K-Means clustering, and the other visualizes the results of DBSCAN. Different colors represent different clusters, making it easy to visually assess the performance of both algorithms on the same dataset.
K-Means tends to perform well when the data is relatively clean and separable into circular clusters, while DBSCAN can handle noise and irregularly shaped clusters more effectively.
【Trivia】
K-Means is one of the most popular clustering algorithms and is widely
used due to its simplicity and efficiency. However, it struggles with non-
circular cluster shapes and outliers. DBSCAN, on the other hand, excels in
detecting arbitrary-shaped clusters and is robust to noise, but its
performance can be sensitive to the choice of eps and min_samples values.
42. Visualizing Decision Thresholds in Binary
Classification
Importance★★★★☆
Difficulty★★★☆☆
A company is working on a credit scoring system that predicts whether a
customer will default on their loan or not.
You are tasked with determining the impact of changing decision thresholds
on this model's predictions.
You have been provided with some sample data, which simulates whether a
customer has defaulted based on specific financial characteristics.
Your goal is to train a logistic regression classifier, adjust the decision
thresholds, and visualize how changes in the threshold affect the
predictions for loan default.
Please generate a plot showing how the true positive rate (recall) and
false positive rate vary with different thresholds.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20,
n_informative=2, n_redundant=10, random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_curve

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20,
n_informative=2, n_redundant=10, random_state=42)

model = LogisticRegression()
model.fit(X, y)

y_scores = model.predict_proba(X)[:, 1]  # Probability estimates for the positive class

# Calculate false positive and true positive rates for each threshold
fpr, tpr, thresholds = roc_curve(y, y_scores)

# Skip the first entry: it is a placeholder threshold above every score
plt.plot(thresholds[1:], tpr[1:], label="True Positive Rate (Recall)")

plt.plot(thresholds[1:], fpr[1:], label="False Positive Rate")

plt.xlabel("Decision Threshold")

plt.ylabel("Rate")

plt.title("Impact of Decision Threshold on TPR and FPR")

plt.legend(loc="best")
plt.grid(True)

plt.show()

This problem analyzes the impact of the decision threshold in binary
classification using a logistic regression model.
The logistic regression model estimates the probability that a given input
belongs to the positive class (in this case, that the customer will
default).
By default, the model's predict method uses a threshold of 0.5 to turn
these probabilities into class labels.
If the probability exceeds 0.5, the model classifies the observation as
positive (default); otherwise it classifies it as negative.
Adjusting the decision threshold changes the model's behavior.
A lower threshold increases the number of positive predictions, which
raises recall but also the false positive rate, while a higher threshold
reduces false positives at the cost of missing more true positives.
In the solution, the roc_curve function (ROC stands for Receiver Operating
Characteristic) calculates the true positive rate (recall) and false
positive rate for a range of thresholds.
The plot shows both rates against the decision threshold, which makes the
trade-off between detecting true positives and avoiding false positives
visible for any threshold you might set.
Note that the rates here are computed on the same data the model was
trained on, so they are optimistic; a held-out test set gives a more
realistic picture.
This visualization supports decision-making when optimizing a model for
specific business objectives, such as balancing false positives and false
negatives in a credit risk model.
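As a rough sketch of what adjusting the threshold means in code, the predicted probabilities can be compared against a cut-off by hand instead of calling predict. The thresholds 0.3, 0.5, and 0.7, the helper name rates_at, and the max_iter setting are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=2, n_redundant=10, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)
y_scores = model.predict_proba(X)[:, 1]  # probability of the positive class


def rates_at(threshold):
    """Return (TPR, FPR) when predicting positive for scores >= threshold."""
    y_pred = (y_scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical thresholds chosen only to show the trend
results = {t: rates_at(t) for t in (0.3, 0.5, 0.7)}
for t, (tpr, fpr) in results.items():
    print(f"threshold={t:.1f}  TPR={tpr:.3f}  FPR={fpr:.3f}")
```

Both rates can only stay equal or fall as the threshold rises, because raising it never turns a negative prediction into a positive one.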
【Trivia】
The ROC curve was originally developed during World War II to analyze
radar signal detection but has since been widely adopted in machine
learning to assess classifier performance.
43. Visualizing Feature Importances in Decision
Tree Models
Importance★★★★★
Difficulty★★★☆☆
You are working as a data scientist for a retail company that wants to
understand which features most influence their sales.
The company has gathered data on various product characteristics, pricing
strategies, and advertising methods.
Your task is to build a decision tree regressor that predicts sales based on
these features and visualize the feature importances to help the company
make data-driven decisions.
Use the provided code to generate the sample data for this task. Once the
model is trained, visualize the feature importances using a bar chart.
【Data Generation Code Example】
import numpy as np

import pandas as pd

from sklearn.datasets import make_regression


## Create sample data with 5 features and 100 samples

X, y = make_regression(n_samples=100, n_features=5, noise=0.1,
                       random_state=42)

feature_names = ['Feature_A', 'Feature_B', 'Feature_C', 'Feature_D',
                 'Feature_E']

df = pd.DataFrame(X, columns=feature_names)

df['Sales'] = y
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeRegressor

from sklearn.datasets import make_regression

import matplotlib.pyplot as plt

## Create sample data with 5 features and 100 samples

X, y = make_regression(n_samples=100, n_features=5, noise=0.1,
                       random_state=42)

feature_names = ['Feature_A', 'Feature_B', 'Feature_C', 'Feature_D',
                 'Feature_E']

# Initialize and train the decision tree model

model = DecisionTreeRegressor(random_state=42)

model.fit(X, y)

## Get feature importances from the trained model

importances = model.feature_importances_

## Create a bar chart to visualize the feature importances

plt.barh(feature_names, importances)

plt.xlabel('Importance')

plt.ylabel('Features')

plt.title('Feature Importances in Decision Tree Model')

plt.show()

In this exercise, we use a decision tree regressor to predict sales based
on five product-related features.
First, we generate the data using make_regression(), which simulates
regression data. The resulting dataset has 100 samples and 5 features,
which we label Feature_A to Feature_E.
The decision tree algorithm is well-suited for identifying important
features in datasets because it makes splits based on the features that
reduce prediction error the most.
Once the model is trained, we can extract the feature importances.
The feature_importances_ attribute of the trained decision tree gives the
normalized total reduction in the split criterion (here, squared error)
attributed to each feature, so the values sum to 1.
These importances are then visualized using a horizontal bar chart, which
shows which features are most influential in predicting sales.
Visualizing feature importances helps decision-makers understand which
factors drive sales, enabling them to focus on optimizing those aspects.
In this case, the bar chart can guide future business decisions regarding
product development, pricing strategies, or marketing efforts.
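A minimal sketch of inspecting the importances numerically before plotting them is shown here. It reuses the data from this exercise; the ranking step with np.argsort is an addition for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=0.1,
                       random_state=42)
feature_names = ['Feature_A', 'Feature_B', 'Feature_C', 'Feature_D',
                 'Feature_E']

model = DecisionTreeRegressor(random_state=42).fit(X, y)
importances = model.feature_importances_

# The importances are normalized, so they add up to 1
total = importances.sum()

# Rank features from most to least important
order = np.argsort(importances)[::-1]
ranked = [(feature_names[i], float(importances[i])) for i in order]
for name, value in ranked:
    print(f"{name}: {value:.3f}")
print(f"Sum of importances: {total:.3f}")
```

Passing the ranked names and values to plt.barh would give a bar chart sorted by importance, which is often easier to read than the original feature order.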
【Trivia】
Did you know that decision trees are considered white-box models because
their decision-making process is easy to interpret?
This is in contrast to models like neural networks, which are often referred
to as black-box models because their predictions are hard to explain!
44. Visualizing Confusion Matrices for Multi-Class
Classification Problems
Importance★★★★★
Difficulty★★★☆☆
A client is using a machine learning model to classify images of fruits into
three categories: apples, bananas, and oranges.
The client is not only interested in evaluating the accuracy of the model
but also wants to understand which categories are often confused with each
other.
Your task is to simulate a simple multi-class classification problem, train
a model, and then generate a confusion matrix to visualize these
misclassifications.
You need to create sample data, train a model, and output a confusion
matrix to help the client understand the performance of their model.
Use the following steps:
‣ Generate synthetic data with three classes (apples, bananas, oranges).
‣ Train a simple machine learning classifier (e.g., Logistic Regression).
‣ Display the confusion matrix for the model's predictions.
【Data Generation Code Example】
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.datasets import make_classification

# Create synthetic data for a multi-class classification problem

X, y = make_classification(n_samples=300, n_features=4, n_classes=3,
                           n_clusters_per_class=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


【Diagram Answer】

【Code Answer】
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.datasets import make_classification

from sklearn.linear_model import LogisticRegression


from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

import seaborn as sns

# Create synthetic data for a multi-class classification problem

X, y = make_classification(n_samples=300, n_features=4, n_classes=3,
                           n_clusters_per_class=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train a Logistic Regression classifier

model = LogisticRegression()

model.fit(X_train, y_train)

# Predict on the test set


y_pred = model.predict(X_test)

# Generate confusion matrix

cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix

plt.figure(figsize=(6,6))

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Apple", "Banana", "Orange"],
            yticklabels=["Apple", "Banana", "Orange"])

plt.title("Confusion Matrix for Fruit Classification")

plt.xlabel("Predicted Labels")

plt.ylabel("True Labels")

plt.show()
This exercise is centered around visualizing a confusion matrix, which is a
powerful tool for evaluating multi-class classification models.
We first generate synthetic data that mimics a classification problem with
three classes (apples, bananas, and oranges) using the make_classification
function.
The data is split into training and testing sets to evaluate the
performance of the model on unseen data.
We use a logistic regression model, which is suitable for simple
classification tasks and easy to train.
Once the model is trained, it is used to make predictions on the test set.
The confusion matrix is generated using the confusion_matrix function from
sklearn.metrics.
The confusion matrix shows how often each class (apple, banana, or orange)
is correctly or incorrectly classified.
Each row of the matrix represents the true label, and each column
represents the predicted label.
Finally, we use the seaborn library's heatmap function to display the
confusion matrix.
The heatmap highlights where misclassifications occur, making it easy to
identify which categories are most frequently confused.
For example, if bananas are often classified as oranges, the cell in the
banana row and orange column will have a high value.
The x-axis shows the predicted labels, and the y-axis shows the true labels.
The confusion matrix helps reveal the model's weaknesses, guiding further
improvements in model training, feature engineering, or data collection.
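Raw counts can be hard to compare when the classes have different sizes, so one common variation is to normalize each row of the matrix. The sketch below assumes a fixed random_state, a stratified split, and a larger max_iter, none of which appear in the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# random_state and stratify are assumptions added for reproducibility
X, y = make_classification(n_samples=300, n_features=4, n_classes=3,
                           n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Raw counts: rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)

# Row-normalized version: each row sums to 1, the diagonal is per-class recall
cm_normalized = confusion_matrix(y_test, y_pred, normalize="true")
per_class_recall = np.diag(cm_normalized)

for name, recall in zip(["Apple", "Banana", "Orange"], per_class_recall):
    print(f"{name}: recall = {recall:.2f}")
```

The normalized matrix can be passed to sns.heatmap in the same way as the raw one, with fmt=".2f" instead of fmt="d".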
【Trivia】
The confusion matrix was originally developed for binary classification but
has been widely extended to multi-class classification tasks.
Its name comes from its ability to capture where a model is "confused"
about the true labels, providing deeper insights than simple accuracy
scores.
45. Visualizing the Distribution of Customer
Product Preferences
Importance★★★★☆
Difficulty★★☆☆☆
A retail company wants to analyze customer preferences across different
product categories.
Your task is to visualize the distribution of customers who prefer each
product category.
You need to generate synthetic data that simulates customer preferences for
three product categories:
‣ "Electronics"
‣ "Clothing"
‣ "Home and Garden"
The company is looking for an effective way to understand the percentage of
customers interested in each category to optimize its marketing strategy.
Please generate a bar plot showing the distribution of these preferences.
【Data Generation Code Example】
import numpy as np

import pandas as pd

# Generate 1000 customers

customers = 1000

# Randomly assign customers to one of the three categories

categories = np.random.choice(['Electronics', 'Clothing', 'Home and Garden'],
                              customers, p=[0.4, 0.35, 0.25])

# Create a DataFrame with the generated data

df = pd.DataFrame({'Category': categories})
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt


# Generate 1000 customers

customers = 1000

categories = np.random.choice(['Electronics', 'Clothing', 'Home and Garden'],
                              customers, p=[0.4, 0.35, 0.25])

# Create a DataFrame with the generated data

df = pd.DataFrame({'Category': categories})

# Count the occurrences of each category

category_counts = df['Category'].value_counts()

# Plot the distribution as a bar chart

plt.figure(figsize=(8,6))

plt.bar(category_counts.index, category_counts.values,
        color=['blue', 'green', 'orange'])

plt.title('Customer Product Category Preferences')

plt.xlabel('Product Category')
plt.ylabel('Number of Customers')

plt.show()

In this exercise, we aim to visualize the distribution of customer
preferences across three product categories.
The goal is to provide insights into how customers are distributed among
the "Electronics," "Clothing," and "Home and Garden" categories.
This is valuable for any company that wants to understand the demand for
different product lines and adjust its marketing strategy accordingly.
The problem is solved by generating synthetic data that mimics customer
preferences.
Here, we randomly assign customers to one of the three product categories
based on pre-defined probabilities using np.random.choice.
The generated data is then stored in a pandas DataFrame.
Using pandas, we count how many customers fall into each category with the
value_counts() method.
Finally, we use matplotlib to generate a bar plot of the category counts,
allowing us to visualize the distribution clearly.
This exercise emphasizes basic data generation, counting occurrences in a
dataset, and visualizing the results.
These are common tasks in machine learning workflows where understanding
data distribution is crucial.
Visualizing class distributions is especially useful for detecting
imbalances in datasets, which can influence the performance of machine
learning models.
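Since the company asked about percentages rather than raw counts, a small variation is to let value_counts return relative frequencies. This sketch assumes a seeded NumPy generator for reproducibility, which the original code does not use.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption so the sketch gives repeatable numbers
rng = np.random.default_rng(42)
categories = rng.choice(['Electronics', 'Clothing', 'Home and Garden'],
                        size=1000, p=[0.4, 0.35, 0.25])
df = pd.DataFrame({'Category': categories})

# normalize=True returns fractions instead of counts
percentages = df['Category'].value_counts(normalize=True) * 100
print(percentages.round(1))
```

The resulting values should land close to the 40/35/25 split used to generate the data, and they can be plotted with plt.bar exactly like the counts.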
【Trivia】
In machine learning, imbalanced datasets can lead to biased models where
the predictions favor the majority class.
This makes visualizing the distribution of classes an essential step before
training models, especially in classification tasks!
46. Visualizing Steps in a Machine Learning Data
Pipeline
Importance★★★★★
Difficulty★★★☆☆
A retail company is analyzing customer behavior based on their purchasing
history to predict future product demand.
You are tasked with creating a machine learning data pipeline that includes
basic steps like data preprocessing, feature scaling, and training a linear
regression model to predict total monthly sales.
Finally, your task is to visualize how the features look before and after
preprocessing and scaling.
Your pipeline must include the following steps:
‣ Create a dataset with random sales data (e.g., product_price, units_sold,
and store_id).
‣ Preprocess the data by normalizing it.
‣ Fit a linear regression model to predict the total sales (product_price *
units_sold).
‣ Visualize the original vs. preprocessed data using scatter plots.
【Data Generation Code Example】
import numpy as np

import pandas as pd

# Create random sample data

np.random.seed(42)

data = pd.DataFrame({
    'product_price': np.random.randint(5, 500, 100),
    'units_sold': np.random.randint(1, 100, 100),
    'store_id': np.random.randint(1, 10, 100)
})
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

from sklearn.preprocessing import MinMaxScaler


from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt

# Create random sample data

np.random.seed(42)

data = pd.DataFrame({
    'product_price': np.random.randint(5, 500, 100),
    'units_sold': np.random.randint(1, 100, 100),
    'store_id': np.random.randint(1, 10, 100)
})

##Feature scaling using MinMaxScaler

scaler = MinMaxScaler()

data_scaled = scaler.fit_transform(data[['product_price', 'units_sold']])

##Linear regression model

X = data_scaled

y = data['product_price'] * data['units_sold']

model = LinearRegression()

model.fit(X, y)

##Create scatter plots of original vs. scaled data

plt.figure(figsize=(10, 5))

##Plot original data


plt.subplot(1, 2, 1)

plt.scatter(data['product_price'], data['units_sold'], color='blue')

plt.title('Original Data')

plt.xlabel('Product Price')

plt.ylabel('Units Sold')

##Plot scaled data

plt.subplot(1, 2, 2)

plt.scatter(data_scaled[:, 0], data_scaled[:, 1], color='red')

plt.title('Scaled Data')
plt.xlabel('Product Price (Scaled)')

plt.ylabel('Units Sold (Scaled)')

##Display the plots

plt.tight_layout()

plt.show()

In this exercise, we created a simple data pipeline to predict total sales
based on product price and units sold.
We used the numpy library to generate random sales data, which includes
features like product_price, units_sold, and store_id.
To preprocess the data, we employed the MinMaxScaler from the
sklearn.preprocessing module.
This scaler transforms features by scaling each feature to a range between 0
and 1, making the data suitable for training a machine learning model.
Next, we used LinearRegression from sklearn.linear_model to fit a model
that predicts total sales (the product of product_price and units_sold).
Linear regression is one of the simplest machine learning models, useful for
predicting continuous outcomes.
Finally, we visualized both the original data and the scaled data using
matplotlib scatter plots.
Scatter plots help to understand the relationship between two numerical
variables. In this case, we compared the features before and after
preprocessing.
This type of visualization can highlight the effects of scaling, as it
standardizes the range of features without distorting their relationships.
Through this exercise, you learned about key machine learning pipeline
steps like preprocessing, model training, and feature scaling, as well as how
to visualize data before and after transformations.
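The scaling and model-fitting steps can also be chained into a single object. The sketch below uses scikit-learn's make_pipeline as one possible way to do this; it is an alternative arrangement of the same steps, not the code from the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

np.random.seed(42)
data = pd.DataFrame({
    'product_price': np.random.randint(5, 500, 100),
    'units_sold': np.random.randint(1, 100, 100),
    'store_id': np.random.randint(1, 10, 100)
})

X = data[['product_price', 'units_sold']]
y = data['product_price'] * data['units_sold']

# Chain scaling and regression so both are fitted with a single call
pipeline = make_pipeline(MinMaxScaler(), LinearRegression())
pipeline.fit(X, y)

# The fitted scaler is reachable by its auto-generated step name
X_scaled = pipeline.named_steps['minmaxscaler'].transform(X)
predictions = pipeline.predict(X)

print("Scaled feature minimums:", X_scaled.min(axis=0))
print("Scaled feature maximums:", X_scaled.max(axis=0))
print("R^2 on the training data:", round(pipeline.score(X, y), 3))
```

Keeping the scaler inside the pipeline means new data passed to predict is scaled with the parameters learned during fit, which avoids applying the scaler inconsistently by hand.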
【Trivia】
Did you know that scaling features before feeding them into machine
learning algorithms is crucial for many models?
Algorithms like K-Nearest Neighbors and Support Vector Machines are
sensitive to feature magnitude, and incorrect scaling can lead to poor model
performance.
47. Comparing Model Performance Metrics: A
Practical Approach
Importance★★★★★
Difficulty★★★☆☆
A company wants to classify whether an email is spam or not based on a
series of features such as the frequency of certain words.
They have two machine learning models to solve this classification problem:
Logistic Regression and Random Forest.
Your task is to compare the performance of both models using accuracy,
precision, recall, and F1 score on the same dataset.
Generate synthetic data that represents the email classification problem,
train both models, and visualize their performance in a single bar plot for
easier comparison.
Ensure that your code generates this plot.
【Data Generation Code Example】
import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split


from sklearn.datasets import make_classification

# Generate synthetic classification data for the email classification problem

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                           random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from sklearn.datasets import make_classification

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

import matplotlib.pyplot as plt

# Generate synthetic classification data

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                           random_state=42)

# Split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

## Initialize the models

lr = LogisticRegression()

rf = RandomForestClassifier()

## Train the models

lr.fit(X_train, y_train)

rf.fit(X_train, y_train)

## Make predictions

y_pred_lr = lr.predict(X_test)

y_pred_rf = rf.predict(X_test)

## Calculate metrics for both models

metrics = {
    'Accuracy': [accuracy_score(y_test, y_pred_lr),
                 accuracy_score(y_test, y_pred_rf)],
    'Precision': [precision_score(y_test, y_pred_lr),
                  precision_score(y_test, y_pred_rf)],
    'Recall': [recall_score(y_test, y_pred_lr),
               recall_score(y_test, y_pred_rf)],
    'F1 Score': [f1_score(y_test, y_pred_lr),
                 f1_score(y_test, y_pred_rf)]
}

# Convert metrics into a DataFrame

metrics_df = pd.DataFrame(metrics, index=['Logistic Regression',
                                          'Random Forest'])

## Plot the metrics comparison

metrics_df.plot(kind='bar', figsize=(10,6))

plt.title('Model Performance Comparison')

plt.ylabel('Score')

plt.xticks(rotation=0)

plt.legend(loc='lower right')

plt.show()

In this exercise, we compare the performance of two classification models,
Logistic Regression and Random Forest, using several evaluation metrics:
accuracy, precision, recall, and F1 score.
The goal of this task is to visualize the comparative performance of these
models using synthetic data.
First, we use the make_classification function to generate synthetic data.
The dataset contains 1,000 samples and 20 features, simulating an email
classification problem where the task is to predict if an email is spam or not
(binary classification).
After generating the data, we split it into training and test sets using the
train_test_split method.
This ensures that both models are evaluated on unseen data.
Next, we initialize two machine learning models, Logistic Regression and
Random Forest, and train them using the training data.
Once the models are trained, we make predictions on the test set for both
models.
We then compute the performance metrics for each model. These metrics are
critical in evaluating the models' effectiveness, as they give insight into
different aspects of performance:
‣ Accuracy: Measures how many predictions were correct overall.
‣ Precision: Measures how many predicted positives were actual positives.
‣ Recall: Measures how many actual positives were correctly predicted.
‣ F1 Score: The harmonic mean of precision and recall, balancing the two.
Finally, we visualize the performance of both models using a bar plot. This
allows us to easily compare the scores of each metric side by side.
The bar chart is created with the pandas plot method, which draws on
matplotlib, and groups the four metrics for each model. The legend, axis
label, and title make it clear which model performs better on each metric.
By doing this exercise, you learn how to evaluate different machine
learning models and compare their performance using standard metrics in
classification problems.
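To make the four definitions above concrete, the metrics can be computed by hand from the confusion matrix counts and checked against scikit-learn. The eight labels below are hypothetical values made up for this illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = spam, 0 = not spam
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Manual: ", accuracy, precision, recall, f1)
print("sklearn:", accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```

Both lines should print the same four numbers, which confirms that the library functions follow the definitions given in the list.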
【Trivia】
Did you know that Random Forest classifiers, because of their ensemble
nature, often outperform simpler models like Logistic Regression in
classification tasks where the data is complex and nonlinear?
However, Logistic Regression can still be highly effective in cases with
simple linear relationships and tends to be more interpretable. Both models
have their advantages depending on the problem!
48. Visualizing the Effect of Noise on Model
Performance in Machine Learning
Importance★★★★☆
Difficulty★★★☆☆
A retail company is trying to build a machine learning model that predicts
customer satisfaction based on various features such as purchase amount,
number of visits, and customer ratings. However, their data has some noisy
features (random, irrelevant, or corrupted data points), which may impact
the model's performance. Your task is to demonstrate how noise in the data
affects the accuracy of the model.
Generate synthetic data representing customer satisfaction, where one
feature has been intentionally corrupted with noise. Then, train a machine
learning model to predict satisfaction and visualize how the model's
performance changes as the level of noise increases.
‣ Plot the model's accuracy or performance metric against varying noise
levels.
‣ The focus is on understanding the relationship between noise in data and
the model's ability to make accurate predictions.

【Data Generation Code Example】


import numpy as np
from sklearn.model_selection import train_test_split

from sklearn.datasets import make_regression

# Create synthetic data with slight noise

X, y = make_regression(n_samples=1000, n_features=2, noise=0.1,
                       random_state=42)

# Add noise to one feature
X[:, 1] = X[:, 1] + np.random.normal(0, 0.5, size=X[:, 1].shape)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.datasets import make_regression

from sklearn.metrics import mean_squared_error

# Generate synthetic data

X, y = make_regression(n_samples=1000, n_features=2, noise=0.1,
                       random_state=42)

# Define a function to add increasing noise

def add_noise(X, noise_level):
    return X + np.random.normal(0, noise_level, X.shape)

# Lists to store results

noise_levels = np.linspace(0, 2, 10)

mse_values = []

# Train the model at different noise levels and calculate MSE

for noise_level in noise_levels:
    X_noisy = add_noise(X, noise_level)
    X_train, X_test, y_train, y_test = train_test_split(
        X_noisy, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_values.append(mse)

# Plot the results

plt.figure()

plt.plot(noise_levels, mse_values, marker='o')

plt.title('Model Performance vs Noise Level')


plt.xlabel('Noise Level')

plt.ylabel('Mean Squared Error')

plt.grid(True)

plt.show()

In this exercise, the goal is to demonstrate the impact of noise on a
machine learning model's performance.
We generate synthetic data using the make_regression function, which
creates a dataset suitable for linear regression tasks.
This dataset contains two features. The data generation example adds noise
to one of them, while the answer code adds noise of increasing strength to
both, simulating the effects of corrupted data.
Noise can be thought of as random variations or inaccuracies that make it
harder for the model to learn meaningful patterns.
A linear regression model is trained multiple times with increasing noise
levels applied to the input data.
Noise is added using the np.random.normal function, which generates
random values from a normal (Gaussian) distribution.
As the noise level increases, the model's performance is measured by
calculating the Mean Squared Error (MSE) on the test data.
MSE is a commonly used metric to evaluate regression models,
representing the average squared difference between the actual and
predicted values.
The relationship between noise levels and model performance is then
visualized by plotting the MSE values against the noise levels.
This visualization highlights how increasing noise degrades the model’s
ability to make accurate predictions, making it an important concept when
dealing with noisy or unreliable datasets in machine learning.
Understanding how noise affects model performance can help data
scientists identify when they need to clean or preprocess data to improve
model accuracy.
This can be done by removing outliers, using feature engineering
techniques, or applying noise reduction methods like principal component
analysis (PCA) before training models.
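The task statement describes corrupting a single feature, so a variation that does exactly that is sketched here. The seeded generator, the choice of the second feature, and the five noise levels are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=2, noise=0.1,
                       random_state=42)
rng = np.random.default_rng(0)  # seeded so repeated runs give the same curve

noise_levels = np.linspace(0, 2, 5)
mse_values = []
for noise_level in noise_levels:
    X_noisy = X.copy()
    # Corrupt only the second feature; the first stays clean
    X_noisy[:, 1] += rng.normal(0, noise_level, size=X.shape[0])
    X_train, X_test, y_train, y_test = train_test_split(
        X_noisy, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    mse_values.append(mean_squared_error(y_test, model.predict(X_test)))

for level, mse in zip(noise_levels, mse_values):
    print(f"noise std {level:.1f}: test MSE {mse:.2f}")
```

Copying X inside the loop keeps each noise level independent, so the noise does not accumulate from one iteration to the next.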
【Trivia】
In real-world scenarios, noise can come from various sources, such as
sensor malfunctions, human errors in data entry, or random fluctuations in
measurements. Machine learning models are often highly sensitive to noise,
so understanding and managing noise is crucial for building reliable
predictive models.
49. Creating a Classifier Using Ensemble Methods
with Visualization
Importance★★★★☆
Difficulty★★★☆☆
A clothing company wants to predict whether a customer will buy a certain
product based on features like age, income, and previous purchase behavior.
To address this, you are tasked with creating a classifier using ensemble
methods (RandomForestClassifier and GradientBoostingClassifier) to improve
prediction accuracy. After training the models, visualize the feature
importance from both classifiers on the same plot to compare which features
are the most important for each model.
Create synthetic data that includes customer age, income, number of
previous purchases, and whether they purchased a product or not (binary
output).
Your task is to:
‣ Train both classifiers using the synthetic data.
‣ Visualize the feature importance of both classifiers on the same plot.
【Data Generation Code Example】
import numpy as np

import pandas as pd

np.random.seed(42)

age = np.random.randint(18, 70, 1000)

income = np.random.randint(20000, 120000, 1000)

prev_purchases = np.random.randint(0, 20, 1000)

purchase = np.random.choice([0, 1], 1000)

data = pd.DataFrame({'Age': age, 'Income': income,
                     'PreviousPurchases': prev_purchases,
                     'Purchased': purchase})
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

np.random.seed(42)

age = np.random.randint(18, 70, 1000)

income = np.random.randint(20000, 120000, 1000)

prev_purchases = np.random.randint(0, 20, 1000)

purchase = np.random.choice([0, 1], 1000)

data = pd.DataFrame({'Age': age, 'Income': income,
                     'PreviousPurchases': prev_purchases,
                     'Purchased': purchase})

X = data[['Age', 'Income', 'PreviousPurchases']]

y = data['Purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)

clf_rf = RandomForestClassifier()

clf_rf.fit(X_train_scaled, y_train)

clf_gb = GradientBoostingClassifier()

clf_gb.fit(X_train_scaled, y_train)

importances_rf = clf_rf.feature_importances_

importances_gb = clf_gb.feature_importances_

features = X.columns

plt.figure(figsize=(10, 5))

plt.barh(features, importances_rf, color='blue', alpha=0.5,
         label='Random Forest')

plt.barh(features, importances_gb, color='red', alpha=0.5,
         label='Gradient Boosting')

plt.xlabel('Importance')

plt.title('Feature Importance: Random Forest vs Gradient Boosting')

plt.legend()

plt.show()

In this exercise, you are asked to train two ensemble classifiers:
RandomForestClassifier and GradientBoostingClassifier. Ensemble methods
are powerful because they combine the predictions of multiple models to
improve accuracy and robustness.
First, we create synthetic data that mimics real customer data. The data
consists of features such as age, income, and number of previous purchases,
plus the target variable 'Purchased', which records whether the customer
made a purchase.
Next, we split the data into training and testing sets using
train_test_split, which is essential for evaluating model performance on
unseen data. We also standardize the features with StandardScaler.
Tree-based ensembles such as random forests and gradient boosting split on
feature thresholds and are largely insensitive to feature scaling, so this
step is optional here; it matters mainly when the same preprocessing also
feeds scale-sensitive models.
We then train the two classifiers. Random forests use a collection of
decision trees, each trained on random subsets of the data and features,
while gradient boosting trains trees sequentially, with each tree correcting
the mistakes of the previous ones.
Once both classifiers are trained, we read their feature importance values
from the feature_importances_ attribute. For these tree ensembles the values
are impurity-based: they measure how much the splits on a feature reduce
impurity across the trees, which gives insight into which factors are most
influential in predicting whether a customer will make a purchase.
Finally, we visualize the feature importance from both classifiers using a
horizontal bar chart. In this chart, the feature importance from
RandomForestClassifier is shown in blue, while that from
GradientBoostingClassifier is shown in red. This allows us to easily compare
how the two classifiers prioritize different features.
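The overlaid bars can be hard to read when two values are close, so a small table is sometimes a useful companion to the chart. The following is a minimal sketch, assuming stand-in data from make_classification and illustrative feature names (neither comes from the exercise's own dataset); it places both sets of importances side by side and sorts them.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Hypothetical stand-in data: four numeric features with illustrative names.
feature_names = ["age", "income", "previous_purchases", "site_visits"]
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
gb = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X, y)

# One row per feature, one column per model, sorted by the forest's ranking.
importance_table = pd.DataFrame(
    {"random_forest": rf.feature_importances_,
     "gradient_boosting": gb.feature_importances_},
    index=feature_names,
).sort_values("random_forest", ascending=False)

print(importance_table.round(3))
```

Each column sums to 1, so the values are best read as relative shares within one model rather than as quantities that are directly comparable across models.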
【Trivia】
Random forests are known for being less prone to overfitting than
individual decision trees because they combine the outputs of many trees,
averaging out their predictions.
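A quick way to see this effect is to compare training and test accuracy for a single unpruned tree and for a forest. The sketch below assumes synthetic data with some label noise (flip_y=0.1 is an arbitrary choice for illustration); the exact numbers will vary with the data and are not a general benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data: 10% of the labels are flipped at random.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Store (train accuracy, test accuracy) for each model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train),
                    model.score(X_test, y_test))
    print(f"{name:>13}: train={scores[name][0]:.3f} "
          f"test={scores[name][1]:.3f}")
```

An unpruned tree memorizes the training set, noise included, so the gap between its training and test accuracy is the quantity to watch when comparing it with the forest.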
50. Visualizing Data Generation for Binary
Classification Problems
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to improve its customer satisfaction by predicting
whether a customer will make a purchase based on past behavior. You are
asked to create a simple binary classification model using synthetic data to
simulate customer behavior.
Generate a dataset with two features representing customer activity (e.g.,
website clicks, time spent on page), and classify whether the customer made
a purchase. Visualize the data in a scatter plot, separating the two classes
with distinct colors. Create a simple logistic regression model to predict
the outcome.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_classification

## Create synthetic binary classification data


X, y = make_classification(n_samples=200, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

from sklearn.datasets import make_classification

from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt


## Create synthetic binary classification data

X, y = make_classification(n_samples=200, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)
## Fit logistic regression model

clf = LogisticRegression()

clf.fit(X, y)

## Create a scatter plot of the data

plt.figure(figsize=(8,6))

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k')

plt.title('Customer Purchase Prediction')

plt.xlabel('Feature 1')

plt.ylabel('Feature 2')

plt.show()

In this problem, we generate a synthetic dataset using make_classification,
which creates features (X) and corresponding binary labels (y). This dataset
simulates customer behavior with two features representing their activities,
and a binary outcome representing whether they made a purchase.
Next, we fit a LogisticRegression model using scikit-learn, a simple yet
effective classification algorithm that predicts the probability of a binary
outcome. The model is trained on the synthetic data using the fit() method.
To visualize the data, we use matplotlib to create a scatter plot. Each
point represents a customer, and the color of the point corresponds to its
class label (0 or 1). The two features are plotted along the x and y axes.
The scatter plot provides a visual representation of the two classes,
helping us understand the data distribution and how separable the two groups
are. Logistic regression draws a linear decision boundary between the two
classes, though the fitted model is not used in the plot and the boundary is
not shown here.
This exercise demonstrates the basics of data generation for classification
problems, model fitting, and visualizing results, which are crucial skills
in machine learning for practical applications like customer behavior
prediction.
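Since the decision boundary is mentioned but not drawn, here is a minimal sketch of how it could be recovered from the fitted coefficients. It assumes the same make_classification settings as the exercise; the variable names x1_line, x2_line, and boundary_points are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=42)
clf = LogisticRegression().fit(X, y)

# The boundary is the set of points where w1*x1 + w2*x2 + b = 0,
# which rearranges to x2 = -(w1*x1 + b) / w2 (assuming w2 is nonzero).
(w1, w2), b = clf.coef_[0], clf.intercept_[0]
x1_line = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
x2_line = -(w1 * x1_line + b) / w2
boundary_points = np.column_stack([x1_line, x2_line])

# Points on the boundary should receive a predicted probability of 0.5.
print(clf.predict_proba(boundary_points)[:3, 1])
```

To overlay the line on the scatter plot from the code answer, one option is to call plt.plot(x1_line, x2_line, 'k--') before plt.show().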
【Trivia】
Logistic regression is widely used for binary classification problems,
especially in fields like marketing, medical diagnostics, and fraud detection,
because it is interpretable and easy to implement.
51. Visualizing Time Series Model Performance
Using Python
Importance★★★★☆
Difficulty★★★☆☆
You are tasked with helping a retail client analyze and visualize the
performance of their sales forecast models.
The client wants to compare the predicted values generated by a simple
linear regression model and an ARIMA model with the actual sales data.
Generate synthetic time series data representing actual sales and forecasted
values for both models over 100 days.
Your job is to create this synthetic dataset and visualize the performance
of the two models by plotting the actual vs. predicted sales.
Ensure the plot includes both the actual sales and the predictions from both
models, along with a clear legend and labels.
【Data Generation Code Example】
import numpy as np

import pandas as pd

np.random.seed(42)

days = pd.date_range('2023-01-01', periods=100, freq='D')

actual_sales = np.random.normal(200, 20, 100).cumsum()

linear_pred = actual_sales + np.random.normal(0, 15, 100)

arima_pred = actual_sales + np.random.normal(0, 20, 100)

data = pd.DataFrame({'Date': days, 'Actual Sales': actual_sales,
                     'Linear Model Prediction': linear_pred,
                     'ARIMA Model Prediction': arima_pred})
【Diagram Answer】

【Code Answer】
import matplotlib.pyplot as plt
import pandas as pd

import numpy as np

np.random.seed(42)

days = pd.date_range('2023-01-01', periods=100, freq='D')

actual_sales = np.random.normal(200, 20, 100).cumsum()

linear_pred = actual_sales + np.random.normal(0, 15, 100)

arima_pred = actual_sales + np.random.normal(0, 20, 100)

data = pd.DataFrame({'Date': days, 'Actual Sales': actual_sales,
                     'Linear Model Prediction': linear_pred,
                     'ARIMA Model Prediction': arima_pred})

plt.figure(figsize=(10,6))

plt.plot(data['Date'], data['Actual Sales'], label='Actual Sales',
         color='blue', linewidth=2)

plt.plot(data['Date'], data['Linear Model Prediction'],
         label='Linear Model Prediction', color='green', linestyle='--',
         linewidth=2)

plt.plot(data['Date'], data['ARIMA Model Prediction'],
         label='ARIMA Model Prediction', color='red', linestyle=':',
         linewidth=2)

plt.title('Actual vs Predicted Sales Over Time')

plt.xlabel('Date')

plt.ylabel('Sales')

plt.legend()

plt.grid(True)

plt.xticks(rotation=45)
plt.tight_layout()

plt.show()

In this problem, the goal is to visualize the performance of two time series
forecasting models against actual data.
First, synthetic data is generated to represent 100 days of sales data, along
with predicted values for two models (Linear Regression and ARIMA).
The actual sales data is simulated by drawing daily values from a normal
distribution and cumulatively summing them, which creates a trend-like
behavior.
Neither model is actually fitted: each set of predictions is simulated by
adding Gaussian noise to the actual data, with a standard deviation of 15
for the linear regression model and 20 for the ARIMA model.
In the visualization, Matplotlib is used to plot the time series data, with each
line representing either the actual sales or predictions from the two models.
The actual sales data is plotted with a solid line, while predictions are
plotted with dashed or dotted lines to distinguish them visually.
Additional plot elements include the title, axis labels, a legend to identify
each line, and gridlines for better readability.
By examining the plot, we can easily compare the accuracy of each model
over time and identify periods where the models may have diverged from
actual sales trends.
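A visual comparison is easier to trust when it is backed by an error metric. The sketch below regenerates the same synthetic series and computes two common ones, mean absolute error (MAE) and root mean squared error (RMSE); the helper functions mae and rmse are written here for illustration and are not part of the exercise.

```python
import numpy as np

np.random.seed(42)
actual_sales = np.random.normal(200, 20, 100).cumsum()
linear_pred = actual_sales + np.random.normal(0, 15, 100)
arima_pred = actual_sales + np.random.normal(0, 20, 100)


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the forecast errors."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


predictions = {"Linear Model": linear_pred, "ARIMA Model": arima_pred}
errors = {name: (mae(actual_sales, pred), rmse(actual_sales, pred))
          for name, pred in predictions.items()}

for name, (m, r) in errors.items():
    print(f"{name}: MAE={m:.2f}, RMSE={r:.2f}")
```

Because the "predictions" are just the actual series plus noise, these metrics mostly recover the noise levels chosen in the data generation step; with real forecasts they would reflect genuine model error.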
【Trivia】
Time series models are essential in business forecasting.
A common challenge is selecting the right model, as different models
handle trend, seasonality, and noise differently.
52. Visualizing Text Classification Results Using a
Confusion Matrix
Importance★★★★☆
Difficulty★★★☆☆
A company is facing challenges in classifying customer feedback as either
"positive" or "negative."
The company has implemented a basic text classification model to
automatically label the feedback.
However, management would like to visualize how well the model is
performing.
Create a Python program that trains a simple text classification model using
a sample dataset and visualizes the results using a confusion matrix.
Your task is to create the dataset, train a logistic regression model, and
generate a confusion matrix plot to show the classification performance.
Use the confusion matrix to explain how often the model correctly classifies
feedback as positive or negative.
【Data Generation Code Example】
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

## Creating the sample dataset

X = ["I love this product", "This is a terrible experience",
     "I am very satisfied", "Worst customer service ever",
     "Absolutely fantastic", "Not worth the money",
     "I am happy with my purchase", "Disappointing quality",
     "Great value for money", "I will never buy again"]

y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] # 1 for positive, 0 for negative

## Splitting the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
【Diagram Answer】

【Code Answer】
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

import matplotlib.pyplot as plt

## Creating the sample dataset

X = ["I love this product", "This is a terrible experience",
     "I am very satisfied", "Worst customer service ever",
     "Absolutely fantastic", "Not worth the money",
     "I am happy with my purchase", "Disappointing quality",
     "Great value for money", "I will never buy again"]

y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] # 1 for positive, 0 for negative

## Splitting the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

## Vectorizing the text data


vectorizer = CountVectorizer()

X_train_vect = vectorizer.fit_transform(X_train)

X_test_vect = vectorizer.transform(X_test)

## Training the logistic regression model

model = LogisticRegression()

model.fit(X_train_vect, y_train)
## Predicting on the test set

y_pred = model.predict(X_test_vect)

## Generating the confusion matrix

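## labels=[0, 1] keeps the matrix 2x2 even if the tiny test set or the
## predictions happen to contain only one class

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

## Plotting the confusion matrix

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["Negative", "Positive"])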

disp.plot(cmap=plt.cm.Blues)

plt.title("Confusion Matrix for Text Classification")


plt.show()

The problem is designed to help beginners visualize the performance of a
text classification model using a confusion matrix.
We start by creating a simple dataset of customer feedback, with "positive"
and "negative" labels. The feedback comments are stored in a list called X,
and the corresponding labels (1 for positive, 0 for negative) are stored in
another list, y.
We split the data into training and test sets using train_test_split() from
sklearn. With only ten comments and test_size=0.2, the test set holds just
two of them, so the resulting matrix is illustrative rather than a reliable
performance estimate.
Next, we use a CountVectorizer to transform the raw text into a numerical
format that can be processed by the machine learning algorithm. This
vectorizer converts the texts into a matrix of token counts, which forms the
input for the logistic regression model.
Logistic regression is used here because it is a simple and effective method
for binary classification tasks. After training the model on the vectorized
text data, we make predictions on the test set.
The confusion matrix helps us understand how well the model is performing.
It shows the counts of true positive, true negative, false positive, and
false negative predictions.
The ConfusionMatrixDisplay class from sklearn allows us to plot the
confusion matrix, making it easy to visualize the classification results.
The matrix is displayed with the labels "Negative" and "Positive", where the
diagonal elements represent correct classifications and the off-diagonal
elements show the misclassifications.
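The four cells of a binary confusion matrix can also be unpacked and turned into summary metrics. The following sketch uses a small set of hypothetical labels and predictions (made up for illustration, not produced by the model above) so that the counts are easy to verify by hand.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, chosen only for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], rows are actual classes and columns are predicted
# classes, so ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f}")
```

Precision and recall are the figures to look at when false positives and false negatives carry different costs, since accuracy alone treats both kinds of error the same.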
【Trivia】
The confusion matrix is an excellent tool for understanding model
performance, especially in situations where false positives or false
negatives might have significant consequences (e.g., medical diagnoses or
fraud detection).
53. Impact of Sample Size on Model Accuracy in
Predicting Customer Purchases
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to predict whether a customer will make a purchase
based on their interaction data (e.g., time spent on the website, number of
items viewed, etc.).
You are tasked with training a machine learning model to classify customer
purchase behavior.
To understand how sample size affects model accuracy, you will create
datasets with varying sizes and train a logistic regression model on each
one.
Visualize the accuracy of the model as the dataset size increases.
Use synthetic data for this task with features such as 'time spent', 'items
viewed', 'cart additions', and 'purchase made'.
Make sure to plot a graph showing the model's accuracy versus the sample
size.
【Data Generation Code Example】
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

np.random.seed(42)

# Create a synthetic dataset with random values

samples = lambda size: np.random.randint(1, 100, size=(size, 3))

labels = lambda size: np.random.randint(0, 2, size=(size,))


【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

np.random.seed(42)

# Generate synthetic dataset and labels


samples = lambda size: np.random.randint(1, 100, size=(size, 3))

labels = lambda size: np.random.randint(0, 2, size=(size,))

# Initialize variables to store sample sizes and accuracies

sample_sizes = [100, 500, 1000, 5000, 10000]

accuracies = []

# Loop through different sample sizes

for size in sample_sizes:
    X, y = samples(size), labels(size)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))

# Plot the result

plt.plot(sample_sizes, accuracies, marker='o')

plt.title('Effect of Sample Size on Model Accuracy')

plt.xlabel('Sample Size')

plt.ylabel('Accuracy')

plt.grid(True)

plt.show()
In this task, we aim to understand the relationship between sample size and
model accuracy using a machine learning classifier, Logistic Regression.
We begin by generating a synthetic dataset using random integers. This
dataset simulates customer interactions, where each data point contains
three features:
‣ 'time spent on website'
‣ 'items viewed'
‣ 'cart additions'
The target label, 'purchase made', is binary (0 or 1).
The dataset size varies from 100 to 10,000, allowing us to analyze how
increasing the amount of data impacts model accuracy.
We use the train_test_split function to hold out 30% of each dataset as a
test set, so that accuracy is measured on data the model has not seen.
A Logistic Regression model is then trained on the training data, and
predictions are made on the test data.
The accuracy of the model's predictions is calculated using accuracy_score.
Finally, the results are plotted, showing how the model’s accuracy changes
as the sample size increases.
Note that the labels here are drawn independently of the features, so there
is no pattern to learn and the accuracy stays close to 50% at every sample
size; what shrinks as the dataset grows is the random fluctuation of the
estimate. On data with a real relationship between features and target, a
larger dataset typically improves the model’s performance by reducing
variance and helping it generalize better.
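Because the labels in the exercise are random, the plotted accuracy cannot rise with sample size. The sketch below is one way to make the effect visible: it assumes a hypothetical data-generating rule in which the label depends on a weighted sum of three standard-normal features plus noise (the weights in true_weights are arbitrary), and it scores every model on the same fixed test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
true_weights = np.array([1.5, -1.0, 0.5])  # hypothetical effect sizes


def make_data(n):
    """Draw n samples whose labels depend on the features plus noise."""
    X = rng.normal(size=(n, 3))
    y = (X @ true_weights + rng.normal(size=n) > 0).astype(int)
    return X, y


# One fixed test set, so differences come from the training size alone.
X_test, y_test = make_data(2000)

sample_sizes = [50, 200, 1000, 5000]
accuracies = []
for size in sample_sizes:
    X_train, y_train = make_data(size)
    model = LogisticRegression().fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, model.predict(X_test)))

for size, acc in zip(sample_sizes, accuracies):
    print(f"n={size:>5}: test accuracy = {acc:.3f}")
```

Plotting sample_sizes against accuracies with the same matplotlib calls as in the code answer would then show a curve that levels off once the model has enough data to estimate the weights.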
【Trivia】
The relationship between sample size and accuracy is closely tied to the
"bias-variance tradeoff." Small sample sizes can lead to overfitting, where
the model learns noise from the data instead of meaningful patterns, leading
to high variance and lower accuracy on new data. Larger datasets reduce this
risk, but they also increase the computational resources required for
training.
54. Visualizing Learning Curves for Machine
Learning Models
Importance★★★★★
Difficulty★★★☆☆
A retail company is analyzing customer purchasing patterns to improve its
recommendation system.
You are tasked with comparing the performance of two machine learning
models: Logistic Regression and Random Forest.
The company wants to understand how the models' performance improves as the
amount of training data increases.
Using Python, create synthetic data that simulates customer purchases
(binary classification).
Then, train both models and plot their learning curves.
Ensure that the plot shows both training and validation scores as a function
of the training set size.
【Data Generation Code Example】
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.datasets import make_classification


# Create synthetic classification data with 1000 samples

X, y = make_classification(n_samples=1000, n_features=20,
n_informative=2, n_classes=2, random_state=42)

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,


random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt


from sklearn.model_selection import learning_curve

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split

# Generate synthetic data

X, y = make_classification(n_samples=1000, n_features=20,
n_informative=2, n_classes=2, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,


random_state=42)
# Define the models

logistic_model = LogisticRegression()

random_forest_model = RandomForestClassifier()

# Function to plot learning curves

def plot_learning_curve(estimator, X, y, title):
    # Cross-validated scores at 10 training set sizes
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=5, scoring='accuracy',
        train_sizes=np.linspace(0.1, 1.0, 10))
    # Average the scores over the 5 folds
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.plot(train_sizes, train_scores_mean, label='Training score')
    plt.plot(train_sizes, test_scores_mean, label='Validation score')
    plt.xlabel('Training Set Size')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True)

# Plot learning curves for both models

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)

plot_learning_curve(logistic_model, X_train, y_train,
                    'Logistic Regression Learning Curve')

plt.subplot(1, 2, 2)

plot_learning_curve(random_forest_model, X_train, y_train,
                    'Random Forest Learning Curve')

plt.tight_layout()

plt.show()

In this exercise, we are comparing the performance of two models, Logistic
Regression and Random Forest, on a binary classification task.
The goal is to understand how each model's performance improves as the size
of the training dataset increases. This process is visualized using a
learning curve, which helps us determine whether the model is underfitting,
overfitting, or generalizing well.
First, we generate synthetic classification data using
make_classification(). This function produces random data for
classification, where n_samples=1000 creates 1000 data points and
n_features=20 defines 20 features per data point.
After generating the data, we split it into training and testing sets using
train_test_split(), with 70% of the data used for training and 30% for
testing.
Next, we define two models: a Logistic Regression model and a Random Forest
model.
We use the learning_curve() function from scikit-learn to compute the
training and validation accuracies at various training set sizes. The
train_sizes parameter makes it evaluate the model at 10 different points,
ranging from 10% to 100% of the training data available in each
cross-validation fold. We then average the accuracy over the folds for both
the training and validation sets.
Finally, we plot the learning curves for both models. Each plot shows two
lines: one for the training score and one for the validation score. These
curves reveal how each model performs as it sees more data.
For example, if the training accuracy is much higher than the validation
accuracy, it indicates overfitting. On the other hand, if both scores are
low, the model may be underfitting.
The visualization helps us understand whether our models are learning
efficiently and where improvements can be made. By comparing both models, we
can select the one that performs best for the company's recommendation
system.
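The overfitting and underfitting checks described above can also be read off numerically. This is a minimal sketch, assuming the same make_classification settings as the exercise and only five training sizes to keep it quick; it prints the mean training score, the mean validation score with its spread across folds, and the gap between the two.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=2, n_classes=2, random_state=42)

# Scores have shape (number of training sizes, number of CV folds).
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 5))

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

# A large positive gap suggests overfitting; two low scores with a small
# gap suggest underfitting.
for n, tr, va, sd in zip(train_sizes, train_mean, val_mean, val_std):
    print(f"n={n:>4}  train={tr:.3f}  val={va:.3f} (+/- {sd:.3f})  "
          f"gap={tr - va:+.3f}")
```

The standard deviation across folds could also be drawn as a shaded band with plt.fill_between, which makes it easier to judge whether a difference between two models is larger than the fold-to-fold noise.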
【Trivia】
Learning curves are essential for understanding model performance,
especially in the early stages of development. They provide insights into
how much data is needed to improve a model's performance and whether a
model is generalizing well to unseen data.
55. Visualizing the Sensitivity of Machine Learning
Model Predictions
Importance★★★★☆
Difficulty★★★☆☆
You are working as a data scientist for an e-commerce company. The
company is using a machine learning model to predict whether a customer
will make a purchase based on several features (age, income, and website
activity). Management is interested in understanding how sensitive the
model's predictions are to changes in customer income.
Your task is to build a logistic regression model to predict customer
purchase behavior. Then, you will visualize how the predicted probability of
a purchase changes as income increases while keeping other variables
constant.
Generate random data for customer age, income, website activity, and
purchase labels. Visualize the model's sensitivity to income by plotting the
relationship between income and the probability of purchase.
【Data Generation Code Example】
import numpy as np

np.random.seed(0)

age = np.random.randint(18, 65, 100)

income = np.random.randint(20000, 120000, 100)

web_activity = np.random.randint(0, 50, 100)

purchase = np.random.randint(0, 2, 100)


X = np.column_stack((age, income, web_activity))

y = purchase
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import StandardScaler

# Generate synthetic data for the model

np.random.seed(0)

age = np.random.randint(18, 65, 100)


income = np.random.randint(20000, 120000, 100)

web_activity = np.random.randint(0, 50, 100)

purchase = np.random.randint(0, 2, 100)

X = np.column_stack((age, income, web_activity))

y = purchase

# Standardize the features to improve model performance

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

# Train a logistic regression model

model = LogisticRegression()

model.fit(X_scaled, y)

# Generate a range of income values for sensitivity analysis


income_range = np.linspace(20000, 120000, 100)

age_fixed = 35

web_activity_fixed = 20

# Prepare the input data with varying income while keeping other features
# constant

X_test = np.array([[age_fixed, income, web_activity_fixed]
                   for income in income_range])

X_test_scaled = scaler.transform(X_test)

# Predict probabilities using the trained model

purchase_probabilities = model.predict_proba(X_test_scaled)[:, 1]
# Plot the results to visualize sensitivity to income changes

plt.plot(income_range, purchase_probabilities,
         label="Purchase Probability")

plt.title("Sensitivity of Purchase Prediction to Income")

plt.xlabel("Income")

plt.ylabel("Predicted Probability of Purchase")

plt.legend()

plt.grid(True)

plt.show()

In this task, we are creating a logistic regression model to predict whether
a customer will make a purchase based on three input features: age, income,
and website activity. Logistic regression is commonly used for binary
classification problems (like predicting a purchase vs. no purchase).
To start, we generate synthetic data using NumPy to simulate real-world
features. The generated data represents customer age, income, website
activity, and purchase behavior (the target variable).
The features are then standardized using StandardScaler. Standardization
puts all features on the same scale, which matters for logistic regression
because scikit-learn's LogisticRegression applies L2 regularization by
default and its solver converges more reliably when the features have
comparable magnitudes.
After training the logistic regression model on the standardized data, we
perform sensitivity analysis by varying the income feature while keeping the
other features (age and website activity) constant. The purpose of this
analysis is to visualize how the predicted probability of purchase changes
as income increases.
To generate predictions, we create a test dataset with the same constant age
and website activity values, but varying income levels, and transform it
with the scaler that was fitted on the training data. This helps us focus on
the effect of income alone. The logistic regression model outputs
probabilities, which are then plotted to visualize how sensitive the
purchase prediction is to changes in income.
The resulting plot shows how the predicted probability of purchase changes
across different income levels. Because the purchase labels in this exercise
are random, any slope in the curve reflects chance patterns rather than a
real effect; on real data, the same plot indicates how much influence income
has on the model’s predictions.
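The sensitivity curve has a simple structure underneath: in logistic regression the log-odds are a linear function of each feature, so equal income steps shift the log-odds by equal amounts. The sketch below checks this on the same synthetic data; the names X_grid, log_odds, steps, and odds_ratio_per_sd are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
age = np.random.randint(18, 65, 100)
income = np.random.randint(20000, 120000, 100)
web_activity = np.random.randint(0, 50, 100)
y = np.random.randint(0, 2, 100)
X = np.column_stack((age, income, web_activity))

scaler = StandardScaler()
model = LogisticRegression().fit(scaler.fit_transform(X), y)

# Vary income on an even grid; hold age at 35 and web activity at 20.
income_range = np.linspace(20000, 120000, 100)
X_grid = np.column_stack((np.full(100, 35), income_range, np.full(100, 20)))

# decision_function returns the log-odds of the positive class.
log_odds = model.decision_function(scaler.transform(X_grid))
steps = np.diff(log_odds)

# Coefficient on the standardized income column, expressed as an odds
# ratio per one standard deviation of income.
odds_ratio_per_sd = float(np.exp(model.coef_[0][1]))

print("log-odds change per income step:", steps[0])
print("odds ratio per 1 SD of income:", odds_ratio_per_sd)
```

An odds ratio close to 1 means income barely moves the prediction, which is what one would expect here given that the labels are random.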
【Trivia】
Logistic regression is one of the simplest yet most effective models for
binary classification problems. While it assumes a linear relationship
between the input features and the log-odds of the outcome, it remains
highly interpretable, making it popular in various fields like finance and
healthcare.
56. Visualizing the Impact of Feature Engineering
on Classification Performance
Importance★★★★☆
Difficulty★★★☆☆
You are a data scientist working for a financial services company. The
company wants to predict whether a customer will default on their loan
based on some basic customer information. You are tasked with
experimenting with feature engineering techniques and visualizing how they
affect the performance of a machine learning model. Specifically, you are
asked to create a dataset with features such as age, income, and credit
score, then apply a logistic regression model and visualize the performance
before and after scaling the features. Your task is to:
‣ Generate a synthetic dataset with 2000 samples, each having the features
age, income, credit_score, and the target variable default (0 or 1).
‣ Train a logistic regression model using the raw data and display a
confusion matrix.
‣ Apply feature scaling (standardization), retrain the model, and display
the new confusion matrix.
‣ Visualize the comparison of the confusion matrices before and after
scaling.
【Data Generation Code Example】
import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

# # Create synthetic data

np.random.seed(42)

age = np.random.randint(18, 70, 2000)

income = np.random.randint(20000, 120000, 2000)


credit_score = np.random.randint(300, 850, 2000)

default = (np.random.rand(2000) > 0.7).astype(int)

# # Create DataFrame

data = pd.DataFrame({'age': age, 'income': income, 'credit_score':


credit_score, 'default': default})
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns


from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix

# # Create synthetic data

np.random.seed(42)

age = np.random.randint(18, 70, 2000)

income = np.random.randint(20000, 120000, 2000)


credit_score = np.random.randint(300, 850, 2000)

default = (np.random.rand(2000) > 0.7).astype(int)

data = pd.DataFrame({'age': age, 'income': income, 'credit_score':


credit_score, 'default': default})

# # Split data into train and test sets

X = data[['age', 'income', 'credit_score']]

y = data['default']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,


random_state=42)

# # Train logistic regression model on raw data

model = LogisticRegression()

model.fit(X_train, y_train)

y_pred_raw = model.predict(X_test)

conf_matrix_raw = confusion_matrix(y_test, y_pred_raw)

# # Apply feature scaling

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)

# # Train logistic regression model on scaled data

model_scaled = LogisticRegression()

model_scaled.fit(X_train_scaled, y_train)

y_pred_scaled = model_scaled.predict(X_test_scaled)
conf_matrix_scaled = confusion_matrix(y_test, y_pred_scaled)

# # Plot confusion matrices

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(conf_matrix_raw, annot=True, fmt='d', cmap='Blues',


ax=axes[0])

axes[0].set_title('Confusion Matrix (Raw Data)')

axes[0].set_xlabel('Predicted')

axes[0].set_ylabel('Actual')

sns.heatmap(conf_matrix_scaled, annot=True, fmt='d', cmap='Blues',


ax=axes[1])

axes[1].set_title('Confusion Matrix (Scaled Data)')

axes[1].set_xlabel('Predicted')

axes[1].set_ylabel('Actual')

plt.tight_layout()

plt.show()

In this exercise, we focus on feature engineering and its impact on model
performance using a logistic regression classifier. The dataset generated
contains the features age, income, and credit_score and the target variable
default, which indicates whether a customer defaults on their loan. The main
task involves visualizing how feature scaling affects the performance of the
model.
First, the dataset is divided into input features (X) and the target (y),
and then split into training and testing sets. The logistic regression model
is trained on the raw, unscaled data, and predictions are made for the test
set. To evaluate model performance, a confusion matrix is generated. The
confusion matrix shows how well the model's predictions align with the
actual labels.
Next, feature scaling is applied using StandardScaler, which standardizes
each feature to have a mean of 0 and a standard deviation of 1. Logistic
regression is not a distance-based model, but scaling still matters for it:
the default L2 regularization penalizes coefficients according to the
feature scale, and the iterative solver converges more reliably when
features have comparable magnitudes. After scaling, the model is retrained
on the scaled data, and a new confusion matrix is generated.
The two confusion matrices (before and after scaling) are visualized side by
side. This allows us to see the impact of feature scaling on the model's
ability to correctly classify customers who default and those who don't.
Because the default labels in this synthetic dataset are generated
independently of the features, expect the two matrices to look similar, with
most customers predicted as non-defaulting. On data with a real signal,
feature scaling generally helps models that are sensitive to feature
magnitudes, and comparing the matrices in this way shows the benefit of
standardizing the input features.
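It can help to confirm what standardization actually does to the columns, and to bundle the scaler with the model so that it is always fitted on training data only. The following is a minimal sketch, assuming a smaller random dataset of the same shape as the exercise (500 rows, generated with a different random generator); make_pipeline is scikit-learn's helper for chaining the two steps.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical columns: age, income, credit_score (as floats).
X = np.column_stack((rng.integers(18, 70, 500),
                     rng.integers(20000, 120000, 500),
                     rng.integers(300, 850, 500))).astype(float)
y = rng.integers(0, 2, 500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only, then inspect the result.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
print("raw means:   ", X_train.mean(axis=0).round(1))
print("scaled means:", X_train_scaled.mean(axis=0).round(6))
print("scaled stds: ", X_train_scaled.std(axis=0).round(6))

# A pipeline applies the same fit-on-train, transform-on-test logic
# automatically whenever fit() and predict() are called.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("test accuracy:", round(test_accuracy, 3))
```

Using a pipeline also keeps cross-validation honest, because the scaler is refitted inside each fold instead of seeing the validation rows in advance.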
【Trivia】
Did you know? Feature scaling is not always required for all models! For
instance, tree-based models like decision trees and random forests are
insensitive to the scale of the features. However, for models like logistic
regression, SVMs, and k-NN, scaling can drastically improve performance.
57. Creating and Visualizing a Neural Network for
Customer Image Classification
Importance★★★★★
Difficulty★★★☆☆
A retail company wants to implement an image classification system to
automatically classify images of products into different categories. Your
task is to create a simple neural network that can classify images of
products into three categories: "Shoes," "Bags," and "Shirts."
The input data consists of 28x28 pixel grayscale images of products. The
goal is to build and train a neural network using Python and visualize its
performance by plotting the accuracy and loss during training.
Write Python code that does the following:
‣ Create the data representing a simple image classification problem (you
can simulate product categories using random data).
‣ Define and train a neural network model with an appropriate architecture
for image classification.
‣ Plot and display the training and validation accuracy and loss over
epochs.
Make sure to simulate the dataset and visualize the results clearly,
focusing on the image classification aspect and the performance
visualization.
【Data Generation Code Example】
import numpy as np

import tensorflow as tf

from sklearn.model_selection import train_test_split

np.random.seed(42)

# Simulating random grayscale images with 28x28 pixels and 3 product
# categories

X = np.random.rand(300, 28, 28, 1)

# Simulating labels for 3 categories (Shoes=0, Bags=1, Shirts=2)

y = np.random.randint(0, 3, 300)
# Splitting data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


【Diagram Answer】

【Code Answer】
import numpy as np

import tensorflow as tf

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from tensorflow.keras import layers, models

np.random.seed(42)

# Simulating random grayscale images with 28x28 pixels and 3 product categories

X = np.random.rand(300, 28, 28, 1)


y = np.random.randint(0, 3, 300)

# Splitting data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


# Creating a simple neural network model

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(3, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
metrics=['accuracy'])

# Training the model and storing the training history


history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test,
y_test))

# Plotting accuracy and loss over epochs

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)

plt.plot(history.history['accuracy'], label='Train Accuracy')

plt.plot(history.history['val_accuracy'], label='Validation Accuracy')

plt.title('Accuracy over Epochs')

plt.xlabel('Epochs')
plt.ylabel('Accuracy')

plt.legend()

plt.subplot(1, 2, 2)

plt.plot(history.history['loss'], label='Train Loss')

plt.plot(history.history['val_loss'], label='Validation Loss')

plt.title('Loss over Epochs')

plt.xlabel('Epochs')

plt.ylabel('Loss')

plt.legend()
plt.show()

In this problem, we simulate a simple image classification scenario. A
dataset of 28x28 grayscale images is generated to mimic product images for
categories such as shoes, bags, and shirts. The labels for these images are
represented by integers 0, 1, and 2, which correspond to each product
category. We split the dataset into training and testing sets to prepare for the
model training process.
The neural network architecture used for image classification consists of
several layers:
‣ Convolutional layers (Conv2D) are responsible for learning image features
such as edges or textures. We use two convolutional layers with 32 and 64
filters respectively, each followed by a MaxPooling layer to reduce the
spatial size of the features.
‣ Flattening the output of the convolutional layers converts the
multi-dimensional feature maps into a 1D vector, which can be fed into
fully connected (dense) layers.
‣ The Dense layers include one hidden layer with 64 neurons and an output
layer with 3 neurons (representing the three categories) and a softmax
activation function for multi-class classification.
We compile the model using the Adam optimizer and sparse categorical
cross-entropy as the loss function, since it is a multi-class classification
problem with integer labels.
The model is trained for 10 epochs, and during training, the accuracy and
loss are monitored on both the training and validation datasets. The training
history is then plotted to show how the accuracy and loss evolve over time.
The plotted graphs provide important insights into model performance:
‣ The accuracy graph shows how well the model is learning to classify the
product images. If validation accuracy increases and stabilizes, the model is
learning effectively.
‣ The loss graph shows how the error decreases over time. A large gap
between training and validation loss might indicate overfitting, meaning the
model is performing well on the training data but poorly on unseen data.
Because the images and labels here are random, there is no real pattern to
learn: validation accuracy should stay near chance (about 33% for three
classes), and any rise in training accuracy reflects memorization of the
training set.
By visualizing both accuracy and loss over time, we can understand the
model's learning progress and diagnose potential issues like overfitting or
underfitting.
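To make the effect of each layer on the image size concrete, the following sketch traces the spatial dimensions through the two Conv2D/MaxPooling2D pairs described above. It assumes the Keras defaults ('valid' padding, stride 1 for convolutions, and a pooling stride equal to the pool size); the helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size, kernel=3):
    # A 'valid' convolution with stride 1 shrinks each side by kernel - 1.
    return size - kernel + 1

def pool_out(size, pool=2):
    # Non-overlapping max pooling floors the division.
    return size // pool

side = 28
side = conv_out(side)   # Conv2D(32, 3x3)  -> 26
side = pool_out(side)   # MaxPooling2D 2x2 -> 13
side = conv_out(side)   # Conv2D(64, 3x3)  -> 11
side = pool_out(side)   # MaxPooling2D 2x2 -> 5

# The last feature map has 64 channels, so Flatten yields side * side * 64 values.
flat_features = side * side * 64
print(side, flat_features)  # 5 1600
```

Under these assumptions the Flatten layer hands 1600 values to the 64-unit Dense layer, which is where most of the model's parameters sit.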
【Trivia】
Neural networks are particularly well-suited for image classification tasks
due to their ability to capture spatial hierarchies in images through
convolutional layers. Models like convolutional neural networks (CNNs)
have drastically improved the performance of image recognition systems,
making them a cornerstone in fields like computer vision, autonomous
driving, and healthcare diagnostics.
58. Visualizing Aggregated Predictions from
Multiple Machine Learning Models
Importance★★★★☆
Difficulty★★★☆☆
You are working for a retail company that wants to forecast future sales for
different products. You have access to three different machine learning
models, and you are tasked with aggregating the predictions from these
models to visualize the results. The goal is to show how the predictions of
each model compare for the same dataset and whether they align or differ in
their forecasting.
Generate sample data for three models with different predictions and create
a plot to visualize the aggregated results for comparison. The models'
predictions can be simulated as random values, and the task is to generate
predictions for 20 days of future sales.
Write Python code to visualize the aggregated predictions of these three
models.
【Data Generation Code Example】
import numpy as np

#Generating random predictions for three models

days = np.arange(1, 21)

model1_predictions = np.random.randint(100, 200, size=20)

model2_predictions = np.random.randint(90, 210, size=20)

model3_predictions = np.random.randint(110, 190, size=20)


【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

#Creating sample data for predictions

days = np.arange(1, 21)

model1_predictions = np.random.randint(100, 200, size=20)

model2_predictions = np.random.randint(90, 210, size=20)

model3_predictions = np.random.randint(110, 190, size=20)


#Plotting the predictions from three models

plt.plot(days, model1_predictions, label="Model 1 Predictions", marker='o')

plt.plot(days, model2_predictions, label="Model 2 Predictions", marker='s')

plt.plot(days, model3_predictions, label="Model 3 Predictions", marker='^')

#Adding labels and title

plt.xlabel("Day")
plt.ylabel("Sales Predictions")

plt.title("Aggregated Sales Predictions from Multiple Models")

plt.legend()

#Displaying the plot

plt.grid(True)

plt.show()

In this problem, the task involves comparing the predictions from three
different machine learning models on a simulated dataset.
The first step is to create predicted daily sales for 20 days. In reality,
machine learning models would be trained on past sales data, but here we
simulate their predictions using random integers. Three models are
represented by three arrays of random predictions: model1_predictions,
model2_predictions, and model3_predictions.
The visualization part uses matplotlib to create a line plot showing how
each model's predictions differ across the 20 days. The plot() function is
used to display the predictions of each model, and different markers ('o', 's',
'^') are used to distinguish the models. Labels for the x-axis (Day) and y-axis
(Sales Predictions) are added to clarify the plot, and a legend is created to
identify the lines for each model. Finally, grid(True) is included to display a
grid, making it easier to interpret the values in the plot.
Here "aggregated" means the three sets of predictions are gathered on one
chart; the code does not combine them into a single forecast. This
visualization allows for an easy comparison, helping to analyze whether the
models provide similar or divergent predictions over time.
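If a single combined forecast is also wanted, one simple option is to average the three models day by day. The sketch below is an illustrative assumption, not part of the original answer: it regenerates similar random predictions with a seeded generator and computes the per-day mean and the spread between the highest and lowest model.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
model1 = rng.integers(100, 200, size=20)
model2 = rng.integers(90, 210, size=20)
model3 = rng.integers(110, 190, size=20)

# Stack to shape (3, 20), then aggregate across the model axis.
stacked = np.vstack([model1, model2, model3])
ensemble_mean = stacked.mean(axis=0)

# Spread between the most optimistic and most pessimistic model per day.
spread = stacked.max(axis=0) - stacked.min(axis=0)

print(ensemble_mean.round(1))
print(spread)
```

Plotting `ensemble_mean` as a fourth, thicker line on the same chart would show where the averaged forecast sits relative to each individual model.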
【Trivia】
Aggregating results from multiple machine learning models is known as
ensemble learning. One common technique, called "stacking," feeds the
predictions from multiple models into a final model that learns how to
combine them, often improving overall performance.
59. Visualizing the Relationship Between Features
and a Target in a Customer Churn Problem
Importance★★★★☆
Difficulty★★★☆☆
A telecommunications company is trying to understand the factors that lead
to customer churn (when customers stop using their service). You are tasked
with building a machine learning model to predict whether a customer will
churn based on various features like customer age, monthly charges,
contract type, and tenure.
Your task is to create a classification model and visualize the relationship
between customer tenure and the probability of churn. Use a logistic
regression model and visualize the relationship by plotting the predicted
probabilities of churn based on customer tenure.
Generate a dataset that contains at least the following features:
‣ 'tenure': The number of months the customer has been with the company
‣ 'monthly_charges': The amount the customer pays per month
‣ 'age': The customer's age
‣ 'churn': Whether the customer has churned or not (binary target)
Plot a graph showing the relationship between 'tenure' and the predicted
probability of churn.
【Data Generation Code Example】
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000
tenure = np.random.randint(1, 61, n)
monthly_charges = np.random.uniform(30, 100, n)
age = np.random.randint(18, 75, n)
churn = np.random.choice([0, 1], n, p=[0.7, 0.3])
data = pd.DataFrame({'tenure': tenure, 'monthly_charges': monthly_charges,
                     'age': age, 'churn': churn})
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

np.random.seed(42)
n = 1000
tenure = np.random.randint(1, 61, n)
monthly_charges = np.random.uniform(30, 100, n)
age = np.random.randint(18, 75, n)
churn = np.random.choice([0, 1], n, p=[0.7, 0.3])
data = pd.DataFrame({'tenure': tenure, 'monthly_charges': monthly_charges,
                     'age': age, 'churn': churn})

# Split the data into features and target
X = data[['tenure', 'monthly_charges', 'age']]
y = data['churn']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Initialize and train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Generate predicted probabilities for churn based on tenure
X_test_sorted = X_test.sort_values(by='tenure')
predicted_probs = model.predict_proba(X_test_sorted)[:, 1]

# Plot the relationship between tenure and predicted churn probability
plt.figure(figsize=(10, 6))
plt.plot(X_test_sorted['tenure'], predicted_probs,
         label='Predicted Probability of Churn')
plt.xlabel('Customer Tenure (Months)')
plt.ylabel('Predicted Probability of Churn')
plt.title('Churn Probability vs Customer Tenure')
plt.legend()
plt.grid(True)
plt.show()

This exercise demonstrates how to visualize the relationship between an
independent feature (in this case, customer tenure) and the target variable
(customer churn) by using logistic regression.
First, a synthetic dataset is generated with features like tenure, monthly
charges, and age, and a binary churn target variable, which indicates
whether a customer churned (1) or stayed (0). The logistic regression model
is chosen because it is well-suited for binary classification problems like
churn prediction.
After the dataset is split into training and testing sets, we train the model
on the training data. Logistic regression models output probabilities for
binary outcomes, and we are particularly interested in the predicted
probability that a customer will churn based on their tenure.
Once the model is trained, we sort the test data by tenure and calculate the
predicted probability of churn for each customer. This allows us to plot the
relationship between customer tenure and the predicted churn probability.
Note that each prediction also depends on that customer's monthly charges
and age, so the line can look jagged rather than smooth. In addition, the
synthetic churn labels are drawn independently of the features, so the
predicted probabilities should hover around 0.3 without a strong trend.
With real data, the same plot shows how customer tenure influences the
likelihood of churn, offering insights that can help businesses take action to
retain customers.
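The probability a logistic regression model outputs is the sigmoid of a linear score. The sketch below illustrates that mapping for tenure alone; the `intercept` and `coef_tenure` values are made up for illustration and are not fitted from the dataset above.

```python
import numpy as np

def churn_probability(tenure, intercept=0.5, coef_tenure=-0.05):
    # Logistic (sigmoid) function applied to a linear score.
    # intercept and coef_tenure are hypothetical values, not fitted ones.
    z = intercept + coef_tenure * np.asarray(tenure, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

tenure_grid = np.arange(1, 61)          # 1 to 60 months
probs = churn_probability(tenure_grid)

print(probs[0], probs[-1])  # probability at 1 month vs 60 months
```

With a negative tenure coefficient the curve falls smoothly as tenure grows. Evaluating a fitted model on a grid like `tenure_grid`, with the other features held at fixed values, is one way to obtain a smooth line instead of a jagged one.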
【Trivia】
Did you know that customer churn prediction models are widely used in
industries like telecommunications, banking, and subscription services? By
predicting which customers are likely to leave, companies can implement
retention strategies, potentially saving millions of dollars in lost revenue
annually!
60. Visualizing Model Training Accuracy and Loss
Over Time
Importance★★★★★
Difficulty★★★☆☆
A customer wants to understand the performance of their machine learning
model during training, with a focus on both accuracy and loss trends over
multiple epochs.
Your task is to train a neural network on a dataset, track the training
accuracy and loss for each epoch, and visualize both trends over time.
Generate the data, train the model, and plot the accuracy and loss curves.
The customer is particularly interested in seeing whether the model shows
signs of overfitting or underfitting.
You should generate the dataset and training process yourself, and the code
should display the accuracy and loss plots on the same graph with two
y-axes.
【Data Generation Code Example】
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import tensorflow as tf

np.random.seed(0)
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
【Diagram Answer】

【Code Answer】
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import tensorflow as tf
import matplotlib.pyplot as plt

# Create synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Define and compile a simple neural network model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu',
                          input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model while capturing history
history = model.fit(X_train, y_train, epochs=50,
                    validation_data=(X_test, y_test), verbose=0)

# Extract accuracy and loss from the training history
accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(accuracy) + 1)

# Plot accuracy and loss
fig, ax1 = plt.subplots()

# Plot accuracy
ax1.plot(epochs, accuracy, 'b-', label='Training Accuracy')
ax1.plot(epochs, val_accuracy, 'g-', label='Validation Accuracy')
ax1.set_xlabel('Epochs')
ax1.set_ylabel('Accuracy', color='b')
ax1.tick_params('y', colors='b')

# Create a twin y-axis for the loss
ax = ax1.twinx()
ax.plot(epochs, loss, 'r-', label='Training Loss')
ax.plot(epochs, val_loss, 'orange', label='Validation Loss')
ax.set_ylabel('Loss', color='r')
ax.tick_params('y', colors='r')

# Add legends and show the plot
fig.legend(loc="upper right", bbox_to_anchor=(1, 1),
           bbox_transform=ax1.transAxes)
plt.title('Training and Validation Accuracy and Loss Over Time')
plt.show()

This exercise focuses on visualizing the performance of a machine learning
model during training by tracking both accuracy and loss over multiple
epochs.
The dataset is synthetically generated using make_classification from the
sklearn library, which creates a binary classification problem suitable for
training a simple neural network.
We split the data into training and testing sets using train_test_split to
simulate real-world model training and evaluation. This ensures that we can
observe both the model's performance on unseen data (validation set) and
its ability to learn from the training set.
The model is a basic feed-forward neural network built with TensorFlow's
Sequential API. It consists of two hidden layers with 16 and 8 units, using
the ReLU activation function to introduce non-linearity into the model. The
output layer uses a sigmoid activation function for binary classification.
We compile the model with the Adam optimizer, binary cross-entropy loss
(appropriate for binary classification), and accuracy as a performance
metric. The training process is controlled by the fit function, and we track
accuracy and loss for both training and validation sets over 50 epochs.
To visualize the performance, we use matplotlib to create a dual-axis plot.
The training and validation accuracy are plotted on the left y-axis, while the
training and validation loss are plotted on the right y-axis. This allows for a
clear comparison of how the model's accuracy improves over time and how
the loss decreases.
This plot can help detect overfitting or underfitting. For example, if the
validation loss begins to increase while the training loss continues to
decrease, it may indicate overfitting—where the model is performing well
on training data but poorly on unseen validation data. Conversely, if both
losses stay high and accuracy remains low, the model may be underfitting
and unable to capture the underlying patterns in the data.
By examining the plots, the user can assess whether the model is learning
efficiently and make necessary adjustments to the model architecture, the
amount of training data, or other hyperparameters to improve its
generalization performance.
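The overfitting check described above can also be done numerically. The following sketch uses made-up loss values shaped like `history.history['loss']` and `history.history['val_loss']`; it finds the epoch where validation loss bottoms out and checks whether the train/validation gap has widened since then.

```python
import numpy as np

# Hypothetical per-epoch losses (illustrative values, not real training output).
train_loss = np.array([0.70, 0.55, 0.45, 0.38, 0.33, 0.29, 0.26, 0.24])
val_loss = np.array([0.72, 0.58, 0.50, 0.46, 0.45, 0.47, 0.50, 0.54])

# Epoch (1-indexed) at which validation loss is lowest.
best_epoch = int(np.argmin(val_loss)) + 1

# Generalization gap per epoch: how much worse validation is than training.
gap = val_loss - train_loss

# Flag overfitting if the best epoch is not the last one and the gap has grown.
overfitting = bool(best_epoch < len(val_loss) and gap[-1] > gap[best_epoch - 1])

print(best_epoch, overfitting)  # 5 True
```

This is the same signal that early stopping relies on: training past the epoch with the lowest validation loss is unlikely to help on unseen data.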
【Trivia】
Overfitting is a common problem in machine learning where the model
performs exceptionally well on training data but poorly on new, unseen
data. Techniques like early stopping, dropout, and regularization are used to
mitigate overfitting.
61. Visualizing Residuals in a Regression Model for
Business Sales Prediction
Importance★★★★★
Difficulty★★★☆☆
You are working for a company that wants to predict its monthly sales using
a machine learning model based on the amount of advertising expenditure.
Your goal is to build a linear regression model to predict sales from
advertising spending. Once the model is trained, you must visualize the
residuals to evaluate the model's performance. Residuals will help you
understand if there are patterns the model has missed and whether
assumptions of the regression model hold.
Create a synthetic dataset with two variables: advertising expenditure (in
thousands of dollars) and sales (in thousands of units). After building the
regression model, create a scatter plot to visualize the residuals (the
difference between actual and predicted values).
【Data Generation Code Example】
import numpy as np

import pandas as pd

np.random.seed(42)

advertising_spend = np.random.uniform(5, 100, 100)

sales = 3 * advertising_spend + np.random.normal(0, 10, 100)

df = pd.DataFrame({'Advertising Spend': advertising_spend, 'Sales': sales})
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

np.random.seed(42)

advertising_spend = np.random.uniform(5, 100, 100)

sales = 3 * advertising_spend + np.random.normal(0, 10, 100)


df = pd.DataFrame({'Advertising Spend': advertising_spend, 'Sales':
sales})

# Build the regression model

model = LinearRegression()

X = df[['Advertising Spend']]

y = df['Sales']

model.fit(X, y)

# Predict the sales

y_pred = model.predict(X)

# Calculate the residuals


residuals = y - y_pred

# Plot residuals

plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='red', linestyle='--')

plt.title('Residuals vs Predicted Values')

plt.xlabel('Predicted Sales')

plt.ylabel('Residuals')

plt.show()

In this exercise, we first create a synthetic dataset where advertising
expenditure is the independent variable and sales are the dependent
variable. The regression model aims to predict sales based on advertising
spend.
To evaluate how well the model fits the data, we calculate the residuals,
which represent the difference between the actual and predicted sales
values. Residuals are critical because they indicate whether the model has
captured all patterns in the data or if any bias or missing information
remains.
In the code, we use LinearRegression() from scikit-learn to fit a linear
model. We train the model using the fit() function, passing in the advertising
expenditure as the predictor variable and sales as the target. Next, we use
predict() to generate the predicted sales based on the model. Residuals are
computed by subtracting the predicted values from the actual sales values.
The residuals are visualized using a scatter plot where the predicted sales
are plotted on the x-axis, and the residuals are plotted on the y-axis. A red
dashed line is drawn at y=0 to indicate the ideal scenario where residuals
should be centered around zero.
If there are clear patterns in the residuals, it suggests that the model is not
fully capturing the data patterns, which may indicate non-linearity or other
issues in the model. The lack of a clear pattern in the residual plot would
indicate that the model is appropriate for the data. This visualization helps
in diagnosing issues like heteroscedasticity (when the variance of residuals
changes across values) or non-linearity in the data.
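Two of the properties discussed above can be checked numerically as well as visually. The sketch below is an assumption-laden illustration: it uses NumPy's `polyfit` in place of scikit-learn and a seeded generator, so the numbers differ from the listing above. It confirms that least-squares residuals average to zero and compares the residual spread for low and high spending as a rough heteroscedasticity check.

```python
import numpy as np

rng = np.random.default_rng(42)
spend = rng.uniform(5, 100, 100)
sales = 3 * spend + rng.normal(0, 10, 100)

# Ordinary least squares line; degree 1 returns (slope, intercept).
slope, intercept = np.polyfit(spend, sales, 1)
residuals = sales - (slope * spend + intercept)

# With an intercept in the model, residuals average to (numerically) zero.
print(abs(residuals.mean()) < 1e-6)

# Rough heteroscedasticity check: residual spread for low vs high spend.
order = np.argsort(spend)
low, high = residuals[order[:50]], residuals[order[50:]]
print(low.std(), high.std())
```

Because the noise here is generated with a constant standard deviation, the two spreads should be of similar size; a large difference on real data would be a reason to look closer.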
【Trivia】
Residual plots are essential diagnostic tools in regression analysis. If
residuals are randomly scattered without a pattern, it suggests that the
model fits well. However, if a pattern appears (e.g., a U-shape), it indicates
the need for a non-linear model or data transformation.
62. Visualizing K-Means Clustering with Customer
Purchase Data
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to analyze its customers' purchase patterns and
segment them into different groups based on their behavior.
You are asked to use clustering techniques to group customers based on
their average purchase frequency and the total amount spent.
Use K-Means clustering to create customer segments and visualize the
clusters.
Your task is to generate synthetic data representing customers, apply K-
Means, and visualize the results to show the customer segments.

【Data Generation Code Example】


import numpy as np

np.random.seed(42)

customers = np.random.rand(200, 2) * [50, 1000]


【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

# Create sample customer data (frequency and amount spent)

np.random.seed(42)

customers = np.random.rand(200, 2) * [50, 1000]

# Apply KMeans clustering


kmeans = KMeans(n_clusters=4, random_state=42)

clusters = kmeans.fit_predict(customers)

# Plot the clusters

plt.scatter(customers[:, 0], customers[:, 1], c=clusters, cmap='viridis')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red', marker='X', label='Centroids')

plt.xlabel('Purchase Frequency')

plt.ylabel('Total Amount Spent')

plt.title('Customer Segments based on Purchase Patterns')

plt.legend()
plt.show()

This exercise introduces K-Means clustering, a popular unsupervised
machine learning algorithm. It is used to partition a dataset into distinct
groups (clusters) based on feature similarities. The algorithm assigns each
data point to the cluster with the nearest centroid, iteratively refining the
cluster assignments until convergence.
In this task, synthetic customer data is generated, representing purchase
frequency and total amount spent. The two features are chosen to reflect
typical customer behavior metrics in retail settings. After data generation,
K-Means is applied to cluster the customers into 4 groups. The clusters are
visualized using a scatter plot, where each point represents a customer, and
the color represents their assigned cluster. The red 'X' markers denote the
centroids of the clusters, which are the central points for each group.
The visualization helps the business understand the natural grouping of
customers based on their purchasing behavior, which can be used for
targeted marketing, customer service, and other business strategies. The
flexibility of the K-Means algorithm allows for easy experimentation with
different numbers of clusters. The results provide insight into which
customers behave similarly and might need similar marketing approaches.
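The assign-then-update loop described above can be written in a few lines of NumPy. This is a minimal sketch for illustration, not scikit-learn's implementation: the function names `assign` and `kmeans` are hypothetical, it runs a fixed number of iterations instead of testing for convergence, and it picks the starting centroids at random from the data.

```python
import numpy as np

def assign(points, centroids):
    # Distance from every point to every centroid, then the nearest index.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return assign(points, centroids), centroids

rng = np.random.default_rng(42)
customers = rng.random((200, 2)) * [50, 1000]
labels, centroids = kmeans(customers, k=4)
print(np.bincount(labels, minlength=4))  # customers per cluster
```

Since amount spent ranges up to 1000 while frequency only ranges up to 50, the Euclidean distance is dominated by the amount column. Standardizing both features before clustering is worth considering when both should carry equal weight.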
【Trivia】
K-Means clustering tends to work well with large datasets and has linear
scalability. However, it may struggle with non-globular clusters or if the
clusters have varying densities. The choice of the number of clusters (k) can
significantly impact the results, and techniques like the Elbow Method or
Silhouette Score can help determine the optimal value.
63. Visualizing Time Series Predictions Using
Machine Learning
Importance★★★★★
Difficulty★★★☆☆
You are working for a retail company that wants to predict future sales
based on historical sales data. Your task is to create a predictive model that
forecasts the sales for the next 30 days based on past sales data for 90 days.
Once the model predicts the future sales, you are required to visualize both
the actual and predicted sales on a line graph. Use machine learning
techniques to solve this problem and ensure that the graph clearly shows the
trend over time.
Generate the sales data within the code and predict future values using a
simple machine learning algorithm (e.g., Linear Regression). Make sure to
visualize both the actual and predicted data on the same graph.
【Data Generation Code Example】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

# # Generate sample sales data

np.random.seed(42)

days = np.arange(1, 91)

sales = np.random.normal(500, 50, 90).cumsum() + 1000  # Simulating cumulative sales

# # Create a DataFrame

data = pd.DataFrame({'Day': days, 'Sales': sales})


【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

# # Generate sample sales data

np.random.seed(42)

days = np.arange(1, 91)

sales = np.random.normal(500, 50, 90).cumsum() + 1000  # Simulating cumulative sales
data = pd.DataFrame({'Day': days, 'Sales': sales})

# # Prepare training data

X = data['Day'].values.reshape(-1, 1)

y = data['Sales'].values

# # Train the linear regression model

model = LinearRegression()

model.fit(X, y)

# # Predict sales for the next 30 days

future_days = np.arange(91, 121).reshape(-1, 1)


predicted_sales = model.predict(future_days)

# # Plot actual and predicted sales

plt.figure(figsize=(10, 6))

plt.plot(days, sales, label='Actual Sales', color='blue')

plt.plot(np.arange(91, 121), predicted_sales, label='Predicted Sales',
         color='red', linestyle='--')

plt.xlabel('Day')
plt.ylabel('Sales')

plt.title('Sales Prediction for Next 30 Days')

plt.legend()

plt.grid(True)

plt.show()
The goal of this task is to apply machine learning techniques to predict
future values for a time series dataset, specifically sales data. The model
chosen is Linear Regression, which is well-suited for detecting linear
relationships between variables like time and sales.
First, we generate synthetic sales data by taking the cumulative sum of
normally distributed daily values (mean 500, standard deviation 50), which
produces a steadily rising sales series over 90 days. We store these values in
a Pandas DataFrame and prepare the data for the machine learning model
by reshaping the 'Day' column as the independent variable (X) and the
'Sales' column as the dependent variable (y).
The Linear Regression model is trained on the existing data, which helps
the model to identify the trend in sales growth over time. After training, we
use the model to predict sales for the next 30 days, which involves creating
a new sequence of future days and feeding it into the model.
Finally, we visualize the results by plotting both the actual sales and the
predicted sales on a graph. The actual sales are shown for the first 90 days,
while the predicted values are plotted as a dashed line for the next 30 days.
This approach enables you to see how the model's predictions continue the
actual sales trend. The gridlines, labels, and legend make the graph easier to
interpret.
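Because the forecast covers days with no actual values, the plot alone cannot show how accurate it is. One common way to estimate that is a simple backtest: hold out the last part of the known data, fit on the rest, and measure the error. The sketch below is an illustrative assumption rather than part of the original answer; it uses NumPy's `polyfit` as the linear model and a seeded generator.

```python
import numpy as np

rng = np.random.default_rng(42)
days = np.arange(1, 91)
sales = rng.normal(500, 50, 90).cumsum() + 1000

# Fit on the first 60 days only, then forecast days 61-90.
train_days, test_days = days[:60], days[60:]
slope, intercept = np.polyfit(train_days, sales[:60], 1)
forecast = slope * test_days + intercept

# Mean absolute error on the held-out 30 days.
mae = np.mean(np.abs(forecast - sales[60:]))
print(round(slope, 1), round(mae, 1))
```

The slope should come out near 500, the average daily increment, and the held-out error gives a rough idea of how far a 30-day forecast from this kind of model might drift.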
【Trivia】
Linear Regression is one of the simplest and most interpretable machine
learning algorithms. However, for more complex time series data, other
algorithms such as ARIMA, LSTM, or Prophet may provide more accurate
predictions.
64. Visualizing Model Predictions for House Price
Prediction
Importance★★★★☆
Difficulty★★★☆☆
You are working as a data scientist for a real estate agency that wants to
help clients understand the relationship between various house features
(e.g., size, number of rooms, age) and the predicted house prices. Your task
is to build a simple linear regression model using synthetic data and
visualize the model's predictions in comparison to the actual data.
The agency needs a visualization that allows clients to see the model's
prediction for house prices based on square footage. You will need to create
a scatter plot of actual house prices against square footage and overlay it
with a line showing predicted house prices.
Using Python, generate synthetic data for house prices and square footage,
fit a linear regression model, and create the required visualization.
The agency expects you to clearly distinguish between actual and predicted
values in the plot, and ensure the plot is easy for clients to interpret.
【Data Generation Code Example】
import numpy as np

import pandas as pd

#Generate random data for house prices and square footage


np.random.seed(42)

square_footage = np.random.randint(500, 4000, 100)

house_prices = square_footage * 200 + np.random.normal(10000, 25000, 100)

#Combine into a DataFrame

df = pd.DataFrame({'SquareFootage': square_footage, 'HousePrice': house_prices})
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

#Generate synthetic data for house prices based on square footage

np.random.seed(42)

square_footage = np.random.randint(500, 4000, 100)


house_prices = square_footage * 200 + np.random.normal(10000, 25000,
100)

df = pd.DataFrame({'SquareFootage': square_footage, 'HousePrice': house_prices})

#Fit a linear regression model

X = df[['SquareFootage']]

y = df['HousePrice']

model = LinearRegression()

model.fit(X, y)

#Generate predictions using the model

predicted_prices = model.predict(X)

#Create the plot

plt.scatter(df['SquareFootage'], df['HousePrice'], label='Actual Prices',
            color='blue', alpha=0.6)

plt.plot(df['SquareFootage'], predicted_prices, label='Predicted Prices',
         color='red', linewidth=2)

plt.title('Actual vs Predicted House Prices')

plt.xlabel('Square Footage')

plt.ylabel('House Price ($)')

plt.legend()

plt.grid(True)
plt.show()
This problem focuses on helping you understand how to use machine learning models, particularly linear regression, to solve real-world problems. Linear regression is one of the most basic models used in machine learning to predict continuous values, and it is important to know how to visualize the performance of the model using plots.
First, synthetic data for house prices and square footage is generated. In this case, the square footage is a feature (independent variable), and the house price is the target variable (dependent variable). The relationship between the two is modeled by a simple linear regression.
Next, the linear regression model from sklearn is used to fit the relationship between square footage and house prices. The model uses this relationship to predict house prices based on the square footage of houses.
Once the model is trained, you can use it to make predictions on the same input data and compare these predictions to the actual house prices. The matplotlib library is used to create a scatter plot that shows actual house prices (in blue) and the predicted house prices (as a red line). The red line represents the model's prediction based on the square footage.
By plotting the actual vs. predicted prices, you can visually inspect how well the model has captured the underlying trend in the data. The goal is to overlay the predicted values in such a way that clients can easily interpret whether the model's predictions are close to reality.
This kind of visualization is useful not only for explaining models to non-technical audiences but also for assessing model accuracy during development.
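The visual check can be complemented with numeric scores. The following is a minimal sketch under stated assumptions: the data is generated here in the same spirit as the exercise, but the slope of 150 dollars per square foot, the intercept, and the noise level are illustrative values, not those of the original data generation code. It reports R² and the mean absolute error on the training data, which is the same data the plot uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical synthetic data: price grows linearly with size, plus noise
rng = np.random.default_rng(42)
square_footage = rng.uniform(500, 3500, size=100)
house_price = 50_000 + 150 * square_footage + rng.normal(0, 20_000, size=100)

X = square_footage.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
model = LinearRegression().fit(X, house_price)
predicted = model.predict(X)

# R^2: share of the variance in price explained by the fitted line
r2 = r2_score(house_price, predicted)
# MAE: typical size of a prediction error, in dollars
mae = mean_absolute_error(house_price, predicted)

print(f"Slope: {model.coef_[0]:.1f} $/sq ft, intercept: {model.intercept_:.0f} $")
print(f"R^2: {r2:.3f}, MAE: {mae:.0f} $")
```

Because both scores are computed on the data the model was fitted to, they tend to be optimistic; a held-out test set gives a fairer estimate.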
【Trivia】
Linear regression is one of the simplest types of machine learning models
but is extremely powerful in many scenarios. Despite its simplicity, it forms
the foundation for many more complex models like polynomial regression
and logistic regression. In real estate, linear models are commonly used to
estimate prices based on features like square footage, location, and number
of rooms.
65. Visualizing Outliers with Boxplots in Machine
Learning Data
Importance★★★★★
Difficulty★★★☆☆
You are working with a retail company that tracks daily sales data to identify trends and improve customer service. The company wants to find potential outliers in the sales data that could indicate either technical issues or special circumstances like promotions. Your task is to visualize the sales data using a boxplot to detect any significant outliers.
Generate synthetic sales data for 100 days, with sales numbers randomly distributed between $500 and $1500, but include a few extreme outliers between $2000 and $3000 to simulate these unusual events. Visualize this data with a boxplot and explain how the boxplot helps to identify outliers.
【Data Generation Code Example】
import numpy as np

import pandas as pd

np.random.seed(42)

#Generate 100 random sales data between 500 and 1500

sales = np.random.randint(500, 1500, 100)

#Add outliers between 2000 and 3000 to simulate unusual sales

outliers = np.random.randint(2000, 3000, 5)

#Combine regular sales data with outliers

sales_data = np.concatenate([sales, outliers])


【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

np.random.seed(42)

#Generate 100 random sales data between 500 and 1500


sales = np.random.randint(500, 1500, 100)

#Add outliers between 2000 and 3000 to simulate unusual sales

outliers = np.random.randint(2000, 3000, 5)


#Combine regular sales data with outliers

sales_data = np.concatenate([sales, outliers])

#Create a boxplot to visualize the sales data and highlight outliers

plt.boxplot(sales_data)

plt.title('Sales Data with Outliers')

plt.ylabel('Sales in USD')

plt.show()

In this problem, we use a boxplot to visualize the presence of outliers in sales data. A boxplot is a statistical chart that displays the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The box represents the interquartile range (IQR), which is the range between Q1 and Q3, capturing the central 50% of the data. The line inside the box represents the median, which is the middle value of the dataset.
In the context of sales data, boxplots are useful for detecting outliers. Outliers are data points that fall far outside the normal range of the dataset. In this case, we added a few extreme sales values between $2000 and $3000 to simulate these outliers. By default, matplotlib draws each whisker (the lines extending from the box) out to the most extreme data point within 1.5 × IQR of the box, rather than to the absolute minimum or maximum. When the boxplot is generated, the extreme values therefore appear as individual points beyond the upper whisker, indicating that they are significantly higher than most of the data.
This is useful in machine learning when we want to understand data distributions, as outliers can affect model performance if not handled properly. Identifying and handling outliers is essential because they can lead to inaccurate predictions or skewed model results. For example, in a regression model, outliers could influence the model's coefficients and lead to poor generalization.
By visualizing data through a boxplot, you can easily spot and investigate these anomalies before training your model. Boxplots are a quick and efficient way to check for such irregularities in your data, and they allow you to assess whether you need to remove or correct outliers for better model performance.
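The rule behind the whiskers can also be applied numerically. The sketch below reuses the data generation code from this problem and computes the 1.5 × IQR fences with NumPy; the variable names lower_fence, upper_fence, and flagged are chosen here for illustration.

```python
import numpy as np

np.random.seed(42)
sales = np.random.randint(500, 1500, 100)
outliers = np.random.randint(2000, 3000, 5)
sales_data = np.concatenate([sales, outliers])

# Quartiles and interquartile range
q1, q3 = np.percentile(sales_data, [25, 75])
iqr = q3 - q1

# Points beyond these fences are the ones a default boxplot draws individually
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

flagged = sales_data[(sales_data < lower_fence) | (sales_data > upper_fence)]
print(f"Q1={q1:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}")
print(f"Fences: [{lower_fence:.1f}, {upper_fence:.1f}]")
print(f"Flagged values: {np.sort(flagged)}")
```

An injected value that happens to fall just above $2000 can still land inside the upper fence, so the number of flagged points may be smaller than the number of injected outliers.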
【Trivia】
Boxplots were popularized by John Tukey in his 1977 book Exploratory Data Analysis (EDA). They are particularly useful for comparing distributions across different categories, making them a versatile tool in statistical analysis and machine learning preprocessing.
66. Visualizing Feature Importance in a Customer
Churn Prediction Model
Importance★★★★★
Difficulty★★★☆☆
You are working as a data scientist for a telecommunications company. The company wants to predict customer churn and understand which features contribute most to the prediction. You are tasked with creating a machine learning model to predict customer churn based on features such as customer service calls, tenure, contract type, and monthly charges. After training the model, the management wants you to visualize the feature importance to gain insights into what factors influence churn the most.
Use a RandomForestClassifier model to predict customer churn based on randomly generated data. Then, visualize the feature importance to show which features are most relevant in predicting churn.
The task involves:
Generating synthetic data for the customer churn prediction.
Training a RandomForestClassifier.
Visualizing the feature importance in a bar plot.
Write code to generate the data and the chart showing feature importance.
【Data Generation Code Example】
import numpy as np

import pandas as pd

np.random.seed(42)

data = pd.DataFrame({
    'tenure': np.random.randint(1, 72, 1000),
    'monthly_charges': np.random.uniform(20, 100, 1000),
    'customer_service_calls': np.random.randint(0, 10, 1000),
    'contract_type': np.random.choice(['month-to-month', 'one-year', 'two-year'], 1000),
    'churn': np.random.choice([0, 1], 1000)
})

data['contract_type'] = data['contract_type'].map({'month-to-month': 0,
    'one-year': 1, 'two-year': 2})
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score


import matplotlib.pyplot as plt

np.random.seed(42)

data = pd.DataFrame({
    'tenure': np.random.randint(1, 72, 1000),
    'monthly_charges': np.random.uniform(20, 100, 1000),
    'customer_service_calls': np.random.randint(0, 10, 1000),
    'contract_type': np.random.choice(['month-to-month', 'one-year', 'two-year'], 1000),
    'churn': np.random.choice([0, 1], 1000)
})

data['contract_type'] = data['contract_type'].map({'month-to-month': 0,
    'one-year': 1, 'two-year': 2})

X = data.drop('churn', axis=1)

y = data['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,


random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy}')

# Plot feature importance

feature_importances = model.feature_importances_

features = X.columns

plt.barh(features, feature_importances)
plt.xlabel('Importance')

plt.title('Feature Importance for Customer Churn Prediction')


plt.show()

This exercise demonstrates how to train a machine learning model to predict customer churn and how to visualize feature importance to gain insights into which factors most impact predictions.
First, a synthetic dataset is generated using NumPy and pandas. The data contains 1,000 records with features including tenure (number of months with the company), monthly charges, customer service calls, contract type (month-to-month, one-year, or two-year contracts), and a churn label (0 or 1) that indicates whether a customer has churned. The contract type is converted from categorical to numeric, mapping 'month-to-month' to 0, 'one-year' to 1, and 'two-year' to 2.
Next, the dataset is split into training and test sets using the train_test_split function. A RandomForestClassifier is chosen as the model because it not only performs well in classification tasks but also provides a way to compute feature importance. The model is trained on the training data, and predictions are made on the test data. The accuracy of the model is printed, which helps evaluate how well the model performs. Because the churn labels in this synthetic dataset are drawn at random, independently of the features, expect an accuracy close to 50%; the importances here illustrate the plotting technique rather than a real effect.
To visualize feature importance, the model's feature_importances_ attribute is accessed. This attribute provides the relative importance of each feature in making predictions. Finally, a horizontal bar plot is generated using matplotlib, with the features displayed on the y-axis and their importance on the x-axis. With real data, this plot helps to quickly understand which factors most influence whether a customer will churn, such as customer tenure or monthly charges.
Random forests are an ensemble learning method that combines many decision trees to reduce overfitting and improve generalization. The feature importance values reflect how much each feature reduces impurity in the splits where it is used, averaged across all trees in the forest and normalized to sum to 1. This allows decision-makers to focus on the most relevant factors when devising strategies to reduce churn.
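Impurity-based importances are computed from the training data, so it can be useful to cross-check them against a measure based on held-out data. The sketch below is one possible way to do that with scikit-learn's permutation_importance. It uses a hypothetical dataset from make_classification (an assumption made here so that some features actually carry signal), not the churn data of this exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical data: 2 informative features and 2 pure-noise features
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances: derived from the training data, sum to 1
impurity_importance = model.feature_importances_

# Permutation importance: drop in test accuracy when one feature is shuffled
perm = permutation_importance(model, X_test, y_test,
                              n_repeats=10, random_state=42)

for i in range(X.shape[1]):
    print(f"feature_{i}: impurity={impurity_importance[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f}")
```

A feature whose permutation importance is near zero on the test set contributes little to held-out accuracy, even if its impurity-based score is not zero.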
【Trivia】
Did you know that Random Forests, despite being powerful, can be difficult to interpret directly? Visualizing feature importance, as demonstrated here, is one of the ways to make these models more interpretable to non-technical stakeholders.
67. Visualizing Reinforcement Learning Agent's
Training Performance in a Maze
Importance★★★★☆
Difficulty★★★☆☆
You are working with a client who is developing a reinforcement learning
(RL) system to navigate through a maze. The goal is to help the client
visualize the agent's training process by showing how the agent's path
evolves over multiple episodes.
The agent will learn over several episodes, and the performance should be
evaluated in terms of the number of steps taken to reach the goal and the
path followed. You need to generate a plot that shows the evolution of the
agent's path through the maze across episodes.
Write Python code that does the following:
Simulate a simple 5x5 grid maze where the agent starts at the top-left corner
(0, 0) and needs to reach the bottom-right corner (4, 4). The maze has no
obstacles.
The RL agent follows a Q-learning approach, and its policy improves over
100 episodes.
Visualize how the agent's path changes from episode 1, episode 50, and
episode 100 using a plot. The plot should show the grid, the agent's path,
and the starting and goal points.
Do not import or use external datasets for this task; instead, generate the
necessary data within your code.

【Data Generation Code Example】


import numpy as np
np.random.seed(42)

grid_size = 5

def create_maze():
    return np.zeros((grid_size, grid_size))

def generate_random_q_table():
    # Q-table with one value per action (left, right, up, down) for every cell
    return np.random.rand(grid_size, grid_size, 4)
【Diagram Answer】

【Code Answer】
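import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

grid_size = 5
start = (0, 0)
goal = (4, 4)
episodes = 100
actions = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # Left, Right, Up, Down

# Q-learning settings (illustrative values)
alpha = 0.5      # learning rate
gamma = 0.9      # discount factor
epsilon = 0.1    # probability of taking a random exploratory action
max_steps = 500  # safety cap so that every episode terminates

def create_maze():
    return np.zeros((grid_size, grid_size))

def generate_random_q_table():
    return np.random.rand(grid_size, grid_size, 4)  # 4 actions

def get_action_from_q(q_table, state):
    # Epsilon-greedy: mostly pick the best-known action, occasionally explore
    if np.random.rand() < epsilon:
        return np.random.randint(len(actions))
    return int(np.argmax(q_table[state]))

def take_action(state, action):
    next_state = (state[0] + actions[action][0], state[1] + actions[action][1])
    # Clip to the grid so the agent stays inside the maze
    next_state = (max(0, min(next_state[0], grid_size - 1)),
                  max(0, min(next_state[1], grid_size - 1)))
    return next_state

def simulate_episode(q_table):
    state = start
    path = [state]
    while state != goal and len(path) <= max_steps:
        action = get_action_from_q(q_table, state)
        next_state = take_action(state, action)
        # Q-learning update: -1 per step, +10 for reaching the goal
        reward = 10 if next_state == goal else -1
        best_next = 0.0 if next_state == goal else np.max(q_table[next_state])
        q_table[state][action] += alpha * (reward + gamma * best_next
                                           - q_table[state][action])
        state = next_state
        path.append(state)
    return path

# Simulate agent training
q_table = generate_random_q_table()
paths_over_episodes = []
for episode in range(1, episodes + 1):
    path = simulate_episode(q_table)
    if episode in [1, 50, 100]:
        paths_over_episodes.append((episode, path))

# Visualization
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
for idx, (episode, path) in enumerate(paths_over_episodes):
    maze = create_maze()
    axs[idx].imshow(maze, cmap='gray_r')
    for (x, y) in path:
        axs[idx].scatter(y, x, color='blue')
    axs[idx].scatter(start[1], start[0], color='green', label="Start")
    axs[idx].scatter(goal[1], goal[0], color='red', label="Goal")
    axs[idx].set_title(f'Episode {episode} ({len(path) - 1} steps)')
    axs[idx].legend()

plt.show()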

The code is designed to help visualize how a reinforcement learning agent's
path improves over multiple episodes of training. In reinforcement learning,
an agent learns by interacting with the environment and updating its policy.
In this case, the agent is learning to navigate a 5x5 maze using Q-learning, a
popular RL algorithm.
The maze is represented as a 5x5 grid, where the agent's task is to move
from the top-left corner (0,0) to the bottom-right corner (4,4). The agent can
take one of four possible actions in each state: moving left, right, up, or
down.
The Q-table is randomly initialized for this example, meaning the agent
starts with no knowledge of the best actions to take. Over time, the agent
uses its Q-table to make decisions, and its path gradually becomes more
optimal as it gains experience.
The function simulate_episode() uses the Q-table to decide which action the
agent should take in each step. It follows the learned policy by choosing the
action with the highest value in the Q-table for the current state. If the agent
reaches the goal (bottom-right corner), the episode ends.
To show the improvement over time, the agent’s path is plotted at three
different stages: after episode 1, episode 50, and episode 100. Each subplot
shows the agent's path on the grid, starting from the green point and ending
at the red goal point. As the episodes progress, you should notice that the
path becomes shorter and more direct.
The visualization is an essential tool for reinforcement learning because it
allows practitioners to see the agent's learning progress. As the policy
improves, the path taken by the agent should become more efficient,
reflecting the agent's ability to maximize its cumulative reward. This
method helps developers fine-tune the training process by monitoring the
agent’s behavior over time.
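The way a Q-table entry changes after a single move can be worked through by hand. The sketch below applies the standard Q-learning update rule to one transition; the function name q_update and the values chosen for the learning rate, discount factor, reward, and Q-values are illustrative assumptions.

```python
def q_update(q_current, reward, best_next_q, alpha=0.5, gamma=0.9):
    """Return the updated Q-value after one observed transition."""
    # Target: immediate reward plus the discounted best value of the next state
    td_target = reward + gamma * best_next_q
    # Move the current estimate part of the way toward the target
    return q_current + alpha * (td_target - q_current)

# One step that costs -1 and leads to a state whose best action is worth 0.8
new_q = q_update(q_current=0.5, reward=-1, best_next_q=0.8)
print(round(new_q, 2))  # 0.5 + 0.5 * (-1 + 0.72 - 0.5) = 0.11
```

Repeating this update over many episodes lowers the values of moves that waste steps and raises the values of moves that lead toward the goal, which is what makes the greedy path shorter over time.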
【Trivia】
Reinforcement learning is often inspired by how animals (and humans)
learn. The Q-learning algorithm, for instance, mirrors a trial-and-error
approach, where an agent learns which actions yield the best rewards based
on its past experiences. In real-world applications, Q-learning has been used
in robotics, game AI, and even self-driving cars!
68. Visualizing the Impact of Different Kernel
Functions in SVM
Importance★★★★☆
Difficulty★★★☆☆
You are a data scientist working for a company that provides customer support through chat. The company wants to analyze customer sentiment based on the text of the messages sent to support agents. You have been tasked with building a Support Vector Machine (SVM) model to classify whether a message is "positive" or "negative" using different kernel functions and visualizing how the choice of kernel affects the decision boundaries. Create synthetic data points representing positive and negative sentiments and visualize the decision boundaries using linear, polynomial, and RBF kernels.
Your task is to:
Create synthetic sentiment data with two features.
Train SVM models with different kernel functions (linear, polynomial, RBF).
Visualize the decision boundaries of these kernel functions.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification

from sklearn.svm import SVC

from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import make_pipeline

# Create synthetic data

X, y = make_classification(n_samples=200, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1)

# Define a mesh grid to visualize decision boundaries

h = .02

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1


xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min,
y_max, h))

kernels = ['linear', 'poly', 'rbf']

titles = ['Linear Kernel', 'Polynomial Kernel', 'RBF Kernel']

plt.figure(figsize=(15, 5))

# Loop through different kernels and plot decision boundaries

for i, kernel in enumerate(kernels):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, degree=3,
                                              gamma='auto'))
    clf.fit(X, y)
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.subplot(1, 3, i + 1)
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7),
                 cmap=plt.cm.PuBu)
    plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='black')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm,
                edgecolors='k')
    plt.title(titles[i])

plt.tight_layout()

plt.show()

In this exercise, the goal is to understand how different kernel functions in Support Vector Machines (SVMs) impact classification by visualizing their decision boundaries. The SVM algorithm is a powerful classification technique that finds the optimal hyperplane to separate classes in the feature space. A kernel function lets the SVM behave as if the data had been transformed into a higher-dimensional space, without computing that transformation explicitly, which allows the model to find non-linear decision boundaries.
Here we use three different kernel functions:
‣ Linear Kernel: This kernel produces a linear decision boundary, which is a straight line in two dimensions. It works well when the data can be separated, at least approximately, by a linear boundary.
‣ Polynomial Kernel: This kernel can separate data points that are not linearly separable by introducing polynomial terms. It can fit more complex decision boundaries depending on the degree of the polynomial (degree 3 in this code).
‣ RBF (Radial Basis Function) Kernel: This kernel corresponds to a projection of the data into an infinite-dimensional space. It is highly flexible and can capture more complex patterns in the data but might lead to overfitting on small datasets.
We first generate synthetic data using make_classification(), which creates a two-dimensional dataset with two informative features. The full dataset is then standardized with StandardScaler inside a pipeline and used to train one SVM model per kernel; no train/test split is made, because the goal here is only to visualize the boundaries. A mesh grid is created using np.meshgrid() to visualize the decision boundaries, and we loop through the kernels, training each model and plotting the resulting boundaries. This visualization helps understand how the shape and flexibility of decision boundaries change with each kernel function.
Finally, plt.contourf() shades the region where the decision function is negative, plt.contour() draws the decision boundary (the zero level of the decision function), and plt.scatter() plots the data points.
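To make the idea of a kernel more concrete, the RBF kernel value for a pair of points can be computed directly from their squared distance. The sketch below does this by hand and compares the result with scikit-learn's rbf_kernel helper; the three sample points and the value gamma = 0.5 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Three hypothetical 2-D points
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [3.0, 4.0]])
gamma = 0.5

# Squared Euclidean distance between every pair of rows
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)
K_manual = np.exp(-gamma * sq_dists)
K_sklearn = rbf_kernel(X, gamma=gamma)

print(np.round(K_manual, 4))
print("Matches scikit-learn:", np.allclose(K_manual, K_sklearn))
```

Each point has similarity 1 with itself, and the similarity decays toward 0 as points move apart; a larger gamma makes that decay faster and the resulting decision boundary more local.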
【Trivia】
Did you know that the RBF kernel is essentially mapping data points into an infinite-dimensional space? This means that the model can capture highly complex relationships between the data points, which can be especially useful when data is not linearly separable!
69. Visualizing Data Preprocessing in Machine
Learning
Importance★★★★★
Difficulty★★★☆☆
A retail company wants to analyze customer data to predict purchasing behavior. They have a dataset with missing values and numerical outliers. As a data scientist, your task is to preprocess this data using Python for a machine learning model. You need to:
Handle missing values using an appropriate strategy.
Scale the numerical features to bring them to a similar range.
Detect and remove outliers to improve model performance.
Visualize the data before and after preprocessing.
Write code to preprocess the dataset and visualize the changes before and after using a scatter plot. The input data should be generated using random values for simplicity.
【Data Generation Code Example】
import numpy as np

import pandas as pd

# Create random data with missing values and outliers

np.random.seed(0)

data = pd.DataFrame({
    # Random integer ages from 18 to 69, stored as floats so NaN can be assigned
    'Age': np.random.randint(18, 70, 100).astype(float),
    # Normally distributed salary data
    'Salary': np.random.normal(50000, 15000, 100)
})

# Introduce missing values in 'Age'
data.loc[np.random.choice(data.index, 10, replace=False), 'Age'] = np.nan

# Introduce outliers in 'Salary'
data.loc[5:10, 'Salary'] *= 5
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt


from sklearn.preprocessing import StandardScaler

from sklearn.impute import SimpleImputer

# Create random data with missing values and outliers
np.random.seed(0)

data = pd.DataFrame({
    # Random integer ages from 18 to 69, stored as floats so NaN can be assigned
    'Age': np.random.randint(18, 70, 100).astype(float),
    # Normally distributed salary data
    'Salary': np.random.normal(50000, 15000, 100)
})

# Introduce missing values in 'Age'
data.loc[np.random.choice(data.index, 10, replace=False), 'Age'] = np.nan

# Introduce outliers in 'Salary'
data.loc[5:10, 'Salary'] *= 5

# Visualize data before preprocessing
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(data.index, data['Age'], label='Age')
plt.scatter(data.index, data['Salary'], label='Salary', alpha=0.7)
plt.title('Before Preprocessing')
plt.xlabel('Index')
plt.ylabel('Values')
plt.legend()

# Step 1: Handle missing values using median strategy
imputer = SimpleImputer(strategy='median')
# fit_transform returns a 2-D array; flatten it before assigning to one column
data['Age'] = imputer.fit_transform(data[['Age']]).ravel()

# Step 2: Detect and remove outliers in 'Salary'
q1, q3 = data['Salary'].quantile([0.25, 0.75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# .copy() avoids modifying a view of the original DataFrame in Step 3
data = data[(data['Salary'] > lower_bound) & (data['Salary'] <
            upper_bound)].copy()

# Step 3: Scale the data using StandardScaler
scaler = StandardScaler()
data[['Age', 'Salary']] = scaler.fit_transform(data[['Age', 'Salary']])

# Visualize data after preprocessing
plt.subplot(1, 2, 2)
plt.scatter(data.index, data['Age'], label='Age (Scaled)')
plt.scatter(data.index, data['Salary'], label='Salary (Scaled)', alpha=0.7)
plt.title('After Preprocessing')
plt.xlabel('Index')
plt.ylabel('Values (Scaled)')
plt.legend()

# Show the plot
plt.tight_layout()
plt.show()

In this exercise, we focus on data preprocessing, which is an essential step in machine learning. Preprocessing helps ensure that the data is clean, consistent, and ready for training models. Here's a detailed breakdown of the steps:
First, we create synthetic data with missing values and outliers using random number generation. In real-world cases, datasets often contain such imperfections, and handling them is critical.
To visualize the initial data, we use a scatter plot to show both the 'Age' and 'Salary' columns. This gives us a quick understanding of any outliers or missing data.
We address missing values by using the SimpleImputer class from sklearn. The strategy used here is 'median', which replaces missing values with the median of the column. This method is less sensitive to outliers compared to the mean.
Next, we detect and remove outliers from the 'Salary' column. The outliers are identified using the interquartile range (IQR), a common technique in statistics. The data points that lie beyond 1.5 times the IQR from the first or third quartiles are considered outliers and removed.
After outlier removal, we scale both 'Age' and 'Salary' using StandardScaler. This step transforms the data so that it has a mean of 0 and a standard deviation of 1, which is crucial when using algorithms sensitive to the scale of the data (like SVM, k-NN, etc.).
Finally, we visualize the processed data using another scatter plot, allowing us to see the improvements, particularly the removal of outliers and the effect of scaling.
This workflow prepares the data for machine learning models, ensuring the data is clean and normalized, which typically leads to better model performance.
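The claim that the median is less sensitive to outliers than the mean can be checked on a tiny example. The sketch below imputes one missing value in a hypothetical salary column that contains a single extreme value; the numbers are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical salaries: four typical values, one extreme value, one missing
salaries = np.array([[40_000.0], [45_000.0], [50_000.0],
                     [55_000.0], [500_000.0], [np.nan]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(salaries)
median_filled = SimpleImputer(strategy='median').fit_transform(salaries)

# The last row held the missing value
print("Mean imputation:  ", mean_filled[-1, 0])    # (40+45+50+55+500)k / 5 = 138000.0
print("Median imputation:", median_filled[-1, 0])  # middle of the five values = 50000.0
```

The mean is pulled far above every typical salary by the single extreme value, while the median stays at a representative level. This is one reason to impute before or independently of outlier removal with the median strategy.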
【Trivia】
Preprocessing isn't just about cleaning data—certain machine learning
algorithms, such as Support Vector Machines and K-Nearest Neighbors, can
produce significantly better results when data is properly scaled. This is
because these algorithms rely on distance-based calculations, which can be
distorted if features are on different scales.
70. Visualizing the Impact of Hyperparameters on a
Decision Tree Classifier
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to improve their customer churn prediction model. They are currently using a decision tree classifier but are unsure how tuning hyperparameters like max_depth and min_samples_split affects model performance. You are tasked with building a decision tree classifier and tuning these hyperparameters to visualize their effect on model accuracy.
Use synthetic data for customer features and a churn target. Visualize how varying these hyperparameters influences model performance using a plot that shows accuracy for different values of max_depth and min_samples_split. Focus on finding the best combination for improving churn prediction.
【Data Generation Code Example】
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.datasets import make_classification

## Generate synthetic classification data for customer churn

X, y = make_classification(n_samples=1000, n_features=10,
n_informative=5, n_classes=2, random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.tree import DecisionTreeClassifier

from sklearn.datasets import make_classification


import matplotlib.pyplot as plt
## Generate synthetic classification data

X, y = make_classification(n_samples=1000, n_features=10,
n_informative=5, n_classes=2, random_state=42)

## Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,


random_state=42)

## Define a range of hyperparameters for max_depth and min_samples_split

param_grid = {'max_depth': range(1, 11), 'min_samples_split': range(2, 11)}

## Initialize a decision tree classifier

clf = DecisionTreeClassifier(random_state=42)

## Use GridSearchCV to find the best hyperparameters based on cross-validation accuracy

grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, y_train)

## Extract the results from GridSearchCV

results = grid_search.cv_results_

## Create a meshgrid for plotting

depths = np.array(param_grid['max_depth'])

splits = np.array(param_grid['min_samples_split'])

scores = results['mean_test_score'].reshape(len(depths), len(splits))

## Plot the effect of hyperparameters on model accuracy


plt.figure(figsize=(8,6))

plt.imshow(scores, interpolation='nearest', cmap='viridis')

plt.colorbar(label='Accuracy')

plt.xticks(np.arange(len(splits)), splits)

plt.yticks(np.arange(len(depths)), depths)

plt.xlabel('min_samples_split')

plt.ylabel('max_depth')

plt.title('Effect of max_depth and min_samples_split on Accuracy')

plt.show()

This problem focuses on hyperparameter tuning of a decision tree classifier, which is a commonly used algorithm in machine learning.
We begin by generating synthetic classification data using the make_classification function. This function creates a dataset with informative features that simulate real-world data, ideal for training machine learning models. We split the data into training and testing sets using train_test_split. The grid search runs on the training set only; the test set is held out and is not scored in this code, so it remains available for a final check of the chosen model.
Next, we define the hyperparameters to tune: max_depth, which controls how deep the decision tree can grow, and min_samples_split, which specifies the minimum number of samples required to split a node. These hyperparameters have a significant impact on the model's performance, and finding good values is essential to prevent overfitting or underfitting.
We use GridSearchCV to automate the process of testing different combinations of hyperparameters. It evaluates each combination using 5-fold cross-validation, which provides a more reliable estimate of model performance by testing on different subsets of the data. The result is a grid of mean accuracy scores, with one row per value of max_depth and one column per value of min_samples_split.
The imshow function from matplotlib is used to visualize these results, where color represents the accuracy. By analyzing this heatmap, we can quickly identify the combination of hyperparameters that leads to the highest cross-validated accuracy for predicting customer churn.
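Besides reading the best cell off the heatmap, the winning combination can be retrieved from the fitted search object and then checked on the held-out test set. The sketch below is a reduced version of the search, assuming a smaller parameter grid than the one in the answer code so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Smaller grid than in the answer code, only to keep the sketch quick
param_grid = {'max_depth': [2, 4, 6, 8], 'min_samples_split': [2, 5, 10]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best combination and its mean cross-validated accuracy
best_params = grid_search.best_params_
cv_accuracy = grid_search.best_score_

# GridSearchCV refits the best model on all training data by default,
# so it can be scored directly on the untouched test set
test_accuracy = grid_search.score(X_test, y_test)

print(f"Best parameters: {best_params}")
print(f"CV accuracy: {cv_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

A test accuracy far below the cross-validated accuracy would suggest that the search has overfitted to the training folds.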
【Trivia】
Decision trees are prone to overfitting, especially when they grow too deep. This is why hyperparameter tuning like limiting max_depth is critical to improving model generalization, ensuring that it performs well not just on training data but also on unseen data.
71. Detecting Anomalies in Time Series Data Using
Machine Learning
Importance★★★★★
Difficulty★★★☆☆
A manufacturing company is monitoring its equipment performance and wants to detect anomalies in the sensor data they collect over time. You are tasked with building a Python machine learning model that detects unusual patterns in time series data from the machinery. The company is particularly interested in identifying sudden drops or spikes that may indicate equipment malfunctions.
Your job is to create synthetic time series data for equipment sensors, train an isolation forest model to detect anomalies, and visualize the detected anomalies in a line plot. The focus should be on training the model to understand what normal sensor performance looks like and flagging any deviations as potential anomalies.
Use the provided code to generate sensor data and then write the necessary machine learning code to detect and plot the anomalies.
【Data Generation Code Example】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

#Generating time series data with some anomalies

np.random.seed(42)

time = pd.date_range(start='2023-01-01', periods=200, freq='D')


values = np.random.normal(100, 5, size=(200,))

values[50:55] = values[50:55] - 30 # sudden drop as anomaly

values[150:155] = values[150:155] + 30 # sudden spike as anomaly


#Creating a DataFrame

data = pd.DataFrame({'date': time, 'value': values})


【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.ensemble import IsolationForest

#Generating time series data with some anomalies

np.random.seed(42)

time = pd.date_range(start='2023-01-01', periods=200, freq='D')

values = np.random.normal(100, 5, size=(200,))

values[50:55] = values[50:55] - 30 # sudden drop as anomaly


values[150:155] = values[150:155] + 30 # sudden spike as anomaly

#Creating a DataFrame

data = pd.DataFrame({'date': time, 'value': values})

#Training Isolation Forest for anomaly detection

model = IsolationForest(contamination=0.05, random_state=42)

data['anomaly'] = model.fit_predict(data[['value']])

#Plotting the data with anomalies

plt.figure(figsize=(10, 6))

plt.plot(data['date'], data['value'], label='Sensor Values')

#Highlighting anomalies in the plot

anomalies = data[data['anomaly'] == -1]

plt.scatter(anomalies['date'], anomalies['value'], color='red',


label='Anomalies')
plt.title('Time Series Anomaly Detection')

plt.xlabel('Date')

plt.ylabel('Sensor Value')

plt.legend()

plt.grid(True)

plt.show()

In this exercise, you are detecting anomalies in time series data using the Isolation Forest algorithm. Time series anomalies are important in various industries because they help detect issues such as equipment failure, network problems, or fraudulent activities.
Synthetic Data Generation: The synthetic time series data is created using numpy and pandas. The values are mostly generated from a normal distribution (mean 100, standard deviation 5), which simulates normal sensor behavior. A sudden drop and a sudden spike are manually added at indexes 50–54 and 150–154 to simulate abnormal behavior.
Isolation Forest Algorithm: Isolation Forest is an unsupervised machine
learning algorithm designed for anomaly detection. It works by isolating
observations by randomly selecting a feature and splitting the data. The
more splits it takes to isolate a point, the less likely it is an anomaly. The
model is fitted using the fit_predict() method, which outputs -1 for
anomalies and 1 for normal data. We set the contamination level to 0.05,
meaning we expect about 5% of the data to be anomalies.
Visualization: The time series data is plotted using matplotlib, showing
normal sensor values and highlighting anomalies in red. This visualization
helps easily identify the points where the sensor deviated from normal
behavior. The plot includes a title, axis labels, and a legend for clarity.
This task teaches how to detect anomalies in time series data using an
unsupervised machine learning model. It also helps develop skills in
creating synthetic datasets and using visualization to interpret machine
learning results.
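To make the "fewer splits means more anomalous" idea concrete, the sketch
below counts random splits for one-dimensional data. It is a simplified
illustration written for this explanation, not scikit-learn's
implementation; the helper name isolation_depth and the injected value 60.0
are assumptions chosen for the example.

```python
import numpy as np

def isolation_depth(values, target, rng):
    """Count the random splits needed until `target` is alone in its partition."""
    subset = values
    depth = 0
    while len(subset) > 1 and subset.min() < subset.max():
        split = rng.uniform(subset.min(), subset.max())
        # Keep only the side of the split that still contains the target
        if target < split:
            subset = subset[subset < split]
        else:
            subset = subset[subset >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
values = rng.normal(100, 5, size=200)   # "normal" sensor readings
values[0] = 60.0                        # hypothetical injected anomaly
typical = np.sort(values)[100]          # a reading near the median

# Average over many random partitionings, in the spirit of a forest of trees
outlier_depth = np.mean([isolation_depth(values, values[0], rng)
                         for _ in range(200)])
typical_depth = np.mean([isolation_depth(values, typical, rng)
                         for _ in range(200)])
print(f"outlier: {outlier_depth:.1f} splits, "
      f"typical point: {typical_depth:.1f} splits")
```

The injected reading sits far from the rest, so a single random cut often
separates it, while a reading near the median needs many more cuts.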
【Trivia】
Isolation Forest is a tree-based anomaly detection method that scales well
for high-dimensional data. It was specifically designed for anomaly
detection, unlike other machine learning models adapted for this task.
72. Visualizing the Voting Classifier for Customer
Churn Prediction
Importance★★★★★
Difficulty★★★☆☆
A telecommunications company is concerned about customer churn and
wants to predict whether a customer will leave the company based on
various factors like monthly charges, contract type, and payment method.
They have decided to use an ensemble learning model that combines
different classifiers using a Voting Classifier.
You are tasked with building a Voting Classifier with three models:
Logistic Regression, Random Forest, and Support Vector Machine (SVM). Use
this ensemble model to predict customer churn. Finally, visualize the
decision boundary of the ensemble model along with each individual
classifier.
Generate a synthetic dataset with two customer features standing in for
'MonthlyCharges' and 'ContractType'.
Implement a Voting Classifier (hard majority voting, or soft voting that
averages predicted probabilities, as the answer below does).
Visualize the decision boundaries of each individual classifier and the
ensemble model.
【Data Generation Code Example】
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=2,
n_informative=2, n_redundant=0, n_clusters_per_class=1,
random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.datasets import make_classification

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier

from sklearn.svm import SVC

from sklearn.ensemble import VotingClassifier


# Create a synthetic dataset
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Initialize individual classifiers (seeded so the plots are reproducible)
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(n_estimators=100, random_state=42)
clf3 = SVC(probability=True, random_state=42)  # probabilities for soft voting

# Combine them into a Voting Classifier
voting_clf = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='soft')

# Train all models (the VotingClassifier fits its own copies of clf1-clf3)
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)
clf3.fit(X_train, y_train)
voting_clf.fit(X_train, y_train)

# Create a meshgrid for plotting decision boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                     np.arange(y_min, y_max, 0.01))
grid = np.c_[xx.ravel(), yy.ravel()]

# Predict for each model
Z1 = clf1.predict(grid).reshape(xx.shape)
Z2 = clf2.predict(grid).reshape(xx.shape)
Z3 = clf3.predict(grid).reshape(xx.shape)
Z_voting = voting_clf.predict(grid).reshape(xx.shape)

# Plot decision boundaries
plt.figure(figsize=(14, 10))

# Logistic Regression
plt.subplot(2, 2, 1)
plt.contourf(xx, yy, Z1, alpha=0.8)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k')
plt.title('Logistic Regression')

# Random Forest
plt.subplot(2, 2, 2)
plt.contourf(xx, yy, Z2, alpha=0.8)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k')
plt.title('Random Forest')

# Support Vector Machine
plt.subplot(2, 2, 3)
plt.contourf(xx, yy, Z3, alpha=0.8)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k')
plt.title('SVM')

# Voting Classifier
plt.subplot(2, 2, 4)
plt.contourf(xx, yy, Z_voting, alpha=0.8)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k')
plt.title('Voting Classifier')

plt.tight_layout()
plt.show()

In this task, you are asked to build a Voting Classifier that combines three
different models: Logistic Regression, Random Forest, and Support Vector
Machine (SVM). The Voting Classifier aggregates the predictions from
these individual classifiers and makes the final prediction based on a
majority vote, or in this case, a soft vote (based on the predicted
probabilities). This allows the strengths of different classifiers to
complement each other and improve prediction performance.
The first step is to generate synthetic data using make_classification. This
function creates a dataset with two features that can be used to separate two
classes. We split this dataset into training and testing sets to train the
models.
After that, we initialize each individual classifier. Logistic Regression is a
linear model that is easy to interpret, while Random Forest is an ensemble
model that uses decision trees. SVM is effective in high-dimensional
spaces, and in this case, we use it with probability estimates for soft voting.
Next, we combine these classifiers into a VotingClassifier. The voting='soft'
parameter means that the final prediction is based on the average of
predicted probabilities from the individual classifiers. This soft voting
approach is often more robust than hard voting, where only the final
predicted labels are used.
To visualize how each classifier makes decisions, we create a mesh grid
over the feature space. For each point in this grid, the classifiers predict
whether it belongs to class 0 or class 1. The results are then plotted to show
the decision boundaries. By comparing the plots, you can see how each
model’s decision-making differs and how the ensemble Voting Classifier
aggregates them.
Finally, the plt.contourf function is used to plot the decision regions, while
the plt.scatter function is used to plot the actual data points. Each subplot
represents the decision boundary of one classifier, allowing you to see how
they each divide the feature space differently and how the Voting Classifier
integrates them into a final decision boundary.
This exercise demonstrates the power of ensemble learning, especially
when different types of models are combined to create a more robust
predictor. The visual comparison of decision boundaries helps you better
understand how each model contributes to the ensemble.
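The difference between hard and soft voting can be seen with a single
customer. The sketch below uses made-up probabilities (they are
illustrative numbers, not output from the models above) to show a case
where the two rules disagree.

```python
import numpy as np

# Hypothetical P(churn) for ONE customer from three classifiers
# (illustrative numbers, not output from the models above)
proba_churn = np.array([0.45, 0.40, 0.95])

# Hard voting: each model casts one vote for its own predicted label
votes = (proba_churn >= 0.5).astype(int)
hard_prediction = int(votes.sum() > len(votes) / 2)

# Soft voting: average the probabilities first, then threshold once
mean_proba = proba_churn.mean()
soft_prediction = int(mean_proba >= 0.5)

print("votes:", votes)
print("hard vote:", hard_prediction)
print("soft vote:", soft_prediction, f"(mean probability {mean_proba:.2f})")
```

Two models lean weakly toward "no churn" and one is very confident of
"churn", so the majority vote says 0 while the averaged probability says 1.
Soft voting lets a confident model outweigh two uncertain ones.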
【Trivia】
The Voting Classifier works particularly well when the individual classifiers
are diverse in their approach to learning the data. This diversity helps
ensure that errors made by one model are corrected by others, making the
ensemble model more resilient.
73. Visualizing the Impact of Train-Test Splits in
Machine Learning
Importance★★★★★
Difficulty★★★☆☆
You are a data scientist at a company that develops predictive models for
real estate prices.
The company wants to understand how different train-test splits affect the
performance of a linear regression model predicting house prices.
Your task is to visualize the performance of the model using Mean Squared
Error (MSE) as a metric by splitting the data into different training and
testing proportions (80%-20%, 70%-30%, 60%-40%).
Generate synthetic data for housing prices using features like size,
number_of_bedrooms, and age_of_house.
Build and evaluate linear regression models for the different splits, and
plot the MSE values for each split to compare model performance.
【Data Generation Code Example】
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

# Generating synthetic data for house prices (seeded for reproducibility)
np.random.seed(42)
X = np.array([[np.random.randint(1000, 4000), np.random.randint(1, 5),
               np.random.randint(1, 50)] for _ in range(1000)])
y = np.array([50000 + 100 * size + 20000 * bedrooms - 300 * age
              for size, bedrooms, age in X])
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

# Generating synthetic data (seeded so the results are reproducible)
np.random.seed(42)
X = np.array([[np.random.randint(1000, 4000), np.random.randint(1, 5),
               np.random.randint(1, 50)] for _ in range(1000)])
y = np.array([50000 + 100 * size + 20000 * bedrooms - 300 * age
              for size, bedrooms, age in X])

# Different train-test splits to compare
splits = [(0.8, 0.2), (0.7, 0.3), (0.6, 0.4)]
mse_values = []

# Train models and calculate MSE for each split
for train_size, test_size in splits:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, test_size=test_size, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_values.append(mse)

# Plotting the MSE values
plt.plot(['80%-20%', '70%-30%', '60%-40%'], mse_values, marker='o')
plt.title('MSE for Different Train-Test Splits')
plt.xlabel('Train-Test Split')
plt.ylabel('Mean Squared Error')
plt.show()

In this exercise, you are required to explore how changing the train-test split
ratio affects the performance of a machine learning model.
We use linear regression, a supervised learning algorithm, to predict house
prices based on features like size, the number of bedrooms, and the age of
the house.
First, we generate synthetic data where X contains the features (size,
number of bedrooms, and age of the house), and y contains the house prices
calculated based on a linear equation.
This synthetic data allows us to simulate a real-world scenario of predicting
house prices using basic features.
We then use different proportions for splitting the data into training and
testing sets: 80%-20%, 70%-30%, and 60%-40%.
The goal is to evaluate how the model's performance, measured by the
Mean Squared Error (MSE), changes when the amount of training data is
increased or decreased.
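The MSE is used as the performance metric; it is the average squared
difference between the model's predictions and the actual values. A lower
MSE indicates better performance.
We train the model on the training set and predict prices on the test set for
each split. We then compute the MSE for each split and store these values.
Finally, we plot the MSE for each of the splits to visualize how model
performance changes with different train-test ratios.
Note that the synthetic prices here are an exact linear function of the
three features, with no noise term. Linear regression therefore recovers
the relationship almost perfectly, the MSE is close to zero for every
split, and the differences in the plot are only floating-point error. A
noise term has to be added to y before the splits can differ in a
meaningful way.
Comparing splits is a common step when building machine learning models,
because the amount of training data affects how well the model fits, while
the amount of test data affects how reliable the error estimate is.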
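As a variation on the exercise, the sketch below adds Gaussian noise to the
prices and repeats each split with several random seeds, so the comparison
reflects more than one particular shuffle. The noise level (noise_std) and
the number of repeats are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
size = rng.integers(1000, 4000, n)
bedrooms = rng.integers(1, 5, n)
age = rng.integers(1, 50, n)
X = np.column_stack([size, bedrooms, age])

noise_std = 10_000  # assumed noise level, chosen only for illustration
y = (50000 + 100 * size + 20000 * bedrooms - 300 * age
     + rng.normal(0, noise_std, n))

results = {}
for test_size in (0.2, 0.3, 0.4):
    # Repeat each split with several seeds so one lucky split does not mislead
    mses = []
    for seed in range(20):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = LinearRegression().fit(X_tr, y_tr)
        mses.append(mean_squared_error(y_te, model.predict(X_te)))
    results[test_size] = (np.mean(mses), np.std(mses))
    print(f"test_size={test_size}: mean MSE={results[test_size][0]:.3e}, "
          f"std across seeds={results[test_size][1]:.3e}")
```

With noise present, the mean MSE settles near the noise variance
(noise_std squared), and the spread across seeds shows how much a single
split can vary.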
【Trivia】
In real-world machine learning, different train-test splits can have a
significant impact on the model’s performance, especially with smaller
datasets.
While 80%-20% is a common default split, in some domains where labeled
data is scarce, such as finance or medicine, an even smaller test portion
like 90%-10% may be used to maximize the use of training data.
74. Using Violin Plots to Visualize Categorical Data
in Customer Satisfaction
Importance★★★★☆
Difficulty★★★☆☆
A marketing team for an online retail company wants to analyze customer
satisfaction based on product category. The company has data on customer
satisfaction ratings (on a scale of 1 to 5) for several different product
categories (e.g., electronics, clothing, furniture).
Your task is to visualize the distribution of customer satisfaction ratings
for each product category using a violin plot. This will help the marketing
team understand the spread of ratings and identify categories that may need
improvement.
Generate a random dataset with 3 product categories (Electronics, Clothing,
Furniture) and 200 customer ratings for each category. Use this data to
create a violin plot showing the distribution of ratings per category.
【Data Generation Code Example】
import numpy as np

import pandas as pd

categories = ['Electronics', 'Clothing', 'Furniture']

np.random.seed(42)

ratings = np.random.randint(1, 6, 600)

product_categories = np.random.choice(categories, 600)

data = pd.DataFrame({'Category': product_categories, 'Rating': ratings})


【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt


#Create random customer satisfaction ratings and categories

categories = ['Electronics', 'Clothing', 'Furniture']


np.random.seed(42)

ratings = np.random.randint(1, 6, 600)

product_categories = np.random.choice(categories, 600)

data = pd.DataFrame({'Category': product_categories, 'Rating': ratings})

#Create a violin plot to show the distribution of ratings per category

plt.figure(figsize=(8, 6))
sns.violinplot(x='Category', y='Rating', data=data)

plt.title('Distribution of Customer Satisfaction Ratings by Product Category')

plt.xlabel('Product Category')
plt.ylabel('Customer Satisfaction Rating')

plt.show()

In this task, you visualize customer satisfaction ratings using a violin
plot, which combines a box plot with a kernel density plot. It shows not
only summary statistics such as the median and quartiles but also the
density of the data at each value, making it easier to understand the
spread and skewness of the data.
The first step is to generate the dataset. Random integers from 1 to 5
(representing customer satisfaction ratings) are created using
np.random.randint(1, 6, 600), whose upper bound is exclusive. The 600
ratings are assigned to the three product categories (Electronics,
Clothing, and Furniture) with np.random.choice, so each category receives
roughly, but not exactly, 200 ratings. This step simulates a scenario where
customers provide ratings for various products.
Once the data is prepared, you use the Seaborn library to create the violin
plot. Seaborn's violinplot function is well suited to showing the
distribution of numerical data across different categories. By setting the
x-axis to Category and the y-axis to Rating, you display how customer
satisfaction is distributed across the three product types. Because the
ratings are integers, the smoothed density can extend slightly below 1 and
above 5; this is a property of the kernel density estimate, not of the
data. This kind of plot is useful in machine learning when exploring
feature distributions or evaluating how different categories affect a
target variable.
Finally, plt.show() is used to display the plot. The plot gives you
insights into how satisfaction ratings are spread within each product
category, which is valuable for decision-making in customer satisfaction
improvement strategies.
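If exactly 200 ratings per category are required, as the task statement
asks, the categories can be repeated rather than sampled. The sketch below
is a variation on the generation code above (it swaps np.random.choice for
np.repeat, and uses NumPy's Generator API) and prints a numeric summary
that mirrors what the violin plot shows visually.

```python
import numpy as np
import pandas as pd

categories = ['Electronics', 'Clothing', 'Furniture']
rng = np.random.default_rng(42)

# np.repeat yields exactly 200 rows per category; random sampling would not
data = pd.DataFrame({
    'Category': np.repeat(categories, 200),
    'Rating': rng.integers(1, 6, size=600),   # integers 1..5 inclusive
})

# Per-category count, mean, and quartiles of the ratings
summary = data.groupby('Category')['Rating'].describe()
print(summary[['count', 'mean', '25%', '50%', '75%']])
```

The resulting DataFrame can be passed to sns.violinplot unchanged, since it
keeps the same 'Category' and 'Rating' column names.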
【Trivia】
Violin plots are especially useful in machine learning for feature
engineering and exploratory data analysis (EDA). By visualizing the
distribution of a numerical variable within each category, you can quickly
spot skew, outliers, or multiple peaks that may influence the model's
predictions.
75. Visualizing Feature Selection Process in
Machine Learning
Importance★★★★☆
Difficulty★★★☆☆
You are a data analyst working for a company that provides a health and
fitness tracking app. Your manager has tasked you with improving the
prediction accuracy of an algorithm that estimates a person's health score
based on various factors such as heart rate, step count, and sleep data. Your
goal is to visualize which features in the dataset are most important for
predicting the health score, so the team can optimize data collection for
more accurate predictions.
Create a Python script to select the most relevant features for predicting
health score and visualize the results using a feature importance plot. The
dataset should have 10 features and 1 target variable, with at least 200
records generated randomly.
【Data Generation Code Example】
import numpy as np

import pandas as pd

#Generating a random dataset with 10 features and 1 target

np.random.seed(42)

data = pd.DataFrame(np.random.rand(200, 10),
                    columns=[f'feature_{i}' for i in range(1, 11)])

data['health_score'] = np.random.randint(0, 100, size=200)


【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

from sklearn.ensemble import RandomForestRegressor


from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

#Generating random dataset with 10 features and 1 target (health_score)

np.random.seed(42)

data = pd.DataFrame(np.random.rand(200, 10),
                    columns=[f'feature_{i}' for i in range(1, 11)])

data['health_score'] = np.random.randint(0, 100, size=200)


#Splitting data into features (X) and target (y)

X = data.drop(columns=['health_score'])

y = data['health_score']

#Splitting the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

#Creating and training a RandomForestRegressor model

model = RandomForestRegressor(n_estimators=100, random_state=42)

model.fit(X_train, y_train)

#Extracting feature importances from the trained model


importances = model.feature_importances_

feature_names = X.columns

#Visualizing the feature importances using a bar plot


plt.figure(figsize=(10,6))

plt.barh(feature_names, importances, color='skyblue')

plt.xlabel('Importance')

plt.ylabel('Feature')

plt.title('Feature Importance for Predicting Health Score')

plt.show()

This task focuses on visualizing which features (inputs) of the dataset
contribute the most to predicting the target variable, "health score".
Feature selection can improve a model's performance by identifying the most
relevant inputs.
The script starts by generating a random dataset with 10 features and 1
target variable (health score). Each feature stands in for data like heart
rate, step count, and sleep hours, although the values are randomly
generated here.
After generating the dataset, it is split into a training set and a testing
set. The training set is used to train the model; the testing set is
reserved for evaluating performance, although this script does not go on to
score the model.
We use a RandomForestRegressor, a type of machine learning model that fits
many decision trees and outputs the average prediction. The
RandomForestRegressor is particularly useful here because it computes
feature importances as a by-product of training.
The feature importance is obtained using the feature_importances_ attribute
of the trained model. This attribute gives a score to each feature,
indicating its contribution to the prediction of the health score; the
scores sum to 1.
Finally, the code visualizes the importance of each feature using a
horizontal bar plot. The feature names are displayed along the y-axis, and
their respective importance values are shown along the x-axis. In a real
dataset, the features with the highest values are the ones the model relies
on most, which tells the team which data points to prioritize. In this
exercise the target is generated independently of the features, so the bars
reflect chance patterns in the sample rather than real relationships.
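To see what the importance plot looks like when some features really do
drive the target, the sketch below compares two targets built on the same
features. The linear relationship (only feature_1 and feature_2 matter) and
the helper name ranked_importances are assumptions made up for this
illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.random((200, 10)),
                 columns=[f'feature_{i}' for i in range(1, 11)])

# Hypothetical relationship: only feature_1 and feature_2 drive the score
y_informative = (50 * X['feature_1'] + 20 * X['feature_2']
                 + rng.normal(0, 2, 200))
# Target with no relationship to any feature, as in the exercise
y_random = rng.integers(0, 100, size=200)

def ranked_importances(X, y):
    """Fit a random forest and return importances sorted high to low."""
    model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
    return pd.Series(model.feature_importances_,
                     index=X.columns).sort_values(ascending=False)

imp_informative = ranked_importances(X, y_informative)
imp_random = ranked_importances(X, y_random)
print("Informative target:\n", imp_informative.head(3))
print("Random target:\n", imp_random.head(3))
```

With the informative target, one feature dominates the ranking; with the
random target, the importances are spread fairly evenly, which is the
signature of a model that has found no real signal.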
【Trivia】
Random forests are known for being robust and effective even with
minimal tuning of hyperparameters. Their built-in feature importance
evaluation is a great tool for understanding the relevance of various inputs
without the need for complex methods like recursive feature elimination.
76. Visualizing Hyperparameter Optimization for a
Customer Classification Model
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to classify its customers into two segments based
on their purchasing behavior to offer personalized promotions.
You have built a machine learning model, but to improve accuracy, you need
to optimize the model's hyperparameters.
After optimizing the hyperparameters using GridSearchCV, visualize how the
different combinations of hyperparameters influence the model's
performance.
Generate some data for customer purchases, fit a RandomForestClassifier,
and visualize the results of the hyperparameter tuning.
You should create a plot showing how accuracy varies across different
values of hyperparameters.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split

# Create synthetic data representing customer purchases

X, y = make_classification(n_samples=500, n_features=10,
n_informative=8, n_classes=2, random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score

# Generate synthetic data representing customer purchases
X, y = make_classification(n_samples=500, n_features=10, n_informative=8,
                           n_classes=2, random_state=42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Define the model
model = RandomForestClassifier(random_state=42)

# Define the hyperparameters grid
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 20]}

# Perform grid search with cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Extract the results
results = grid_search.cv_results_

# Extract hyperparameter combinations and their scores
# (the param_* entries are masked arrays; convert them to plain int arrays)
n_estimators_vals = np.array(results['param_n_estimators'].data, dtype=int)
max_depth_vals = np.array(results['param_max_depth'].data, dtype=int)
mean_test_scores = results['mean_test_score']

# Visualize the results
plt.figure(figsize=(10, 6))
plt.scatter(n_estimators_vals, max_depth_vals, c=mean_test_scores,
            cmap='viridis')
plt.colorbar(label='Mean Accuracy')
plt.xlabel('Number of Estimators')
plt.ylabel('Max Depth')
plt.title('Hyperparameter Optimization Results')
plt.show()

This code demonstrates the process of tuning hyperparameters for a
RandomForestClassifier using GridSearchCV, a method for automating the
search for optimal model parameters.
The synthetic data represents customer purchasing patterns, generated using
the make_classification function. This simulates a scenario where each
customer is described by 10 features, 8 of which are informative for
predicting customer behavior.
The dataset is then split into training and test sets, so that the grid
search runs on the training data only and the test set stays available for
a final evaluation.
The RandomForestClassifier is chosen because it is well-suited for
classification tasks and is robust to overfitting when tuned correctly.
Next, we define a grid of hyperparameters:
‣ n_estimators controls how many trees are in the forest, and
‣ max_depth controls how deep each tree is allowed to grow.
The GridSearchCV function runs cross-validation to evaluate the performance
of the model on each combination of these hyperparameters (3 x 3 = 9
combinations, each scored on 5 folds).
After fitting the model to the data, the cross-validation results are
stored in cv_results_, from which we extract the values of n_estimators,
max_depth, and their corresponding accuracy scores.
The scatter plot shows how accuracy changes with the different combinations
of n_estimators and max_depth, with colors indicating accuracy. This visual
insight helps you understand which hyperparameter combinations lead to
better performance, guiding further tuning or adjustments.
By visualizing the effect of hyperparameters, the process of model
optimization becomes more intuitive and accessible. It also shows whether
changes in these parameters have a large or a negligible effect on model
accuracy.
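The same cv_results_ information can also be read as a table. The sketch
below arranges the mean accuracy into a grid with one row per max_depth and
one column per n_estimators. It uses a smaller dataset, a smaller grid, and
3 folds than the exercise, purely to keep the example quick; those values
are assumptions, not recommendations.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, n_informative=8,
                           n_classes=2, random_state=42)

# A smaller grid than in the exercise, only to keep the example quick
param_grid = {'n_estimators': [20, 50], 'max_depth': [3, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                    cv=3, scoring='accuracy')
grid.fit(X, y)

# One row per combination: the parameter values plus the mean CV accuracy
scores = pd.DataFrame(grid.cv_results_['params'])
scores['mean_accuracy'] = grid.cv_results_['mean_test_score']

# Rows = max_depth, columns = n_estimators
table = scores.pivot(index='max_depth', columns='n_estimators',
                     values='mean_accuracy')
print(table.round(3))
print("best:", grid.best_params_, round(grid.best_score_, 3))
```

The largest cell in the table corresponds to best_params_ and best_score_,
which GridSearchCV reports directly.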
【Trivia】
‣ GridSearchCV can be quite time-consuming when there are many
hyperparameters to tune or large datasets. An alternative is
RandomizedSearchCV, which samples a subset of the parameter space
randomly, often yielding good results in a shorter time.
‣ Random Forest models work well for most classification problems because
they reduce variance in predictions through ensemble learning, which
aggregates predictions from multiple decision trees.
77. Visualizing Model Performance Across Data
Subsets in a Sales Prediction Scenario
Importance★★★★☆
Difficulty★★★☆☆
You are a data scientist working for an e-commerce company that wants to
predict sales based on historical data. The company operates in different
regions, and you are tasked with evaluating the performance of a machine
learning model on subsets of data representing different regions.
Using a pre-built dataset containing sales data from different regions,
split the data into subsets based on the 'region' feature. Train a machine
learning model (e.g., linear regression) on the full dataset, and then
evaluate the model's performance on each regional subset using the R² score.
Generate a plot to visualize the model's performance (R² score) for each
region. This will help the company understand how well the model
generalizes across different geographical locations.
Create synthetic sales data to simulate this scenario, ensuring each region
has its unique characteristics (i.e., different sales patterns). Then,
complete the task by training the model and visualizing the results.
【Data Generation Code Example】
import numpy as np

import pandas as pd

#Generate random regions

regions = np.random.choice(['North', 'South', 'East', 'West'], 1000)

#Generate synthetic sales data with different trends for each region

sales = np.array([500 + np.random.normal(50, 20) if region == 'North'
                  else 400 + np.random.normal(50, 25) if region == 'South'
                  else 600 + np.random.normal(60, 15) if region == 'East'
                  else 550 + np.random.normal(55, 30)
                  for region in regions])

#Create the dataset with features and target

data = pd.DataFrame({'region': regions, 'sales': sales,
                     'marketing_spend': np.random.normal(200, 30, 1000)})

data['week'] = np.random.randint(1, 52, 1000)


【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score

#Data creation (synthetic)
regions = np.random.choice(['North', 'South', 'East', 'West'], 1000)
sales = np.array([500 + np.random.normal(50, 20) if region == 'North'
                  else 400 + np.random.normal(50, 25) if region == 'South'
                  else 600 + np.random.normal(60, 15) if region == 'East'
                  else 550 + np.random.normal(55, 30)
                  for region in regions])
data = pd.DataFrame({'region': regions, 'sales': sales,
                     'marketing_spend': np.random.normal(200, 30, 1000)})
data['week'] = np.random.randint(1, 52, 1000)

#Preparing data
X = data[['marketing_spend', 'week']]
y = data['sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

#Model training
model = LinearRegression()
model.fit(X_train, y_train)

#Evaluation on the held-out test set (all regions together)
y_pred = model.predict(X_test)
full_data_r2 = r2_score(y_test, y_pred)

#Evaluating performance on subsets by region
regions_unique = data['region'].unique()
performance_by_region = {}
for region in regions_unique:
    subset = data[data['region'] == region]
    X_subset = subset[['marketing_spend', 'week']]
    y_subset = subset['sales']
    y_pred_subset = model.predict(X_subset)
    performance_by_region[region] = r2_score(y_subset, y_pred_subset)

#Visualizing performance
regions_list = list(performance_by_region.keys())
r2_scores = list(performance_by_region.values())
plt.figure(figsize=(10, 6))
plt.bar(regions_list, r2_scores)
plt.title('Model R² Score by Region')
plt.xlabel('Region')
plt.ylabel('R² Score')
plt.show()

In this exercise, we simulate a real-world scenario in which an e-commerce
company wants to understand how well a machine learning model performs in
predicting sales for different regions. The model's performance is
visualized by measuring the R² score for each regional subset.
First, we create synthetic data representing sales across different
regions, combining a fixed base level for each region with random
variation. The 'sales' column serves as the target variable, while
'marketing_spend' and 'week' act as the feature variables.
The dataset is split into training and test sets, and a Linear Regression
model is trained on the training portion to predict sales. After training
the model, we evaluate its performance on each regional subset of the full
dataset (which includes rows the model was trained on) by calculating the
R² score. This metric indicates how well the model's predictions match the
actual sales for each region.
Next, a bar plot is generated to show the R² scores across regions, giving
us a clear visualization of the model's generalization ability across
geographical locations. The visualization helps identify regions where the
model might underperform and where further tuning or additional data might
be needed.
Note that in this synthetic data, sales depend only on region, which is not
among the model's features, while 'marketing_spend' and 'week' are
generated independently of sales. The model can therefore do little better
than predict the overall average, so expect an R² near zero on the test set
and negative values within regions whose average sales differ from that
overall average.
Finally, the use of R² scores allows us to quantify the model's performance
in a standardized manner, which is easy to interpret for business
stakeholders. Comparing the scores across subsets checks whether the model
performs consistently and points to regions that may need specific
improvements.
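A per-region R² can fall below zero when a model's predictions are centered
on the wrong level for that region. The sketch below computes R² by hand
for a small made-up example; the sales figures and the constant prediction
of 500 are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot (the usual coefficient of determination)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of the subset mean
    return 1 - ss_res / ss_tot

# Hypothetical regional sales that sit well below the company-wide average
region_sales = [390, 400, 410]
# A model that ignores region and always predicts about the global mean
global_mean_predictions = [500, 500, 500]
# A model that predicts this region's own mean
regional_mean_predictions = [400, 400, 400]

print(r2(region_sales, global_mean_predictions))    # -150.0
print(r2(region_sales, regional_mean_predictions))  # 0.0
```

Here SS_res is 30200 and SS_tot is 200, so R² = 1 - 151 = -150: within this
region, the model is far worse than simply predicting the region's own mean.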
【Trivia】
The R² score is a widely used metric in regression tasks. It measures the
proportion of the variance in the dependent variable (sales, in this case) that
is predictable from the independent variables (features like marketing spend
and week). An R² score of 1 indicates a perfect fit, while a score of 0 means
the model does no better than always predicting the mean. The score can
also be negative when the model fits worse than that mean prediction on the
data being evaluated.
78. Visualizing the Impact of Neural Network
Layers on Model Accuracy
Importance★★★★★
Difficulty★★★☆☆
You are working for a retail company that is trying to predict customer
purchasing behavior based on various inputs such as age, income, and
shopping habits.
Your task is to visualize how adding layers to a simple neural network
affects its performance in predicting customer purchases (binary
classification: purchase or no purchase).
Generate a synthetic dataset that simulates customer features and use a
neural network to predict whether the customer will make a purchase.
Then, create a visualization to show how the number of hidden layers
affects the model's accuracy.
【Data Generation Code Example】
import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split


from sklearn.preprocessing import StandardScaler

# Generate synthetic customer data (columns: Age, Income, Shopping Habits)
data = np.random.rand(1000, 3) * [100, 200000, 1]
# Customers with income over 100,000 are more likely to purchase
labels = (data[:, 1] > 100000).astype(int)

# Create a DataFrame for easier manipulation
df = pd.DataFrame(data, columns=['Age', 'Income', 'Shopping_Habits'])
df['Purchase'] = labels

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(
    df[['Age', 'Income', 'Shopping_Habits']], df['Purchase'],
    test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
【Diagram Answer】

【Code Answer】
import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense

import matplotlib.pyplot as plt


# Generate synthetic customer data (columns: Age, Income, Shopping Habits)
data = np.random.rand(1000, 3) * [100, 200000, 1]
# Customers with income over 100,000 are more likely to purchase
labels = (data[:, 1] > 100000).astype(int)

# Create a DataFrame for easier manipulation
df = pd.DataFrame(data, columns=['Age', 'Income', 'Shopping_Habits'])
df['Purchase'] = labels

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(
    df[['Age', 'Income', 'Shopping_Habits']], df['Purchase'],
    test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Function to build and evaluate neural network with variable layers
def build_and_evaluate_model(layers):
    # The first Dense layer is itself a hidden layer, so add (layers - 1)
    # more to get exactly `layers` hidden layers, matching the plot's x-axis
    model = Sequential(
        [Dense(16, input_dim=3, activation='relu')]
        + [Dense(16, activation='relu') for _ in range(layers - 1)]
        + [Dense(1, activation='sigmoid')])
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    model.fit(X_train_scaled, y_train, epochs=10, batch_size=10, verbose=0)
    loss, accuracy = model.evaluate(X_test_scaled, y_test, verbose=0)
    return accuracy

# Test models with different numbers of hidden layers
layers_range = list(range(1, 6))
accuracies = [build_and_evaluate_model(layers) for layers in layers_range]

# Plot the results
plt.plot(layers_range, accuracies)
plt.title('Impact of Hidden Layers on Model Accuracy')
plt.xlabel('Number of Hidden Layers')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()

In this exercise, you are building a neural network using the Sequential API
from the tensorflow.keras library.
The goal is to understand how adding more hidden layers to the model
impacts its accuracy in predicting whether a customer will make a purchase.
First, we generate synthetic data to simulate customers with three features:
age, income, and shopping habits.
We assume customers with higher income are more likely to purchase, so
the purchase label is based on whether the income is over 100,000.
This data is then split into training and test sets.
We scale the features using StandardScaler to improve model performance
by ensuring all inputs are on the same scale.
The neural network is created with a flexible number of hidden layers,
controlled by the build_and_evaluate_model function.
Each layer uses the relu activation function, which introduces non-linearity
and helps the network learn complex patterns.
The output layer has one neuron and uses the sigmoid activation function,
suitable for binary classification tasks.
The model is trained using binary crossentropy as the loss function and the
Adam optimizer.
After training, we evaluate the model's performance using accuracy as a
metric.
Finally, the script generates a plot to show the relationship between the
number of hidden layers and model accuracy.
This allows you to visualize how model complexity (in terms of hidden
layers) influences predictive performance.
【Trivia】
Did you know? The relu activation function is often preferred in neural
networks because it reduces the likelihood of vanishing gradients, a
common issue in deep learning. This helps the model train faster and
achieve better performance in many cases.
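The vanishing-gradient point can be illustrated with a few lines of NumPy. This is a simplified sketch, not part of the exercise: it ignores the weights and looks only at the activation derivatives that get multiplied together during backpropagation through a stack of layers. The pre-activation value `z` and the depth are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def sigmoid_grad(z):
    # Derivative of the sigmoid; its maximum value is 0.25 (at z = 0)
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

z = np.array(2.0)   # assumed pre-activation value, the same in every layer
depth = 10          # assumed number of stacked layers

# Backpropagation multiplies one activation derivative per layer
sigmoid_chain = sigmoid_grad(z) ** depth
relu_chain = relu_grad(z) ** depth

print(f"sigmoid: {sigmoid_chain:.2e}, relu: {relu_chain:.2e}")
```

With sigmoid the product shrinks toward zero as depth grows, while an active ReLU unit passes the gradient through unchanged. ReLU has its own failure mode: units whose input stays negative receive a zero gradient.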
79. Visualizing the Gradient Boosting Process for
Customer Churn Prediction
Importance★★★★★
Difficulty★★★☆☆
A telecom company wants to predict customer churn, as retaining
customers is critical for its business success. The dataset includes customer
information such as monthly charges, contract duration, and service usage
details. Your task is to train a Gradient Boosting model to predict whether a
customer is likely to churn based on this dataset. After training the
model, visualize the contribution of each feature (importance) in the
decision-making process by generating a plot of feature importances.
Implement the following steps:
‣ Generate a synthetic dataset to simulate customer data (features such as
monthly charges, contract duration, etc.).
‣ Train a Gradient Boosting Classifier to predict churn.
‣ Plot a feature importance graph to visualize which features contribute
most to the prediction.
【Data Generation Code Example】
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Create a synthetic dataset with random values
data = pd.DataFrame({
    'MonthlyCharges': np.random.randint(20, 100, 1000),
    'ContractDuration': np.random.randint(1, 36, 1000),
    'ServiceUsage': np.random.randint(0, 500, 1000),
    'CustomerAge': np.random.randint(18, 80, 1000),
    'Churn': np.random.choice([0, 1], 1000)
})
X = data.drop('Churn', axis=1)
y = data['Churn']
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Create a synthetic dataset with random values
data = pd.DataFrame({
    'MonthlyCharges': np.random.randint(20, 100, 1000),
    'ContractDuration': np.random.randint(1, 36, 1000),
    'ServiceUsage': np.random.randint(0, 500, 1000),
    'CustomerAge': np.random.randint(18, 80, 1000),
    'Churn': np.random.choice([0, 1], 1000)
})
X = data.drop('Churn', axis=1)
y = data['Churn']
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train a Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)

# Get feature importances
feature_importances = gb_model.feature_importances_

# Plot the feature importances
features = X.columns
plt.figure(figsize=(8, 6))
plt.barh(features, feature_importances, color='blue')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance in Gradient Boosting Model')
plt.show()


In this exercise, we train a Gradient Boosting model to predict customer
churn and visualize which features drive its predictions. Gradient Boosting
is a powerful ensemble learning technique that builds models sequentially,
where each model tries to correct the errors of the previous ones. In our
case, we train a Gradient Boosting Classifier to predict whether a customer
will churn based on monthly charges, contract duration, service usage, and
customer age.
We start by creating a synthetic dataset with random data. The dataset
includes features that matter to telecom companies, such as customer
spending and service usage, along with a target (dependent variable)
recording whether a customer churned. Because the Churn label here is drawn
at random, independently of the features, the importances in this example
reflect noise rather than real drivers of churn; with real data the ranking
becomes meaningful.
Once the data is generated, we split it into training and testing sets so
that the model can be checked on unseen data, which is how overfitting is
detected. The GradientBoostingClassifier is then trained on the training
set. This classifier fits multiple decision trees sequentially, each one
learning from the mistakes made by the previous trees.
After training the model, we extract the feature importances from the
feature_importances_ attribute. These values, which sum to 1, tell us how
much each feature contributed to the splits made across all trees. We
visualize them with a horizontal bar plot in matplotlib, labeling the axes
and setting a title for clarity.
Understanding feature importances is valuable because it allows businesses
to focus on the key drivers of customer behavior, such as whether contract
duration or service usage is a bigger factor in customer churn. This
information can lead to more targeted strategies for reducing churn.
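To look at the sequential boosting process itself rather than only the final importances, one option is scikit-learn's `staged_predict`, which yields the ensemble's predictions after each added tree. The sketch below is an illustration under assumptions that are not part of the exercise: the churn label is generated from an invented rule (short contracts and high charges raise the churn probability) so that there is a signal to learn, and only two features are used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
monthly_charges = rng.integers(20, 100, n)
contract_duration = rng.integers(1, 36, n)

# Assumed rule (illustration only): short contracts and high charges churn more
logit = -0.15 * (contract_duration - 18) + 0.05 * (monthly_charges - 60)
churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([monthly_charges, contract_duration])
X_train, X_test, y_train, y_test = train_test_split(
    X, churn, test_size=0.3, random_state=42)

gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)

# Test accuracy of the partial ensemble after 1, 2, ..., 100 trees
staged_accuracy = [accuracy_score(y_test, pred)
                   for pred in gb.staged_predict(X_test)]
print(staged_accuracy[0], staged_accuracy[-1])
```

Plotting `staged_accuracy` against the number of trees shows how the ensemble changes as each tree is added, and a curve that flattens or declines suggests that fewer trees would suffice.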
【Trivia】
Gradient Boosting was popularized by the development of the XGBoost
algorithm, which is widely used in machine learning competitions for its
speed and accuracy.
80. Visualizing the Outputs of a Simple Recurrent
Neural Network (RNN) for Time-Series Forecasting
Importance★★★★☆
Difficulty★★★☆☆
You are working for a company that predicts the demand for products using
time-series data. The company has collected daily sales data for the past
two years and wants to forecast the next 30 days' sales using a Recurrent
Neural Network (RNN). Your task is to create a simple RNN model to
process this time-series data, train the model, and visualize the predicted
outputs over time. You will generate the input data for this task. First, create
synthetic time-series data that simulates product demand. Then, implement
an RNN model to forecast the next 30 days based on the previous 100 days'
sales. Finally, plot the actual and predicted sales values on a graph. The goal
is to train a model, generate predictions, and visualize the results. Your
solution should use Python and must include code to generate the graph.
【Data Generation Code Example】
import numpy as np

import matplotlib.pyplot as plt

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import SimpleRNN, Dense

## Generate synthetic time-series data

data = np.sin(np.arange(0, 500) * 0.1) + np.random.normal(0, 0.1, 500)


## Prepare input sequences and targets for RNN

X = np.array([data[i:i+100] for i in range(len(data) - 100)])

y = np.array([data[i+100] for i in range(len(data) - 100)])


【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

## Generate synthetic time-series data
data = np.sin(np.arange(0, 500) * 0.1) + np.random.normal(0, 0.1, 500)

## Prepare input sequences and targets for RNN
X = np.array([data[i:i+100] for i in range(len(data) - 100)])
y = np.array([data[i+100] for i in range(len(data) - 100)])

## Reshape input for RNN (samples, time steps, features)
X = X.reshape((X.shape[0], X.shape[1], 1))

## Build a simple RNN model
model = Sequential([SimpleRNN(50, activation='tanh', input_shape=(100, 1)),
                    Dense(1)])
model.compile(optimizer='adam', loss='mse')

## Train the model
model.fit(X, y, epochs=20, verbose=0)

## Predict the next 30 days, feeding each prediction back into the input
preds = []
input_seq = data[-100:]
for _ in range(30):
    input_seq = input_seq.reshape((1, 100, 1))
    pred = model.predict(input_seq, verbose=0)
    preds.append(pred[0][0])
    # Append the prediction and drop the oldest value to keep 100 steps
    input_seq = np.append(input_seq[0], pred[0][0])[1:]

## Plot the results
plt.figure(figsize=(10, 6))
plt.plot(range(500), data, label="Actual Sales")
plt.plot(range(500, 530), preds, label="Predicted Sales")
plt.title("Actual vs Predicted Sales Forecast")
plt.xlabel("Days")
plt.ylabel("Sales")
plt.legend()
plt.show()


In this exercise, you use a Recurrent Neural Network (RNN) to predict
future values in a time-series dataset. The task begins by generating
synthetic data that mimics real-world product sales over a period of 500
days. We combine a sine wave with random noise to simulate the fluctuations
and unpredictability seen in real sales data.
To prepare the data for the RNN, we break it down into sequences. For each
sequence of 100 past days, we train the model to predict the sales value for
the next day. This kind of setup, where previous data is used to predict the
future, is common in time-series forecasting problems.
The RNN model is built using the Sequential API from Keras, with a
SimpleRNN layer and a Dense layer. The RNN layer has 50 units and uses the
tanh activation function, the default for SimpleRNN, which keeps the hidden
state bounded between -1 and 1 as it is fed back at each time step. The final
layer is a Dense layer with a single neuron, which outputs the predicted
sales value for the next day.
We train the model using the Adam optimizer and mean squared error (MSE)
as the loss function. After training, the model is used to predict sales for
the next 30 days, using the last 100 days' data as input. This process is
repeated iteratively: after each prediction, the predicted value is appended
to the input sequence and the oldest value is dropped, allowing us to make
consecutive forecasts. Because each forecast feeds on earlier forecasts,
errors can accumulate over the 30 steps.
Finally, we visualize the results by plotting the 500 days of actual sales
followed by the 30 predicted days. No actual values exist for the forecast
period, so the plot shows whether the forecast continues the pattern
plausibly rather than measuring accuracy directly; holding out the final
segment of the data would be needed for that. This visualization is still
useful for spotting the model's strengths and weaknesses when applied to
time-series forecasting tasks.
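The windowing and the rolling forecast are easier to follow with the array shapes spelled out. The sketch below reproduces both steps with NumPy only. The function `predict_next` is a hypothetical stand-in for the trained model (it simply averages the last three values), used so the rolling mechanics can be run without TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.sin(np.arange(500) * 0.1) + rng.normal(0, 0.1, 500)

window = 100
# Each row of X holds 100 consecutive days; y holds the day that follows
X = np.array([data[i:i + window] for i in range(len(data) - window)])
y = data[window:]
X = X.reshape((X.shape[0], window, 1))   # (samples, time steps, features)
print(X.shape, y.shape)

def predict_next(seq):
    # Hypothetical stand-in for model.predict: mean of the last three values
    return float(np.mean(seq[-3:]))

# Rolling forecast: append each prediction, drop the oldest value
seq = data[-window:].copy()
preds = []
for _ in range(30):
    nxt = predict_next(seq)
    preds.append(nxt)
    seq = np.append(seq[1:], nxt)
```

After the loop, `seq` still holds exactly 100 values, the last 30 of which are predictions rather than observations.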
【Trivia】
RNNs are particularly suited for time-series data because they retain
information from previous time steps, allowing them to capture temporal
dependencies. However, in practice, they are often replaced by more
advanced models like LSTMs (Long Short-Term Memory) or GRUs (Gated
Recurrent Units), which address the problem of vanishing gradients and can
remember dependencies over longer time periods.
81. Visualizing Data Transformation in Machine
Learning
Importance★★★★☆
Difficulty★★★☆☆
A retail company wants to analyze the sales performance of its various
stores based on monthly sales data. The goal is to identify any trends or
patterns that could inform their decision-making for optimizing inventory
levels and marketing efforts. You are asked to process the sales data for five
stores over a period of 12 months, visualize the transformed data, and
explain how data transformation improves the model training process.
Your task is to:
‣ Create a dataset that simulates monthly sales data for five stores.
‣ Normalize this data using Min-Max scaling to make it suitable for
machine learning models.
‣ Visualize the original data and the normalized data side-by-side using
line plots to show how data transformation impacts the scale of the features.
【Data Generation Code Example】
import numpy as np
import pandas as pd

np.random.seed(42)
# Generate random sales data for 5 stores over 12 months
data = {f'Store_{i}': np.random.randint(200, 2000, 12) for i in range(1, 6)}
# 'MS' (month start) is used because the 'M' alias is deprecated in recent pandas
months = pd.date_range('2023-01-01', periods=12, freq='MS')
sales_data = pd.DataFrame(data, index=months)
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

np.random.seed(42)
# Generate random sales data for 5 stores over 12 months
data = {f'Store_{i}': np.random.randint(200, 2000, 12) for i in range(1, 6)}
# 'MS' (month start) is used because the 'M' alias is deprecated in recent pandas
months = pd.date_range('2023-01-01', periods=12, freq='MS')
sales_data = pd.DataFrame(data, index=months)

# Normalize data using Min-Max Scaler (each store column is scaled separately)
scaler = MinMaxScaler()
normalized_data = pd.DataFrame(scaler.fit_transform(sales_data),
                               index=months, columns=sales_data.columns)

# Plot original and normalized data
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
plt.plot(sales_data.index, sales_data)
plt.title('Original Sales Data')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend(sales_data.columns, loc='upper right')
plt.subplot(1, 2, 2)
plt.plot(normalized_data.index, normalized_data)
plt.title('Normalized Sales Data')
plt.xlabel('Month')
plt.ylabel('Normalized Sales')
plt.legend(normalized_data.columns, loc='upper right')
plt.tight_layout()
plt.show()
In machine learning, data preprocessing is a critical step that directly affects
the performance of models. One common preprocessing technique is data
normalization. In this exercise, we used Min-Max scaling to normalize sales
data for multiple stores. Min-Max scaling transforms the data to a fixed
range, typically between 0 and 1, by subtracting the minimum value of a
feature and dividing by the range (max - min). This ensures that all features
are on the same scale, which is important for gradient descent-based
methods, where large feature values could dominate smaller ones, leading to
biased results.
First, we generated a dataset that simulated the sales figures for five stores
across 12 months. In this synthetic data every store draws from the same
range (200 to 2,000), so the raw scales are already similar; in real data a
flagship store might sell many times more than a small outlet, which can be
problematic for machine learning models. MinMaxScaler treats each column
(store) separately, mapping that store's lowest month to 0 and its highest
month to 1.
Visualizing both the original and normalized data helps us understand how
scaling affects the data. The two plots have the same shape for every store;
only the vertical axis changes, from raw sales to values between 0 and 1.
With all stores in the same range, no store has undue influence on the
outcome simply because its numbers are larger.
Data normalization is particularly useful when using algorithms that rely on
distance metrics (such as K-nearest neighbors or support vector machines)
or optimization algorithms like gradient descent. These models perform
better when the input features are scaled consistently. Normalization also
helps prevent numerical instability and can speed up the convergence of
models.
In conclusion, transforming data through techniques like Min-Max scaling
helps ensure that machine learning models can effectively process and learn
from features that might originally be on vastly different scales.
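The Min-Max formula, x' = (x - min) / (max - min), can be checked by hand against scikit-learn. The small sketch below uses a made-up three-month, two-store table (the numbers are assumptions for illustration) and confirms that the manual calculation matches `MinMaxScaler`, which scales each column independently.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sales: rows are months, columns are two stores
sales = np.array([[200.0, 1500.0],
                  [1100.0, 1800.0],
                  [2000.0, 900.0]])

# Apply the formula column by column
col_min = sales.min(axis=0)
col_max = sales.max(axis=0)
manual = (sales - col_min) / (col_max - col_min)

# The same transformation using scikit-learn
sklearn_scaled = MinMaxScaler().fit_transform(sales)

print(manual)
print(np.allclose(manual, sklearn_scaled))
```

For the first store, 200 maps to 0, 2000 maps to 1, and 1100 lands halfway at 0.5.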
【Trivia】
Did you know that normalization is not always necessary? Some machine
learning algorithms, like decision trees or random forests, are not sensitive
to feature scaling because they make decisions based on feature thresholds
rather than distances. However, for models relying on distance or
optimization, normalization can be crucial!
82. Visualizing the Impact of Feature Scaling in
Machine Learning
Importance★★★★★
Difficulty★★★☆☆
A financial institution has a customer dataset containing various features
such as income, loan amount, and credit score. Your task is to build a
machine learning model that predicts whether a customer will default on a
loan based on these features. Before building the model, you need to
visualize how scaling affects the features and the model performance.
‣ Create a synthetic dataset with features: income, loan amount, credit
score, and a binary target variable (loan default: 0 or 1).
‣ Plot the features using scatter plots before and after scaling.
‣ Use StandardScaler for scaling.
‣ Demonstrate the importance of feature scaling by observing how different
scales impact the model's accuracy.
【Data Generation Code Example】
import numpy as np
import pandas as pd

np.random.seed(42)
n_samples = 100
data = {
    'income': np.random.normal(50000, 15000, n_samples),
    'loan_amount': np.random.normal(30000, 10000, n_samples),
    'credit_score': np.random.normal(700, 100, n_samples),
    'default': np.random.choice([0, 1], size=n_samples)
}
df = pd.DataFrame(data)
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
data = {
    'income': np.random.normal(50000, 15000, 100),
    'loan_amount': np.random.normal(30000, 10000, 100),
    'credit_score': np.random.normal(700, 100, 100),
    'default': np.random.choice([0, 1], size=100)
}
df = pd.DataFrame(data)

# Create the feature matrix (X) and target vector (y)
X = df[['income', 'loan_amount', 'credit_score']]
y = df['default']

# Plot the data before scaling
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(X['income'], X['loan_amount'], c=y)
plt.title('Before Scaling')
plt.xlabel('Income')
plt.ylabel('Loan Amount')

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled,
                        columns=['income', 'loan_amount', 'credit_score'])

# Plot the data after scaling
plt.subplot(1, 2, 2)
plt.scatter(X_scaled['income'], X_scaled['loan_amount'], c=y)
plt.title('After Scaling')
plt.xlabel('Income (scaled)')
plt.ylabel('Loan Amount (scaled)')
plt.tight_layout()
plt.show()


In this exercise, the goal is to demonstrate how scaling numerical features
can impact machine learning models. In machine learning, features often
have different ranges, which can affect model performance, especially for
algorithms sensitive to feature magnitude, such as gradient descent-based
models.
We begin by creating a synthetic dataset that simulates customer income,
loan amount, credit score, and a binary target variable representing whether
the customer defaulted on a loan. Before applying any scaling, we draw a
scatter plot of income against loan amount. This plot shows the raw,
unscaled data.
Next, we use StandardScaler from the sklearn library to scale the features.
StandardScaler standardizes each feature by subtracting its mean and
dividing by its standard deviation, so that every feature has zero mean and
unit variance. After scaling, we generate a second scatter plot to visualize
the transformed data. The relative positions of the points are unchanged;
only the axes differ, now centered on 0 and measured in standard
deviations, which makes the features comparable.
This matters for many machine learning algorithms: gradient-based models
tend to converge faster on standardized inputs, and distance-based models
no longer let the feature with the largest units dominate. The code here
stops at visualization; measuring the effect on accuracy requires training a
scale-sensitive model on both the raw and the scaled features.
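The task statement also asks how scaling affects accuracy, which the plots alone do not show. The sketch below is one way to check it, under assumptions that are not part of the exercise: it uses a hypothetical `debt_ratio` feature that fully determines default, alongside an uninformative income feature with a much larger scale, and compares a k-nearest neighbors classifier with and without StandardScaler.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1000
income = rng.normal(50000, 15000, n)       # large scale, carries no signal here
debt_ratio = rng.uniform(0, 1, n)          # hypothetical feature on a 0-1 scale
default = (debt_ratio > 0.5).astype(int)   # assumed rule, for illustration only

X = np.column_stack([income, debt_ratio])
X_train, X_test, y_train, y_test = train_test_split(
    X, default, test_size=0.3, random_state=42)

# KNN on raw features: distances are dominated by income
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# KNN on standardized features: both features contribute to distances
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)

acc_raw = raw_knn.score(X_test, y_test)
acc_scaled = scaled_knn.score(X_test, y_test)
print(f"raw: {acc_raw:.2f}, scaled: {acc_scaled:.2f}")
```

Putting the scaler inside a pipeline means it is fitted on the training split only, so no information from the test set leaks into the scaling parameters.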
【Trivia】
Feature scaling is not always necessary for all machine learning models. For
instance, tree-based models like decision trees, random forests, and gradient
boosting are less sensitive to feature scaling because they split data based
on the feature's value at each node, regardless of its scale. However,
algorithms like support vector machines, k-nearest neighbors, and neural
networks often benefit significantly from scaled data.
83. Visualizing Cross-Validation Score Distributions
in Machine Learning Models
Importance★★★★☆
Difficulty★★★☆☆
You work as a data scientist at a company that is developing a machine
learning model to predict customer churn. Your task is to use
cross-validation to evaluate the performance of different machine learning
models. To better communicate results with your stakeholders, you want to
visualize the distribution of cross-validation scores for each model. The goal
is to demonstrate how consistent (or inconsistent) the models are by
visualizing the range of their performance. Generate random data for this
task, and then compare the performance of a Decision Tree and a Random
Forest model using cross-validation. Plot the cross-validation score
distributions of both models in a single graph to visually compare their
performance.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20,
n_informative=15, random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=15, random_state=42)

# Initialize models
models = {'Decision Tree': DecisionTreeClassifier(random_state=42),
          'Random Forest': RandomForestClassifier(random_state=42)}

# Perform cross-validation for each model and store the scores
cv_scores = {name: cross_val_score(model, X, y, cv=10)
             for name, model in models.items()}

# Plot cross-validation score distributions
plt.figure(figsize=(10, 6))
for name, scores in cv_scores.items():
    plt.hist(scores, bins=10, alpha=0.5, label=name)
plt.title('Cross-Validation Score Distributions')
plt.xlabel('Accuracy')
plt.ylabel('Frequency')
plt.legend(loc='best')
plt.show()


In this exercise, we aim to visualize the distribution of cross-validation
scores for two machine learning models: a Decision Tree and a Random
Forest.
First, we generate a synthetic dataset using make_classification. This
function creates a classification problem with 1,000 samples and 20
features, where 15 features are informative.
We then define two machine learning models: a Decision Tree
(DecisionTreeClassifier) and a Random Forest (RandomForestClassifier).
These are popular models for classification tasks.
Cross-validation is performed using the cross_val_score function, which
evaluates each model's performance across 10 folds. The cross-validation
process divides the dataset into 10 subsets, trains the model on 9 of them,
and tests it on the remaining 1. This is repeated for all 10 subsets, and the
accuracy score for each fold is recorded.
The main task is to plot the distribution of cross-validation scores for each
model. The hist function from matplotlib is used to create a histogram for
each model's cross-validation scores. We set alpha=0.5 to make the
histograms semi-transparent, allowing us to compare the distributions of
both models on the same graph. The legend function is used to label the
models in the plot.
This visualization provides insight into how the models perform over
different folds of the data. For example, a model with a tighter distribution
suggests more consistent performance, while a model with a wider spread
may indicate performance variability.
【Trivia】
Did you know that cross-validation is not just for measuring accuracy? It
can also be used to estimate various metrics, like precision, recall, and
F1-score, especially when dealing with imbalanced datasets.
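As a sketch of that idea, scikit-learn's `cross_validate` accepts several scoring names at once and returns one array of per-fold scores for each. The class imbalance (`weights=[0.8, 0.2]`) and the smaller sample and forest sizes below are assumptions chosen to keep the example quick, not settings from the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Imbalanced synthetic data: roughly 80% of samples in class 0
X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                           weights=[0.8, 0.2], random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)
metrics = ['precision', 'recall', 'f1']
results = cross_validate(model, X, y, cv=5, scoring=metrics)

# Per-fold scores are stored under keys such as 'test_precision'
summary = {m: results[f'test_{m}'].mean() for m in metrics}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The per-fold arrays in `results` can be passed to `plt.hist` or `plt.boxplot` in the same way as the accuracy scores in the exercise.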
84. Visualizing Algorithmic Complexity in Machine
Learning
Importance★★★★☆
Difficulty★★★☆☆
A company is trying to optimize its sales predictions based on historical
sales data. The company’s concern is not only accuracy but also the
computational complexity of the algorithm used for prediction as their
dataset grows larger. Your task is to analyze the computational complexity
of three different machine learning algorithms using a generated
dataset. You must train three models: a linear regression model, a decision
tree, and a k-nearest neighbors (KNN) algorithm. Visualize the time taken
for training these models on datasets of increasing size, and plot the
results. The company is interested in knowing how the time scales with
increasing dataset size for each algorithm. Use Python and libraries like
scikit-learn and matplotlib to generate your visualizations.
【Data Generation Code Example】
import numpy as np
from sklearn.datasets import make_regression
from time import time

## Generate random datasets of increasing sizes
dataset_sizes = [100, 500, 1000, 5000, 10000]
datasets = [make_regression(n_samples=size, n_features=20, noise=0.1)
            for size in dataset_sizes]
【Diagram Answer】

【Code Answer】
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import make_regression
from time import time
import matplotlib.pyplot as plt

## Dataset generation
dataset_sizes = [100, 500, 1000, 5000, 10000]
datasets = [make_regression(n_samples=size, n_features=20, noise=0.1)
            for size in dataset_sizes]

## Algorithms initialization
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(),
    'K-Nearest Neighbors': KNeighborsRegressor()
}

## Store training times
training_times = {model_name: [] for model_name in models.keys()}

## Train each model on datasets of increasing size and record training time
for model_name, model in models.items():
    for X, y in datasets:
        start_time = time()
        model.fit(X, y)
        training_times[model_name].append(time() - start_time)

## Plot the results
for model_name, times in training_times.items():
    plt.plot(dataset_sizes, times, label=model_name)
plt.xlabel('Dataset Size')
plt.ylabel('Training Time (seconds)')
plt.title('Training Time vs Dataset Size for Different Algorithms')
plt.legend()
plt.show()


This exercise compares the training times of three machine learning
models, linear regression, a decision tree, and k-nearest neighbors (KNN),
as the size of the dataset increases.
The problem involves generating synthetic datasets of increasing sizes using
make_regression. This function produces regression data with noise and a
specified number of features.
You first initialize three different algorithms: LinearRegression for linear
models, DecisionTreeRegressor for tree-based models, and
KNeighborsRegressor for the KNN algorithm. These are the models whose
training cost will be analyzed.
The code iterates through the generated datasets and fits each model to each
dataset. It records the time taken to fit the model using the time() function
from the time module. This shows how training time grows with dataset
size.
Finally, the results are visualized using matplotlib. The plt.plot() function
creates a line graph for each model, plotting dataset size against the time
taken to train the model. This allows for a clear comparison of how different
algorithms scale with dataset size.
For example, linear regression usually fits quickly, and so does KNN,
because fitting a KNN model mostly just stores the training data, while the
decision tree takes longer as the dataset grows. Keep in mind that KNN
pays its cost at prediction time, which this experiment does not measure,
and that single timings on the smallest datasets are noisy. This exercise
helps you understand the trade-offs between complexity and performance in
machine learning algorithms.
【Trivia】
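The complexity of the algorithms in this example varies. As rough guides
for n samples and p features:
‣ Linear regression solved by least squares costs about O(n·p²), which is
linear in n for a fixed number of features, so it scales well.
‣ Decision trees cost about O(n·p·log n) to train when the tree is
reasonably balanced, so they scale moderately well but can get slower with
large datasets or very deep trees.
‣ K-nearest neighbors does almost no work at training time. Its cost comes
at prediction: a brute-force search computes the distance from each query to
all n training points, so predicting for n queries costs about O(n²·p),
making it less efficient for large datasets.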
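Since KNN's cost is concentrated at prediction time, timing `fit` alone can be misleading. The sketch below measures fit and predict separately for a brute-force KNN regressor. The helper name `time_fit_and_predict`, the dataset sizes, and the choice of `algorithm='brute'` are assumptions for illustration, and the measured times will vary by machine.

```python
from time import perf_counter
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

def time_fit_and_predict(n_samples):
    """Return (fit_seconds, predict_seconds) for a brute-force KNN model."""
    X, y = make_regression(n_samples=n_samples, n_features=20,
                           noise=0.1, random_state=0)
    model = KNeighborsRegressor(algorithm='brute')

    start = perf_counter()
    model.fit(X, y)            # mostly stores the training data
    fit_seconds = perf_counter() - start

    start = perf_counter()
    model.predict(X)           # computes distances to every training point
    predict_seconds = perf_counter() - start
    return fit_seconds, predict_seconds

timings = {n: time_fit_and_predict(n) for n in [500, 1000, 2000]}
for n, (fit_s, pred_s) in timings.items():
    print(f"n={n}: fit {fit_s:.4f}s, predict {pred_s:.4f}s")
```

`perf_counter` is used here instead of `time` because it is intended for measuring short durations. Repeating each measurement and taking the median would give steadier numbers.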
85. Visualizing the Impact of Hyperparameter
Tuning on Model Accuracy
Importance★★★★☆
Difficulty★★★☆☆
A retail company is developing a machine learning model to predict
customer churn based on historical customer data. The goal is to tune
hyperparameters of a decision tree classifier to find the optimal
configuration that maximizes model accuracy. You need to visualize the
relationship between different values of the hyperparameters max_depth
and min_samples_split, and the model's performance (accuracy) on a test
dataset. The company has provided a sample dataset, but for this exercise,
create your own data. Use this data to train and validate the decision tree
model, then generate a heatmap to show how different combinations of
hyperparameters impact accuracy.
【Data Generation Code Example】
import numpy as np

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
n_informative=5, n_classes=2)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


【Diagram Answer】

【Code Answer】
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns

# Creating the sample data
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Defining the model and the grid of hyperparameters to search over
param_grid = {'max_depth': [3, 5, 7, 10], 'min_samples_split': [2, 5, 10]}
clf = DecisionTreeClassifier()
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Extracting the grid results
# Rows follow max_depth and columns follow min_samples_split
results = grid_search.cv_results_
scores = results['mean_test_score'].reshape(
    len(param_grid['max_depth']), len(param_grid['min_samples_split']))

# Creating a heatmap of accuracy based on hyperparameter combinations
plt.figure(figsize=(10, 6))
sns.heatmap(scores, annot=True,
            xticklabels=param_grid['min_samples_split'],
            yticklabels=param_grid['max_depth'])
plt.title('Hyperparameter Tuning - Accuracy Heatmap')
plt.xlabel('min_samples_split')
plt.ylabel('max_depth')
plt.show()


In this exercise, you will visualize how different hyperparameters of a


Decision Tree classifier impact model performance, specifically accuracy.
We begin by generating synthetic data using the make_classification
function, which simulates a binary classification problem. The data is split
into training and testing sets using train_test_split. This allows us to test
how well the model generalizes on unseen data.
The core of this task is tuning the hyperparameters max_depth (the
maximum depth of the tree) and min_samples_split (the minimum number
of samples required to split a node). These hyperparameters affect the
complexity and performance of the Decision Tree. We use GridSearchCV,
which automates the search across different values of these
hyperparameters and evaluates the model performance using cross-
validation.
After training, we extract the mean test accuracy scores for each
hyperparameter combination and reshape them into a format suitable for
visualization. We then plot these results using a heatmap (via seaborn),
where the x-axis represents min_samples_split, the y-axis represents
max_depth, and the color intensity represents accuracy. The heatmap
provides a clear view of how different hyperparameter values influence
model accuracy, helping you find the best configuration.
By visualizing the hyperparameters in this way, you can quickly identify
which combinations result in better performance, improving the
effectiveness of your model tuning.
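The reshape step in the Code Answer relies on the order in which GridSearchCV stores its results. As a hedged alternative, the sketch below builds the same accuracy table by pivoting cv_results_ on the parameter columns, so each score is placed by its parameter values rather than by its position. The random_state values and the name score_table are assumptions added for illustration and are not part of the original exercise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: random_state is fixed here only so the run is repeatable
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

param_grid = {'max_depth': [3, 5, 7, 10], 'min_samples_split': [2, 5, 10]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           param_grid, cv=5)
grid_search.fit(X_train, y_train)

# cv_results_ holds one row per combination, with param_<name> columns
results = pd.DataFrame(grid_search.cv_results_)
results['max_depth'] = results['param_max_depth'].astype(int)
results['min_samples_split'] = results['param_min_samples_split'].astype(int)

# Rows are max_depth values, columns are min_samples_split values
score_table = results.pivot(index='max_depth',
                            columns='min_samples_split',
                            values='mean_test_score')

print(score_table.round(3))
print('Best parameters:', grid_search.best_params_)
print('Held-out test accuracy:', grid_search.score(X_test, y_test))
```

score_table can be passed to sns.heatmap in place of scores; its index and column labels then supply the tick labels. The last line also puts the otherwise unused test set to work by scoring the refitted best model on it.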
【Trivia】
Did you know that hyperparameter tuning can significantly improve model
performance but also increase computational costs? Grid search, used here,
is exhaustive, so its cost grows with every value added to the grid. This is
why techniques like Randomized Search or Bayesian Optimization are often
preferred in practice when the search space or the dataset is large!
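As a rough illustration of the Randomized Search mentioned above, the sketch below samples a fixed number of combinations with scikit-learn's RandomizedSearchCV instead of trying all of them. The value ranges, n_iter=8, and the random_state values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: random_state is fixed here only so the run is repeatable
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_classes=2, random_state=0)

# 18 * 19 = 342 possible combinations; a full grid search would fit all of them
param_distributions = {
    'max_depth': list(range(3, 21)),
    'min_samples_split': list(range(2, 21)),
}

# Only n_iter combinations are sampled and cross-validated
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=8,
    cv=5,
    random_state=0,
)
random_search.fit(X, y)

print('Combinations tried:', len(random_search.cv_results_['params']))
print('Best parameters:', random_search.best_params_)
print('Best CV accuracy:', round(random_search.best_score_, 3))
```

The trade-off is that the best sampled combination is not guaranteed to be the best one in the full space; raising n_iter spends more compute to narrow that gap.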
86. Visualizing a Decision Tree for Customer Churn Prediction
Importance★★★★☆
Difficulty★★★☆☆
You are a data scientist working for a telecommunications company. The
company is facing a problem with customer churn (customers leaving the
service). Your task is to build a Decision Tree model using machine
learning to predict whether a customer is likely to churn based on their
features such as monthly charges, contract type, and tenure.
Once the model is trained, you need to visualize the structure of the
decision tree to help the business team understand how the decisions are
made.
The dataset is not provided, so you will first need to create a simple
dataset with features like tenure, monthly_charges, and contract_type, and
whether the customer has churned (churn). After training the decision tree,
plot the tree structure to visualize how decisions are made.
Use Python to create the dataset, train the model, and visualize the tree
structure.
【Data Generation Code Example】
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create a simple dataset for customer churn prediction
df = pd.DataFrame({
    'tenure': [1, 3, 4, 10, 12, 24, 36, 48, 60],
    'monthly_charges': [20, 30, 40, 50, 60, 70, 80, 90, 100],
    'contract_type': [0, 0, 1, 1, 0, 1, 0, 1, 1],
    'churn': [1, 0, 0, 1, 0, 1, 0, 1, 0]
})

# Split the data into training and testing sets
X = df[['tenure', 'monthly_charges', 'contract_type']]
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
【Diagram Answer】

【Code Answer】
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Create a simple dataset for customer churn prediction
df = pd.DataFrame({
    'tenure': [1, 3, 4, 10, 12, 24, 36, 48, 60],
    'monthly_charges': [20, 30, 40, 50, 60, 70, 80, 90, 100],
    'contract_type': [0, 0, 1, 1, 0, 1, 0, 1, 1],
    'churn': [1, 0, 0, 1, 0, 1, 0, 1, 0]
})

# Split the data into training and testing sets
X = df[['tenure', 'monthly_charges', 'contract_type']]
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Train the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Visualize the decision tree
plt.figure(figsize=(10, 8))
plot_tree(clf, filled=True,
          feature_names=['tenure', 'monthly_charges', 'contract_type'],
          class_names=['No Churn', 'Churn'])
plt.title('Decision Tree for Customer Churn Prediction')
plt.show()

In this exercise, we aim to solve a customer churn prediction problem using
a Decision Tree model.
First, we generate a simple dataset representing customer attributes such as
tenure (how long the customer has been with the company),
monthly_charges (how much they are paying monthly), and contract_type
(whether they are on a month-to-month contract or a long-term contract).
The target label is whether or not the customer has churned (churn), which
is a binary value (1 for churn, 0 for no churn).
Next, we split the dataset into training and testing sets using train_test_split
so that the model can be trained on one portion of the data and evaluated
on another. A held-out set does not prevent overfitting by itself, but it
lets you detect it and check how well the model generalizes to new data.
With only nine rows, the split here is purely illustrative.
We then train a Decision Tree Classifier, which works by recursively
splitting the dataset into smaller subsets based on the feature values. This
process forms a tree-like structure where each internal node represents a
decision on a feature, and each leaf node represents a prediction (either
churn or no churn). Decision Trees are intuitive and easy to interpret,
making them a popular choice in business settings.
The final step is to visualize the tree using plot_tree. This function from
sklearn.tree allows us to create a graphical representation of the decision
tree. The visualization shows how the model makes decisions based on the
features and helps stakeholders understand the decision-making process.
Each decision node in the tree splits based on a threshold in one of the
features, and the leaves represent the final classification (whether the
customer will churn or not). Because filled=True is set, each node is colored
by its majority class (churn or no churn), and a more intense shade indicates
a purer node.
This visualization is an essential part of the process because it helps non-
technical team members understand the model's logic and provides
transparency in how decisions are made.
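When a plot is inconvenient to share, the same tree can also be printed as plain text. The sketch below is a minimal example that assumes the dataset and split from the Code Answer and uses export_text from sklearn.tree; the variable names feature_names and rules are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Same toy dataset as in the exercise
df = pd.DataFrame({
    'tenure': [1, 3, 4, 10, 12, 24, 36, 48, 60],
    'monthly_charges': [20, 30, 40, 50, 60, 70, 80, 90, 100],
    'contract_type': [0, 0, 1, 1, 0, 1, 0, 1, 1],
    'churn': [1, 0, 0, 1, 0, 1, 0, 1, 0]
})

feature_names = ['tenure', 'monthly_charges', 'contract_type']
X = df[feature_names]
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Each indented line is one split; lines containing "class:" are leaves
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

The printed rules read top to bottom like nested if/else conditions, which can be pasted into an email or a report where an image is not practical.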
【Trivia】
Decision Trees are prone to overfitting, especially when they are deep and
contain many branches. To prevent this, you can prune the tree, which
involves trimming parts of the tree that do not provide significant predictive
power.
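One way to prune in scikit-learn is cost-complexity pruning through the ccp_alpha parameter of DecisionTreeClassifier. The sketch below compares an unpruned tree with a pruned one on synthetic data. The dataset and ccp_alpha=0.01 are assumptions picked for illustration; a suitable value would normally be chosen by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a synthetic dataset stands in for real churn data
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Unpruned tree: grows until the training leaves are pure
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)

# Pruned tree: subtrees whose benefit is below ccp_alpha are collapsed
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=0.01)
pruned_tree.fit(X_train, y_train)

for name, tree in [('full', full_tree), ('pruned', pruned_tree)]:
    print(name,
          'leaves:', tree.get_n_leaves(),
          'train acc:', round(tree.score(X_train, y_train), 3),
          'test acc:', round(tree.score(X_test, y_test), 3))
```

A large gap between training and test accuracy for the full tree, and a smaller gap for the pruned one, is the usual sign that pruning has reduced overfitting.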
87. Visualizing Forecasting Errors Using Machine Learning
Importance★★★★☆
Difficulty★★★☆☆
You are tasked with predicting the sales of a retail store over time. The
client expects a machine learning model that not only forecasts future sales
but also visualizes the accuracy of these predictions. Your objective is to
build a simple time series forecasting model, evaluate its performance, and
create a visualization showing the prediction errors.
The sales data will be artificially generated, and you are expected to
visualize the errors between actual sales and predicted sales. The model
will be trained using a linear regression algorithm, and the errors should
be plotted over time.
Your goal is to:
Generate a time series dataset with random sales data.
Train a linear regression model on this data.
Predict future sales.
Plot the actual sales vs. predicted sales, and the error between them.
【Data Generation Code Example】
import numpy as np
import pandas as pd

# Generate a time series of sales data
dates = pd.date_range(start="2023-01-01", periods=100, freq="D")
sales = (np.sin(np.linspace(0, 3.14, 100)) * 50
         + np.random.normal(0, 10, 100) + 200)

# Create a DataFrame with the generated data
df = pd.DataFrame({"Date": dates, "Sales": sales})
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Generate a time series of sales data
dates = pd.date_range(start="2023-01-01", periods=100, freq="D")
sales = (np.sin(np.linspace(0, 3.14, 100)) * 50
         + np.random.normal(0, 10, 100) + 200)

# Create a DataFrame with the generated data
df = pd.DataFrame({"Date": dates, "Sales": sales})

# Convert the 'Date' to numerical format for modeling
df['Date_Num'] = (df['Date'] - df['Date'].min()).dt.days

# Split the data into training and testing sets (no shuffling)
X_train, X_test, y_train, y_test = train_test_split(
    df['Date_Num'], df['Sales'], test_size=0.2, shuffle=False)

# Reshape the data for Linear Regression
X_train = np.array(X_train).reshape(-1, 1)
X_test = np.array(X_test).reshape(-1, 1)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict future sales
y_pred = model.predict(X_test)

# Plot the actual vs predicted sales and the errors
plt.figure(figsize=(10, 6))

# Plot actual sales
plt.plot(df['Date'][-len(y_test):], y_test, label="Actual Sales",
         color="blue")

# Plot predicted sales
plt.plot(df['Date'][-len(y_test):], y_pred, label="Predicted Sales",
         color="red")

# Plot error as a separate line
errors = y_test - y_pred
plt.plot(df['Date'][-len(y_test):], errors, label="Prediction Error",
         color="green", linestyle="--")

plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Actual vs Predicted Sales and Forecasting Errors')
plt.legend()
plt.grid(True)
plt.show()

This exercise walks you through building a time series forecasting model
and visualizing the errors in the predictions.
First, we generate synthetic sales data using a sine wave and random noise
to simulate realistic sales fluctuations. The pd.date_range function is used
to generate daily timestamps, and sales are modeled as a mix of sinusoidal
trends and random noise.
Once we have the data, we convert the Date column into numerical values
(days since the first date) so that the linear regression model can use it
as a feature. Time series data typically needs to be numerically represented
for regression models.
We then split the data into training and testing sets. Since time series
data relies on temporal order, we do not shuffle the data, as this would
break the temporal correlation. The training set consists of the earlier
part of the data, and the test set consists of the more recent part.
The model is a simple LinearRegression from scikit-learn. It is trained on
the date (in numerical format) and sales values. After training, we predict
sales for the test period.
The key focus is the visualization of errors. After plotting both the actual
and predicted sales, we compute the difference (error) between actual and
predicted values. This error is plotted as a dashed line to highlight where
and how much the prediction deviates from reality.
This exercise not only emphasizes forecasting but also visual error
analysis, which is critical when evaluating the performance of time series
models. Understanding these errors helps improve models and provides
insights into trends that the model may have missed.
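The error line in the plot can also be summarized in a few numbers. The sketch below is a minimal example, assuming the same synthetic data and chronological 80/20 split as the exercise, that computes the mean absolute error (MAE), the root mean squared error (RMSE), and the mean error (bias) on the test period. The seeded generator and the names split, mae, rmse, and bias are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Assumption: a seeded generator is used only to make the run repeatable
rng = np.random.default_rng(0)
dates = pd.date_range(start="2023-01-01", periods=100, freq="D")
sales = np.sin(np.linspace(0, 3.14, 100)) * 50 + rng.normal(0, 10, 100) + 200
df = pd.DataFrame({"Date": dates, "Sales": sales})
df['Date_Num'] = (df['Date'] - df['Date'].min()).dt.days

# Chronological split: the first 80 days train the model, the last 20 test it
split = int(len(df) * 0.8)
X_train, X_test = df[['Date_Num']][:split], df[['Date_Num']][split:]
y_train, y_test = df['Sales'][:split], df['Sales'][split:]

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Error = actual - predicted, the same quantity plotted in the exercise
errors = y_test.to_numpy() - y_pred
mae = mean_absolute_error(y_test, y_pred)            # average size of the error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # penalizes large errors more
bias = errors.mean()                                 # sign shows over/under-forecasting

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"Bias: {bias:.2f}")
```

A bias far from zero means the errors lean in one direction, which suggests the straight line is missing a trend in the data rather than just being hit by noise.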
【Trivia】
Did you know that linear regression, one of the simplest machine learning
algorithms, has been around since the 1800s? The least squares method behind
it was published by Adrien-Marie Legendre in 1805 and was also developed by
Carl Friedrich Gauss. The term "regression" comes from Francis Galton, who
used it in the 1880s to study the inheritance of traits, and his work was
later expanded by Karl Pearson. Despite its simplicity, it is still widely
used today for various types of predictive modeling, including time series
forecasting!
88. Visualizing the Results of a Machine Learning Classification Task
Importance★★★★☆
Difficulty★★★☆☆
You are working for a company that wants to predict customer churn based
on user data. The goal is to train a machine learning model to classify
whether a customer will churn or not, based on certain features. After
training the model, the company wants to visualize how well the model
performs using a confusion matrix and classification report.
Your task is to:
Create sample data for customer churn prediction (e.g., 'age',
'monthly_spending', 'years_with_company').
Train a machine learning model on this data.
Visualize the results using a confusion matrix and show the model’s
performance through a classification report.
【Data Generation Code Example】
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate input data for churn prediction
X, y = make_classification(n_samples=500, n_features=4, n_classes=2,
                           random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
【Diagram Answer】

【Code Answer】
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# Parentheses are required to continue an import list on the next line
from sklearn.metrics import (confusion_matrix, classification_report,
                             ConfusionMatrixDisplay)

# Generate input data for churn prediction
X, y = make_classification(n_samples=500, n_features=4, n_classes=2,
                           random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Train a machine learning model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Display confusion matrix
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.title('Confusion Matrix')
plt.show()

# Display classification report
print('Classification Report:\n', classification_report(y_test, y_pred))

In this exercise, we simulate a customer churn prediction task using a
classification model.
First, the data is generated using the make_classification function, which
creates a synthetic dataset with four features and two classes. We use the
train_test_split function to split the data into training and testing sets,
with 70% of the data used for training the model and 30% for testing it.
Next, we train a RandomForestClassifier, which is a powerful ensemble method
that constructs multiple decision trees and averages their results to make
predictions. After training, the model is used to make predictions on the
test data, and we then evaluate the results by calculating a confusion
matrix and a classification report.
The confusion matrix provides a detailed look at the model's prediction
results, showing how many correct and incorrect predictions were made for
each class. To visualize this, we use ConfusionMatrixDisplay, which allows
us to plot the matrix as a heatmap.
The classification report summarizes additional performance metrics, such as
precision, recall, F1-score, and support, for each class. These metrics help
us understand how well the model performed overall and for each class
individually.
The RandomForestClassifier model is well-suited for tasks like churn
prediction because it can handle complex data relationships and reduce
overfitting.
This exercise helps beginners practice basic machine learning tasks like
training a model, making predictions, and evaluating results visually and
statistically.
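The numbers in the classification report follow directly from the four cells of the confusion matrix. The sketch below is a minimal worked example, assuming the same synthetic data as the exercise, that computes precision, recall, F1-score, and accuracy for the positive class by hand. The random_state passed to the classifier is an assumption added for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, n_classes=2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Assumption: random_state is fixed so the matrix is the same on every run
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels 0/1 the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted churners, how many churned
recall = tp / (tp + fn)      # of the actual churners, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

The hand-computed precision and recall should match the row for class 1 in classification_report, which is a quick way to confirm you are reading the matrix in the right orientation.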
【Trivia】
Random Forest classifiers work by building multiple decision trees during
training and combining their outputs to improve the model's predictive
accuracy. Each tree is trained on a bootstrap sample of the data, a technique
known as "bagging," and considers a random subset of features at each split.
This helps to reduce variance and avoid overfitting, making Random Forests
one of the most widely used models for classification tasks in various
industries.
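The combining step can be checked directly. The sketch below assumes scikit-learn's RandomForestClassifier, which exposes its fitted trees as estimators_, and shows that averaging the trees' class probabilities reproduces the forest's own predict_proba. n_estimators=50 and the random_state values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, n_classes=2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_train, y_train)

# Class probabilities from every individual tree: shape (trees, samples, classes)
tree_probas = np.array([tree.predict_proba(X_test)
                        for tree in forest.estimators_])

# Averaging over the trees gives the ensemble's probabilities
averaged = tree_probas.mean(axis=0)

print('Number of trees:', len(forest.estimators_))
print('Matches forest.predict_proba:',
      np.allclose(averaged, forest.predict_proba(X_test)))
```

Individual trees often disagree on borderline customers; averaging smooths those disagreements out, which is where the variance reduction comes from.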
Chapter 4 Request for review evaluation

Thank you for reading through to the end.


I sincerely hope this collection of 100 exercises has helped deepen your
understanding of machine learning with Python.
I designed this book for those who already have a basic grasp of
programming and wanted to tackle more practical, hands-on problems.
The inclusion of source code execution results and detailed explanations is
meant to make the learning experience smoother and more insightful.
I hope this approach worked well for you.
Now, if I may, I’d like to ask for your feedback.
Whether you found this book useful, enjoyable, or even if it fell short of
your expectations, your opinion matters a lot.
It helps me grow as a writer, and more importantly, it helps future readers.
Did you find the explanations clear? Were the examples too challenging or
not challenging enough? If there were areas that felt confusing or
incomplete, I would love to hear about them.
This is how I learn and improve, continuously refining my work based on
what real readers like you experience.
If you are pressed for time, even just leaving a quick star rating would
mean the world to me.
I read every single review and try my best to reflect on them for future
editions.
This is not just about a number, but about understanding how the material
resonates with readers.
If you have any suggestions for future topics or specific areas of machine
learning that you’d like to explore, please feel free to mention those as
well.
I am incredibly grateful that you spent your valuable time with this book.
I genuinely hope that it was a helpful resource for your learning journey.
And if you have any questions or suggestions for improving the book, don't
hesitate to reach out. I’m always eager to hear from readers.
Finally, I look forward to connecting with you in future works.
Thank you so much for your support, and I wish you all the best in your
continued exploration of Python and machine learning.
Appendix: Execution Environment
In this eBook, we will use Google Colab to run Python code.
Google Colab is a free Python execution environment that runs in your
browser.
Below are the steps to use Google Colab to execute Python code.

Log in with a Google account


First, log in to your Google account. If you don't have an account yet,
you need to create a new one.
Access Google Colab
Open your web browser and go to the following URL:
http://colab.research.google.com
Create a new notebook
Once the Google Colab homepage appears, click the "New Notebook"
button. This will create a new Python notebook.
Enter Python code
Enter Python code in the cell of the notebook. For example, enter the
following simple code:
print("Hello, Google Colab!")
Run the code
To run the code, click the play button (▶️) on the left side of the code
cell or select the cell and press Shift+Enter.
Check the execution result
If the code runs successfully, the result will be displayed below the cell.
In the above example, "Hello, Google Colab!" will be displayed.
Save the notebook
To save the notebook, select "Save to Drive" from the "File" menu at the
top of the screen. The notebook will be saved to your Google Drive.
Install libraries
If you need any Python libraries, enter the following in a cell and run it:
!pip install library-name
For example, to install numpy, do the following:
!pip install numpy
Open an existing notebook
To open an existing notebook, select the notebook from Google Drive or
choose "Open Notebook" from the "File" menu in Colab.
These are the steps to run Python code on Google Colab. With this, you can
easily use a Python execution environment in the cloud.
